Stan++/Stan3 Preliminary Design

#36

This is one of the reasons why I think people who have never personally written any compiled code before are not in a good position to learn the current Stan language, although this proposal perhaps makes this somewhat better. If you know some C++, then in retrospect it kind of makes these that the things in the data block are what the constructor needs and the log_prob method of the class takes in parameters and evaluates the log-kernel of the density function. But from the perspective of someone who only knows R, Python, etc. a Stan program does not look like a function that would be written in those languages:

foo <- function(arg1, arg2, ...) {
# do stuff
return(output)
}


It isn’t simply “Why does Stan require types, sizes, and semicolons?”, it is stuff like why

data {
int N;
int K;
}


isn’t written as

data {
int N = 1000;
int K = 5;
}


and the answer that N and K are things the constructor is expecting to be passed to it is not really satisfying for someone who does not know what a constructor is.

Anyway, having a submodel may help a bit but it is still going to be weird to them in some ways, i.e. seeing local data objects being used that are not passed to the function. That is not really an argument for or against the proposal, but a reminder that some of our workshops that go

1. Install a C++ toolchain
2. Do the 8 schools example
3. ???
4. Profit

make no sense to the people in the workshop.

#37

Thanks Bob for the thoughtful reply, I’ll try to clarify a bit:

The implicit passing of data I refer to is that when you have a model like:

model gaussian_process() {
int<lower = 0> N;
vector[N] times;
real length;
real sigma;
vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

gp = gaussian_process();
y ~ gp.y;


Your specification seems to assume, that this will automatically let me specify gp.times as data for the main model. This IMHO has multiple issues:

1. To understand what data/params/… a submodel provides you need to read its full code
2. You have to be careful when updating a submodel that’s in a library, as basically everything in the submodel is visible to the user (and by Hyrum’s law someone will depend on it)
3. If you use multiple copies of a submodel it is impossible to say which data is shared, e.g.:
gp1 = gaussian_process();
gp2 = gaussian_process();


requires me to specify both gp1.times and gp2.times as data for the main model, there is no way to have a single data item times and pass it to both gp1.times and gp2.times. I would prefer that the data passing be made explicit.

I believe that allowing two quite different programming styles has a huge cost as people need to understand both to be able to use the forums and other online advice effectively. I’d rather have only one way of writing things, even if it is the one I personally like less.

And one more philosophical point: I don’t think it is necessarily a bad sign when people come to forums with simple questions about basic misunderstandings (e.g. “Why do you have to declare data separately”, typing issues etc.). Consider two scenarios:

1. User tries to write a simple model, but struggles as their mental model of how Stan works is incorrect. User asks a simple question on the forums which is easy to understand and gets quickly answered. User is now better equipped to write complex models.
2. User tries to write a simple model, their mental model of Stan is incorrect, but they succeed because Stan performs a bit of magic behind the scenes. User proceeds to a more complex model and fail. They post a complicated and long model code on the forums. The question is hard to understand and takes more time to answer.

The R language is IMHO a typical example, where users succeed with simple things and then post impenetrable code on Stack Overflow asking for help.

I am not sure that this particular proposal moves us much towards the second scenario, but wanted to say that optimizing the language for “Hello world” examples is IMHO not very reasonable. It is IMHO more important that worst practices are hard than that best practices are easy.

Stan also has the advantage that for simple use cases, you have rstanarm and brms, so there isn’t really a reason to strongly cater to users that will only ever write simple models.

Well, that’s just like my opinion. You do you :-)

#38

Yes, that’s the point. You write the program as if the pyro.sample statement really does sampling, but it doesn’t. In reality, complicated algorithms under the hood are approximating posterior distributions, but the details of that computation are abstracted away from the specification of the model.

You can think of Stan programs that way as well, at least in the common case in which the program describes a generative model. You can pretend that ~ means “sample” as a way of understanding the generative model, as long as you’re aware that Stan isn’t really sampling from those distributions under the hood.

I don’t think this can be correct. e.g., In the example at top, you could report a precision tau_y = 1 / sigma_y**2 in the generated quantities. (Maybe I am misunderstanding your point?)

Ah, but almost every optimization in an optimizing compiler depends on the solution to a Turing-complete problem. The way the handle it is to come up with a conservative analysis. It might miss opportunities for optimization, but never in theory changes the semantics of a program.

#39

At least for the data, that’s how you define object members in OO languages like C++ and Java. But iI agree that the current Stan language is challenging for a lot of people, both programmers expecting one thing (something more composable like a programming language) and statisticians expecting another (though I’m not quite sure what that is—I think something without types like R would be a start, but we’re not quite to where we can do full automatic type inference because of overloading. We migth be able to work it out.

This isn’t part of the proposal.

There’s always something in the ??? bit. The classes we’ve done where statisticians have shown up have gone well. The ones done to more general audiences have been a disaster in my experience. I want to cut out trying to do those until we’ve figured out a more general intro to probability and stats and maybe some programming.

@martinmodrak Thanks for all the thoughtful comments! My reply may be even longer than your post as I’m going through several ways of coding things.

Let me rewrite your model a bit to make it well formed according to your model spec.

model gaussian_process() {
int<lower = 0> N;
vector[N] times;
real length;
real sigma;
vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

model gp = gaussian_process();


I’m not sure what you intended with the y ~ gp.y as that wouldn’t be a thing. Instead, just constructing the GP itself as above would introduce five variables, gp.N, …, gp.y. I wouldn’t say it’s automatic so much as I’d say when you create a submodel, the submodel’s elements are also created. These created elements are then accessed as gp.N, etc. But it’s the call to gaussian_process() that creates them.

You only need to read the declarations. You already have to do that now. So I don’t think it’s so much the number of things you have to read so much as that it’s not clear what is data and what is a parameter. The plan is to allow that to be explicit, as I showed in the very first example. So if you wanted, you could do this:

model gaussian_process() {
data int<lower = 0> N;
data vector[N] times;
param real length;
param real sigma;
data vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

data int<lower = 0> N;
data vector[N] times;
param real length;
param real sigma;
model gp = gaussian_process()


Now it looks almost like an existing model in terms of the order of declarations. And it makes it clear there are priors missing for length and sigma, assuming those are the free variables here. Those could either go on the outside, or on the inside of the submodel depending on how much you wanted to encapsulate in the submodel. The point is to start reasoning about the submodel as a unit. If the submodel is coherent, for example in defining a non-centered hierarchical prior or a full GP, then it can be easily used as a component in other models. That’s the main goal here.

It’s all encapsulated, so this is no more an issue than the name a developer used for a variable in an object like vector in C++. The user has named the submodel gp and all names that get introduced are relative to that object, and used as gp.X. This proposal actually makes this ability better. As is, all the names are completely global. So if you use four hierarchical models, you make yourself crazy writing down mu_beta_age and sigma_beta_age and the like for all the coefficients. This gives all the names structure.

Not quite. Nothing is shared between gp1 and gp2. In order to share something between these two, they’d need arguments and someting in a higher scope passed down to them. In order to share sigma, it all has to be done explicitly, just like in any other programming language:

model gaussian_process(real sigma) {
data int<lower = 0> N;
data vector[N] times;
param real length;
data vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

real<lower = 0> sigma ~ normal(0, 1);  // give it a prior
model gp1 = gaussian_process(sigma);
model gp2 = gaussian_process(sigma);


This is a serious concern and one I worry about in trying to maintain backward compatibility. The question’s whether people are going to keep wanting to write things both ways, or if we can wean them into new styles. This is essentially what happens with C++ — there are lots of ways to do things, but there are modern and antiquated ways to do them, with modern code being written in a more modern style.

If you think of (a) allowing data, parameter as qualifiers, as we’ve proposed, then all you need to do is think of data { ... } as way to specify the data qualifier on a group of variables.

So I hope it’s not that different in terms of style. But it will be an ongoing concern.

To me, it depends how much effort they put in before they got there. The ones that get me are the two extremes: “I spent two days wrestling with X” for something we could’ve answered quickly and the “here, I just guessed and it didn’t work, help me write my code one debug line at a time”. I also worry that when people show up to the forums with simple blocking issues that stop them from proceeding (as opposed to, say, some misunderstandings about operational issues), then I worry that really represents many many more people who got frustrated and walked away.

It’s very hard to build a tool that works for beginners and for experts. We’ve largely focused on the experts and are having some in-person discussions about how much we want to continue that focus versus try to address some more beginners issues.

It’s not just R. Anything involving code is like this. You can get a little ways by debugging by just apprehending everything at once, then you hit a complexity barrier that needs to be avoided through more engineering. It’s hard.

That’s just in R, so it’s invisible to a subset of our users. And brms and rstanarm cater to some particular simple classes of models (and some quite complex ones), but not all of them.

Not to speak for @betanalpha, but he and I have always been in agreement that the basic Stan language is actually a good tool for teaching people about the components of a Bayesian model. Partly because it’s easy to read a Stan model. I worry that this is going to change if we start writing more code-like examples. But they may actually be easier. I sort of prefer the hello world example:

param
real<lower = 0, upper = 1> theta ~ beta(1, 1);
data
int<lower = 0> N;
data
int<lower = 0, upper = 1> y[N] ~ bernoulli(theta);


to

data {
int<lower = 0> N;
int<lower = 0, upper = 1> y;
}
parameters {
real<lower = 0, upper = 1> theta;
}
model {
theta ~ beta(1, 1);
y ~ bernoulli(theta);
}


but I see the appeal of both versions. The former is short and follows the geneative story and links the declarations with the uses of the parameters and modeled data variables. The latter very clearly circumscribes the variables and their roles.

If it’s not, you should cite references. :-)

#40

Exactly how we do think about it. But it can be very confusing to users.

Given the parameter sigma, the target is independent of tau. So you can move it to generated quantities. If you used tau somewhere else in the model, say by writing y2 ~ normal(0, tau_y), then you wouldn’t be able to do that. But if you did it where the y2 itself isn’t data, say to do posterior predictive replications, then you can again move it to generated quantities because now the target doesn’t depend on y2 or tau_y given sigma. And when I say “depend” here, I mean in terms of whether it’s getting any useful information about the parameters. Let me write out a full example:

parameter real mu ~ normal(0, 1);
data int N;
data vector[N] y ~ normal(mu, 1);
data int N_sim;
parameter vector[N_sim] y_sim ~ normal(mu, 1);


In this case the parameters are mu and y_sim and the posterior factors as

p(\mu, y^{\mathrm sim} \mid y) \\ \ \ \ \ { } = p(\mu \mid y) \cdot p(y^{\mathrm sim} | \mu) \\ \ \ \ \ { } \propto p(\mu) \cdot p(y \mid \mu) \cdot p(y^{\mathrm sim} | \mu).

BUGS, for example, can do this automatically, based on analysis of the Markov blanket of variables.

Exactly. So the question is whether we can detect useful cases locally.

An alternative is to consider a directed graphical modeling sublanguage which is no longer Turing equivalent, but lets you calculate a bunch of this stuff. I think we might also usefully be able to do a lot of hotspot runtime unfolding based on the data that’s actaully observed.

#41

Thanks for the response. I understand that this is hard and I am in the comfortable position of having an opinion but not having any responsibility for the decision :-)

However, I think my main point still got lost in translation, giving it another try :-)

But how does the compiler determine that times in the main model maps to gp.times (the submodel?). That is the magic that worries me. I’d rather have to write something like

model gp = gaussian_process(N, times, length, sigma, y);


or even with named parameters to avoid mismatches in long param list.

model gp = gaussian_process(
N = N,
times = times,
length = length,
sigma = sigma,
y = y
);


This way it is obvious what gets passed where and I can be precise what is shared and what is different in each submodel. For example there is now a clear distinction between:

model gp1 = gaussian_process(N, times, length, sigma, y1);
model gp2 = gaussian_process(N, times, length, sigma, y2);


and

model gp1 = gaussian_process(N, times1, length1, sigma, y1);
model gp2 = gaussian_process(M, times2, length2, sigma, y2);


Hope that clarifies my concern :-)

#42

I don’t think the proposal aids in this, but a new compiler implementation will. I tried to convince everyone over the week that we could do a lot in this realm, but I don’t think I managed. The idea is basically the same idea as in this type of CRDT: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#Operation-based_CRDTs

Summarized, each AST subtree ending with a target += can be analyzed for information dependencies (both in and out) and if we can conservatively partition the model into subtrees that don’t interact, we can parallelize and then do all of the target +=s in whatever order we want.

#43

Acrually R users should be quite familiar with something like

gp = gaussian_process()


without having to pass everything via argument. For an R user this is similar to what happens when variables exist in the global environment or in the environment of a wrapper function around gaussian_process(). However, with a few exceptions, that’s considered bad practice in R, so they will be familiar with it but used to being discouraged from doing it.

#44

That example I had didn’t make sense. I didn’t mean to cut all the declarations out, it should’ve been this:

model gaussian_process() {
data int<lower = 0> N;
data vector[N] times;
param real length;
param real sigma;
data vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

model gp = gaussian_process();


or

model gaussian_process(int N, vector times, real length, real sigma) {
data vector[N] y ~ multi_normal(0, cov_exp_quad(times, length, sigma));
}

data int<lower = 0> N;
data vector[N] times;
param real length;
param real sigma;
model gp = gaussian_process(N, times, length, sigma);


I know this kind of thing is possible, but I have no idea how easy it is to do or how general the heuristics we can come up with will be.

That wasn’t my intent! I’m not a fan of R’s dynamic lexical scope!

#45

For the most part me neither!

#46

I prefer the version where arguments are used to pass this stuff in. It does mean sometimes things have to get passed down deep but that’s a problem every language deals with.

#47

We need something like this urgently for Stan from my view! Stan is amazing and HMC is capable to attack really large problems, but only if you manage to use the raw computing power of many cores. With more knowledge on the model structure we should be able to figure out far more on what are independent computations which can be used to automate parallelism. My thinking was along very similar lines. This the target+= point in any Stan program is an obvious target for these optimizations.

If we cannot do a fully automatic thing, then maybe a syntax like

target += ... some clever syntax to specify independent contributions...

could be translated to a parallel execution facility. How that would look exactly is not yet clear to me.

Sebastian

#48

Maybe first thing would be to move AD graph to some network library object and then doing the analyses on that? Conditional statements are probably going to be tricky unless one just marks them as one node.

Would this help to speed up recursive algorithms, e.g. state-space go (Kalman filter stuff by Simo Särkkä et la)?

#49

I think we will need ways to allow users to manually override / specify parallelism-related ideas; it might help that I have been proposing adding a general metadata annotations system to the Stan 3 language to allow these sorts of AST decorations (e.g. something like @independent (or whatever word) on the target += statement; syntax TBD).

I have a bunch of other thoughts on parallelism that I should probably save for another thread, but I’m a fan of the kind of declarative SQL/Spark DataSet/DataFrame style for specifying what you want and then having the query engine figure out how to spread it out across machines. They’re basically just parallel collections of rows with operations like map and groupBy.

I think this will be much much easier to do at the Stan AST level than at the AD graph level, especially in our AD system. What is a network library object?

#50

I just meant that we could have a subprogram / external library doing the analysis part (connected graphs etc)

#51

That would make sense if we did static expression graph expansion like TensorFlow (before their eager mode) or symbolic differentiation like Theano. As is, it’s not worth optimizing the expression graph compared to just evaluating it.

To be able to expand Stan statically that way through autodiff, we’d need to forbid any conditioning on parameters and also remove any underlying functions that do this internally.

On the other hand, expanding the Stan program (the thing represented by the AST) is something that’s definitely worth doing.

I think that shold be the priority. Then we can think about how to automate it. I want to keep thinking of the Stan language itself more as a programming language.

I think that matches our use-case well.

We probably don’t want to start writing our own graph algorithms, but most of them aren’t that difficult to first order.

#52

For parallelizing it would just make sense to treat everything downstream of a parameter-based conditional as serial. Otherwise we’d be giving up the ability to auto-diff numerical algorithms which is too much of a cost to pay.

#53

I think it’s fair that we move the discussion of parallelization to a new thread, as it’s not directly related to the current language proposal.

Same for prior/likelihood separation which is not part of the current proposal and worthy of its own thread.

People have argued that the blocks are obstructive because they don’t allow statements like real x ~ normal(0, 1), but let me step back and ask what is that statement even supposed to mean? To the left of the ~ we introduce a variable x along with its type, cool, but what is the right hand side specifying?

It doesn’t specify the final distribution for x because we eventually condition on the observed variables. Okay, so let’s ignore the conditioning that happens later and focus just on the joint distribution we have before we separate out variables that are uncertain and variables that have specific values. In this case the statement might read “x is marginally distributed as a unit normal”, i.e. if we integrate out all of the other variables then x has the specified distribution. Except we can’t always say that – if we instead had real x ~ normal(mu, 1) then we can’t marginalize out mu and maintain the same interpretation. So we’d have to say something like “conditioned on the values of any variables on the right hand size, x marginally has the given distribution”.

But we can’t even say that because there’s no restriction on having terms later on like x ~ normal(5, 2) which modifies the distribution for x. So at the very best real x ~ normal(mu, 1) would read as “conditioned on a value for mu, and assuming x doesn’t show up anywhere else in the program on the left hand of a ~ or a |, integrating out all of the other variables but x and mu will leave x with the marginal distribution normal(mu, 1)”. It’s a bit of a mouthful.

The problem here is that the intuition for something like real x ~ normal(mu, 1) comes from thinking about graphical models where the statement about the marginal behavior conditioned on the child variables cannot be modified later on. In the context of purely graphical models, which everyone has been using for their examples of why real x ~ normal(mu, 1) is so great, the notation is solid but outside of that context the intuition falls apart and can be more dangerous than useful.

I personally don’t see how the notation can be rectified with a language more expressive than graphical models. If people want to drop back to graphical models then let’s have that discussion, but without that restriction real x ~ normal(mu, 1) is nowhere near as clean as has been presented.

Ultimately I think the subtle tension here is due to the fact that we’re specifying a function, and parameters and data are not intermediate variables and hence not amenable to the programming practices for intermediate variables such as “define near where you declare”. Serious question – is there any language that defines functions implicitly through annotated variables? Especially as languages go more functional and the theoretical perspectives more categorical, the signature of declared functions seems to be becoming more prominent in language designs (correct me if I’m wrong) as reasoning about the inputs and outputs before worrying about the implementation is an extremely helpful abstraction.

A consistent argument with the current Stan language is that it’s ungainly for large programs. To be fair I disagree here (and I often write programs with hundreds of lines) but in any case if one agrees that the current language is ungainly then compositionality would be a natural next step from the functional/categorical perspective. The problem is that to maintain the same expressiveness of the current language those submodels wouldn’t be self-contained – multiple submodels could modify the same variables – so the compositional structure just wouldn’t be there. We’d have hidden state, side effects, leaks, whatever you want to call it, unless we restricted the language to graphical models. How do we maintain all of the desired features of compositionality without changing the scope of the modeling language? Am I missing something here?

Then there’s the added subtlety that the compositionality is natural for specifying a joint distribution through a graphical model, but that same compositional structure no longer holds once we start conditioning on variables. I think this the reason for some of the questions/concerns about the global scope of the data variables – I’m not sure compositionality of the joint is compatible with declaring what variables we condition the joint on afterwards.

Again all of this issues have clean resolutions within the scope of graphical models (reducing to the BUGS model where you don’t specify conditioning variables in the program but rather by passing in values when you run that program) but I don’t think there are anything but limited heuristics outside of that scope, and that worries me in terms of language design.

I can’t figure out any way of having compositionality without having two stages of parsing. The first stage compiles components models together into a joint model amenable to global evaluation along with the list of all input variables. After this the global model is available for viewing (i.e. we generate a model block) and those input variables are separated out into bound (data) and unbound (parameters) variables. Then we run the second stage of the parser that transpiles into C++. This might also offer the opportunity to have a user-focused UX (where they just build the second program directly) verses a developer-focused UX (where the compositional model is specified).

Even then we’d have the odious task of figuring out a good UX for specifying the bound variables (as previously discussed, relying on what’s defined in the input data file is dangerous) and dealing with conflicts between the specification and what variables were defined in the previous model. Requiring that the bound variables be declared in the program itself is extraordinarily powerful.

#54

Wouldn’t this be just as easily resolved by changing data to something like external data and transformed data to something like internal data?

#55

I’m resorting this in terms of order of importance (to me, of course).

Therein lies the main tension we’re wrestling with, in my opinion.

It’s powerful, but its power comes from imposing limits. And when you write bigger programs, you run up against those limits.

Computer programs wind up being made up out of gazillions of little blocks. What we’re doing now is like forcing all the I/O to be fed through a single big main() function and restricting a no I/O policy on the subunits.

I’m thinking about this all very operationally. There’s a data context (mapping of variables to values [sorry, I’m pre-repeating myself as you’ll see later]) and a parameter context (ditto), and the data declarations read from the data context and the parameter declarations read from the parameter context. That’s how it’s implemented under the hood with readers in the constructor for data and in the log density function for the parameters.

That’s why I’m particularly concerned about separating the declarations of various data variables and parameter variables so that it’s not so clear any more what the function is. The way things are now, the signature of the data -> parameters -> log density function is laid out very clearly (modulo the implicit unconstraining transform and log Jacobian).

But I don’t see how that follows.

I’ve been looking at other probabilistic programming languages recently and I find they present the languages operationally, as if pyro.sample or pyro.observe are simple Python functions. But then they perform variational inference, which clearly isn’t doing anything like naive sampling. You can find this thinking encapsulated in a section title in the Pyro tutorials, “Inference in Pyro: From Stochastic Functions to Marginal Distributions.”

I have the same objections you (@betanalpha) do to this confusion. In Stan, we just say the program’s directly calculating a log density up to a constant. Looked at that way, it looks less like a probabilistic programming language and more like a function. And as I keep emphasizing, that’s still how I think of a Stan program—as a function from data to parameters to a log density.

The original motivation for using ~ came from BUGS, which used it to declare a graphical model. I’m not sure where the original motivation came from in stats, where people use it to lay out components of generative models. I borrowed it from BUGS with a different interpretation as an increment to the log density.

For the new compound declare-distribute syntax, the motivation for me comes not from BUGS, but from the compound declare-define statements in languages like C++. It’s just a convenient way of combining the declaration and use of a variable. Philosophically, I’m not trying to distinguish between

real x;
...
x ~ normal(5, 2);


and

real x ~ normal(5, 2);


other than that in the latter I don’t have to scan so far to find the use of the variable.

That’s what the spec will lay out. How it consistently computes a log density function.

Closures behave something like this. They grab their current context and produce a function. They’re essentially mappings from contexts (variable to value maps) to arguments to values.

In logical languages, you usually think about expressions or statements or other syntactic units having free variables, then things like quantifiers like for-all and there-exists bind them off. That intermediate structure with free variables can be thought of denotationally as a function from contexts (mappings from variables to values) into values. That’s essentially what a closure does, too.

I’m not sure what compositionality you’re talking about here. For the Stan 3 design, I think the issue is that the compositionality is one of variable contexts and target value, not of the joint density in any statistical sense. In order to compose program fragments, those fragments must be interpreted as relations between the contexts before and after the statement executes. This can all be compositional with declaring new variables—it’s the usual thing in programming languages to do that. Stan’s unusual in blocking off all the declarations at the top—that’s just because I was lazy in writing the parser, not out of any philosophical commitment to declarations-first.