Stan++/Stan3 Preliminary Design

Bob_Carpenter · May 9, 2018, 5:48pm

We (Matthijs Vákár, Maria Gorinova, Mitzi Morris, Sean Talts, and Bob Carpenter) had a series of meetings last week about new directions for the Stan language based on Maria’s thesis on SlicStan (also a StanCon 2018 and POPL probabilistic programming languages workshop presentation). As a refresher, what’s nice about Maria’s approach in SlicStan is that it allows users to write down generative models in the order in which the variables arise and then is able to infer the block structure based only on which variables are declared to be data. SlicStan also allowed parameters to be introduced in the bodies of functions, which enables things like non-centered reparameterizations to be carried out in a function rather than cut-and-pasted across multiple blocks.

There were several drawbacks to the original SlicStan, including that functions had to be unrolled statically and there were no conditionals or while loops.

There was also no naming convention for parameters introduced inside of functions—they awkwardly had features of both global and local scope.

We met intensively to try to figure out how to do some of this stuff more cleanly. I talked about using something like macros with naming, which the programming language specialists shot down as ridiculously amateurish (they were more polite than that).

Matthijs was also struggling with the fact that these submodels or functions should provide both namespace-like and scope-like properties. Sean suggested pulling out this notion of submodel that can introduce parameters and separating it from ordinary functions.

Matthijs suggested an object-oriented design. Mitzi and I respectfully noted that most of our users don’t know objects from functions and would be very confused at manipulating constructors explicitly.

Throughout, Matthijs was stressing compositionality, which he roughly described as saying that the semantics of a program should be function, and that the bodies of submodels should behave just like models themselves. Basically, we can take the interpretation of a Stan program (or statement) S, written [[S]] to be a function on environments (mappings from variables to values) such that denotations compose for sequencing,

[[S_1; S_2]] = [[S_2]] \circ [[S_1]]

(Sorry, no proper denotation symbol built into LaTeX.)

I wanted arguments to what was essentially going to be a constructor, but I didn’t want any methods. Then it dawned on all of us seemingly at once we could just move the pieces around on the board (literally the board, but the moving was done with chalk and eraser).

To satisfy all this, we came up with a separate submodel design that has aspects of object-oriented design that should be relatively transparent and natural to users. I ran it by Andrew and Aki and they didn’t seem to object. (Nor did they whoop for joy, as I was somehow expecting. Ironically, really hard design problems that eventually get solved wind up looking too easy or even obvious in retrospect. Anyway, the users will thank us because the artifacts will all fit together smoothly.)

Example model

Now onto examples. I’m going to use a linear regression where the coefficients from a hierarchical model (it’s non-standard, but an easy example to illustrate what this is all about).

\mu^{\beta} \sim \mathsf{Normal}(0, 5)

\sigma^{\beta} \sim \mathsf{Normal}(0, 5)

\beta \sim \mathsf{Normal}(\mu^{\beta}, \sigma^{\beta})

\sigma^y \sim \mathsf{Normal}(0, 10)

y \sim \mathsf{Normal}(x \cdot \beta, \sigma^y)

Blocked, unfolded

The first version will explicitly declare variables to be either parameters or data or generated quantities.

param real mu_beta ~ normal(0, 5);
param real<lower = 0> sigma_beta ~ normal(0, 5);
data  int<lower = 0> K;
param vector[K] beta_z ~ normal(0, 1);
      vector beta = mu_beta + sigma_beta * beta_z;
data  int<lower = 0> N;
data  vector[N, K] x;
param real<lower = 0> sigma_y ~ normal(0, 10);
data  vector[N] y ~ normal(x * beta, sigma_y);

data int<lower = 0> N_tilde;
data matrix[N_tilde, K] x_tilde;
gq   vector[N_new] y_tilde  = normal_rng(x_tilde * beta, sigma_y);

Unblocked, unfolded

The beauty of SlicStan is that it can infer the block level for variables by only marking the data variables (it could’ve alternatively marked the parameter variables). Variables that are assigned must be either transformed data or transformed parameters or generated quantities.

We’re going to go one step further and follow the grand BUGS tradition of letting the data tell us which variables are data. This isn’t necessary, but it will open up the possibility of using the same Stan program to run a model forward (from parameters to data) for simulation or in reverse (from data to parameters) for inference. We just erase the optional block qualifiers.

real mu_beta ~ normal(0, 5);
real<lower = 0> sigma_beta ~ normal(0, 5);
int<lower = 0> K;
vector[K] beta_z ~ normal(0, 1);
vector beta = mu_beta + sigma_beta * beta_z;
int<lower = 0> N;
vector[N, K] x;
real<lower = 0> sigma_y ~ normal(0, 10);
vector[N] y ~ normal(x * beta, sigma_y);
int<lower = 0> N_tilde;
matrix[N_tilde, K] x_tilde;
generated quantity vector[N_new] y_tilde  = normal_rng(x_tilde * beta, sigma_y);

Unblocked, folded with submodels

Submodels will be a structure to introduce more model. They may have declared arguments that are passed into them (rather than declared locally). For example, the hierarchical normal prior with non-centered parameterization will look like this:

model hier_normal_ncp() {
  real mu ~ normal(0, 5);
  real<lower = 0> sigma ~ normal(0, 5);
  int<lower = 0> K;
  vector[K] beta_z ~ normal(0, 1);
  vector beta = mu + sigma * beta_z;
}

The rest of the code involves a function for regression, which takes the regression coefficients as an argument (and uses their size to determine the size of the data matrix x):

model linear_regression(vector beta) {
  int<lower = 0> N;
  vector[N, size(beta)] x;
  real<lower = 0> sigma ~ normal(0, 10);
  vector[N] y = normal(x * beta, sigma);
}

Prediction takes two arguments for the coefficients and the error scale. Both are needed to make calibrated predictions.

model lr_predict(vector beta, real sigma) {
  int<lower = 0> N;
  vector[N, sizeof(beta)] x;
  vector[N] y = normal_rng(x * beta, sigma);
}

Then they’re just instantiated to provide a model.

model coeffs = hier_normal_ncp();
model lr = linear_regression(coeffs.beta);
model hat = lr_predict(coeffs.beta, lr.sigma);

Notice how the structure defined for the submodels (like coeffs) allows meaningful naming in regression (coeffs.beta). We’ll go with standard object-oriented syntax. What’s really happening here is that each model is an object and the . is just a member variable accessor. All the variables declared in the submodel structure are member variables, just like all the variables in a top-level Stan program.

We’ve achieved the compositionality that Matthijs was after with an implicit form of his OO design.

To do: inferring GQs

One thing we didn’t figure out is how to infer that a distribution can be implemented with an RNG in the generated quantities block. This is going to be an obstacle to reversing models that involve discrete data because we can’t generate them in Stan because we don’t have discrete parameters. But if we could recognize they could be generated with RNGs as in the generated quantities, we’d be home free.

What we do know is that the parameters \alpha can be generated as generated quantities if the posterior factors as

p(\alpha, \theta \mid y) = p(\theta \mid y) \cdot p(\alpha \mid \theta).

But we’re not sure how to do that given the Turing completeness of Stan’s imperative language for statements.

Limited graphical sublanguage

For graphical models, we can infer exactly this kind of conditional independence. So it provides even more motivation for rethinking graphical model inference.

Architecture

We’re definitely going to build a proper intermediate AST this time. My trying to roll together a C++ well-typed AST (with variant types) was me being more theoretical than practical. Yes, it works, but it’s not easy to work with. Instead, Sean’s convinced us to go to a more s-expression like intermediate representation (think Lisp with the function name being the root node in a tree).

The sky’s the limit while we’re working on paper. Those whacky programming language mavens want to work in F# or OCaml. I have to say that Maria’s thesis code looks awfully clean (it’s in F# and there’s a Lex/Yacc variant that makes the grammars look like pure BNF). We’ve found Spirit Qi to be rather rough going in its demands on C++ and on programmers. But not working in C++ is going to present an obstacle to releasing in R or Python.

We also spent some time talking about redesigning the model concept for better I/O. But that’s another post.

sakrejda · May 9, 2018, 6:35pm

Just wanted to say congrats on slogging through the details, even if it looks easy at the end. I have complaints but they’re minor and I’d like to think through them before blabbing. Anyway; +1

jonah · May 9, 2018, 6:40pm

This all sounds great! I only have one concern:

This is one of the things that currently makes Stan programs so much easier to read than BUGS code, at least for me. I find that it is very confusing not to know what the data is when looking at the code. It’s probably a net positive to make this change given all the possibilities it opens up, but I do think it’s a big loss.

avehtari · May 9, 2018, 7:25pm

I said “Cool”, which said by a Finn is about the same as “Awesome! Whoop, whoop!” by an American.
I was also about to leave to the airport, but during my flight I had a dream about several levels in a hierarchical model.

avehtari · May 9, 2018, 7:35pm

It seems your example is missing something or I don’t understand it. Both linear_regression and lr_predict have vector[N] y, which is ok as they have their own scopes? But instantiation doesn’t have any reference to ‘y’, so how y and y_tilde in the previous codes come in to the play?

In this

matrix[N_tilde, K] x_tilde;
generated quantity vector[N_new] y_tilde  = normal_rng(x_tilde * beta, sigma_y);

automatic detection of what is data seems dangerous, because if user forgets to define x_tilde it is sampled from uniform?

Bob_Carpenter · May 9, 2018, 7:56pm

As you can see from the first example, you can specify what’s data if you want.

If we allow data { ... } to just put a data qualifier on everything in the braces, the whole thing will wind up being backward compatible. Maybe I should’ve led with that :-) Revising.

Thanks for the translation. It’ll be useful for Helsinki this summer.

Yes, it would become a parameter with an implicit (improper) uniform prior. Dangerous, indeed.

Bob_Carpenter · May 9, 2018, 7:57pm

I should’ve said that this is also what would let us do the BUGS missing data trick where some of the entries in a data structure are missing. But without a model for x, it’s pretty useless.

sakrejda · May 9, 2018, 8:04pm

Users will post unreadable models and expect help, what we’ll need is a tool that can label a model.

betanalpha · May 9, 2018, 8:48pm

Compositionality is a fine goal, especially in theory, but it requires compromising other language features and before going too far into a language redesign I think it is critical to understand how best to make this compromise.

Stan abstracts the model from the computation (which already sets us apart from other probabilistic programming languages), leaving the language responsible for specifying a probabilistic model. The language currently does not differentiate between any sense of prior or likelihoods – it defines a joint model and then separates out parameters and data on which the joint is conditioned using the respective blocks. Consequently each Stan program defines the signature of the joint density in the data and parameters blocks and then implements the joint density itself in the model block. Underneath the hood Stan takes care of estimating the resulting conditional expectation values.

I love love love this abstraction. Firstly as @jonah pointed out it makes it extremely clear what the relevant spaces are just by looking at the preliminary blocks, facilitating advice on the forums and other media. It also forces users to think about the scope of their models which sets them up for more robust model building. Finally having to explicitly state which variables are data in one place provides a guard against malformed data files where certain parameters are accidentally missing – as someone who does a fair bit of teaching and consulting and has to deal with other people’s code a lot this feature is amazing.

If we embrace compositionality as a priority, however, then we lose this explicit signature. Instead the data and parameter spaces will be defined only after the full model has been specified and all of the variables can be aggregated together. Corresponding data and parameter blocks can be derived from such a model, but as they would be optional products of a Stan program, and not integral to the program itself, we would have to ask people to send them along or constantly generate them whenever we wanted to get a feeling for the full program.

Perhaps one way of saying this is that the proposed language defines model components and the full model is generated only implicitly during the parsing phase. A representation of the full model could be dumped but again it would not be integral to the program itself. This allows for the desired compositionality at the expense of abstracting the language one level deeper and losing the more explicit picture of the full model.

Finally there is a subtle computational issue here. Compositionality is not consistent with computation. We can’t compute using two submodels and then compose those outputs into the computation of the joint – we have to compose the models and then do computation only afterwards (after the joint has been defined, too). This, I believe, is why inferring generated quantities will be infeasible. Generated quantities are push forwards of the posterior distribution which can calculated in closed form if we use sample-based representations, but only once we’ve done the computation of the joint model. We cannot define generated quantities unambiguously until the joint model has been defined – the computation breaks the natural compositionality here which is why they’re so much more natural to define after the fact, taking the joint model parameter space as inputs.

Sorry, this was a lot but I’ve been thinking about these issues quite a bit, especially as we’re starting to converge on a principled Bayesian workflow and setup really nice pedagogy. To summarize I argue that we have the natural abstractions

(1) model components -> (2) joint model + data -> (3) computation of conditional expectations, generated quantities, etc

Right now the language defines (2) explicitly which makes the scope of the joint model explicit and immediate. I see the proposed language as moving backwards to define (1), leaving the parser to generate (2) in preparation for (3). My opinion is that abstracting (2) away from the language will make it harder to develop and debug models. The benefits may very well out-weight the costs for most people, but the costs have to be considered. And indeed there may yet be a compromise that reduces the compromise.

Matthijs · May 9, 2018, 9:39pm

We were discussing adding a pretty printing function for this purpose, which - provided that you have labelled your data variables as such will “elaborate” all submodels (expand the definitions) and “shred” the resulting Stan 3 program into the blocks of the Stan 2 syntax and pretty print it.
The idea that this is possible is at the core of Maria’s MSc thesis and it can be extended to our current design of Stan 3.
If I understand it correctly, the original motivation for the block syntax was to aid legibility of models and I agree that it’s fantastic for communication. However, it can be a pain for development. Specifically, it makes it hard to develop a standard library for Stan, to write modular programs, to use high level abstractions and hide unnecessary implementation details (like a non-centred parameterised normal) and reuse code.
Therefore, our idea for Stan 3 is that you develop your model in the new modular style – or in the old block style if you prefer, as it should be very easy to make the system backwards compatible – and once you want to present your entire model, you pretty print it to block syntax. Does that make sense?

For example, the model above should be pretty printed to the following Stan 2 code, which is also how you would write it now, I suppose:

data {
  int<lower = 0> coeffs.K;
  int<lower = 0> lr.N;
  vector[lr.N, coeffs.K] lr.x;
  int<lower = 0> hat.N;
  matrix[hat.N, coeffs.K] hat.x;
  vector[lr.N] lr.y;
}

parameters {
  real coeffs.mu;
  real<lower = 0> coeffs.sigma;
  vector[coeffs.K] coeffs.beta_z;
  real<lower = 0> lr.sigma;
}

transformed parameters {
   vector coeffs.beta = coeffs.mu + coeffs.sigma * coeffs.beta_z;
}

model {
  lr.y ~ normal(lr.x * coeffs.beta, lr.sigma);
  coeffs.mu ~ normal(0, 5);
  coeffs.sigma ~ normal(0, 5);
  coeffs.beta_z ~ normal(0, 1);
  lr.sigma ~ normal(0, 10);
}

generated quantities {
  vector[hat.N] hat.y  = normal_rng(hat.x * coeffs.beta, lr.sigma);
}

Charles · May 9, 2018, 9:45pm

I’ve been trying to resist replying to your comment, but it’s just too interesting, so I can’t seem to stop myself from writing a reply that’s longer than your original comment.

A bit of an aside: I’m not sure if this is a fair characterization of other probabilistic programming languages. It seems to treat “computation” in too monolithic a way (it’s unfortunate that there doesn’t seem to be a crisp term in statistics for “computing marginal distributions”). All probabilistic programming languages abstract the specification of a joint distribution of interest from the specific algorithms used to compute marginal distributions of the model, just as Stan does.

Indeed, it could be argued that this separation is what causes the major difficulties with current technologies. In my view (I think you might disagree), the syntax of Stan is much “closer to the metal” of what its computation does, which is one of the major reasons for its success.

Nor does the proposed design.

This isn’t an inherent feature of compositionality. It wouldn’t be hard to allow data declarations at the top of the program, which might reference variables declared in submodels. If you wanted, you could even require data declarations for “top level” Stan code, whatever you wanted that to mean. I don’t have an opinion right now about which way is better pedagogically or in terms of code readability.

As a general motto, this cannot be correct, because programming language semantics is inherently compositional wrt the syntax. I assume that you mean this in a more specific way, which could be stated (far less felicitously than you have) as: Computation of marginal distributions cannot be decomposed with respect to the variables in the model. This is true in general, and an important insight, but of course part of the point of probabilistic programming is that there are lots of important special cases where this is not true. i.e. If your model has a known conditional independence structure, then you can decompose the computation of marginal distributions according to specially chosen subsets of variables.

Actually, the difference between the “parameters” and “generated quantities” is a way in which Stan currently exposes details of the computation which have nothing to do with the model. After all, the local variables defined in genquant are random. Really the Markov chain simulated by Stan is (params, genquants). Moving genquants into params does not affect the target joint distribution if it is viewed in this augmented way.

On the other hand, if we ignore the declarations, any random variable v in the model which is not data and for which p(v | data, stuff left in model) is easy to compute can be moved to genquant.

I am hopeful about inferring genquants from unannotated code, but I suppose it might turn out that this problem is too difficult for static analysis and you would then want to allow an explicit declaration.

tl;dr: “Statistical model” does not mean “density”. It means “joint distribution over data”.

(so maybe the aside at the beginning was not one…)

Matthijs · May 9, 2018, 10:18pm

One (the old y) will now be named lr.y the other (the old y_tilde) hat.y in the top level model. Does that make sense?

Similarly, for N and N_tilde. (Incidentally, I think there is a typo in Bob’s model and N_new should just be N_tilde, right?)

(Also see my expanded pretty printing example above.)

sakrejda · May 9, 2018, 10:19pm

That’s fantastic and addresses my concern, thanks for the example!

anon75146577 · May 9, 2018, 10:31pm

This needs to work without the labels as well. Personally, even having read the previous model, I couldn’t work out what x_tilde was supposed to be in the “Unblocked, Unfolded” version. I read the next line and couldn’t work out why it was being used undefined.

This sort of thing would be basically impossible to debug for most users (It’s hard enough given the current warning you get in RStan when you forget to pass one of the things in the data block, but here it would fail weirdly).

I probably have other comments, but I’ll think through them. The thing I like unconditionally is submodels part.

Edit to say: The thing that is most important to me is knowing what behaviour different, common, mistakes will trigger. I worry about things that fail silently or fail weirdly. I’m not sure that some of these things won’t do that.

anon75146577 · May 9, 2018, 10:51pm

I think the two clearest reasons to consider breaking the abstraction is facilitating fake-data simulation and facilitating SBC for complex models. The former, especially, is a big win from a work-flow point of view. For complex models that are under development, there are obvious advantages to not having to keep two different Stan files (or a Stan and a R file etc) in sync in order to do good workflow things.

Now, this current proposal can’t do that yet, but it’s definitely a thing that is on the minds of the design peeps.

What are your thoughts on the plusses and minuses of enabling this?

As I was writing a v fast response to Andrew’s blog post on Friday, I repeatedly forgot one parameter in the file. I didn’t think the current warning was particularly transparent. It may be possible to make a warning that is at least as good as the current one even without an explicit data block being required (but as @Bob_Carpenter pointed out, it is still a possibility and I would argue it’s still good practice in most situations)

I think you mean “Compositionality is not consistent with HMC”. (It’s obviously consistent with Gibbs and it could possibly be exploited in some approximate algorithms. A limited form is used in INLA.) But if we are abstracting specification from computation, why is this a consideration? What am I missing here?

Is this because if theta1 is defined in submodel 1 and theta2 is in submodel 2, then the generated quantity theta1 + theta2 breeaks compositionality somehow? Is it because instead of being written as [[S_1]]\circ [[S_2]] the program with generated quantities would be [[S_1]]\circ [[S_2]] + f([[S_1]],[[S_1]])? I don’t have a good grip on this concept. @Matthijs is this a problem?

wds15 · May 10, 2018, 7:55am

It sounds like the Stan language parser will know more about the structure of the model. Did you guys also consider how we can add automatic parallel computation based on this? The map_rect is great, but automating this process will be much better and with more knowledge on the structure of things this could be doable.

In addition, the possibility of closures for functions would be great. For example, in the current design functions could be defined after the transformed data block. If these data items defined up to this point could be made visible to user functions, then this would drastically simplify the use of map_rect since packing / unpacking of data would become obsolete.

Another bottleneck in Stan programs are the gradient calculation. So, for example, it would give ODE program a major speed bump if the user could supply the gradients of the ODE functor (which is usually a very simple algebraic expression). Self-speaking this would be useful to allow for any user defined function.

Matthijs · May 10, 2018, 9:40am

We talked about it briefly. It would certainly be a goal eventually, but probably not for version 3.0. The problem is that Stan code is imperative (as opposed to functional) hence parallelisation is a non-trivial task in general, though I’m sure some heuristics exist. I don’t know enough about this myself, but maybe we can come up with some fancy tricks.

Function types, closures and tuples are currently a goal for 3.1, rather than 3.0, though the language will definitely be designed to make it easy to add them.

We’ve talked about this. I cannot remember with certainty what the verdict was, but I think that ultimately most people felt like it may not be a priority as few users might actually make use of this functionality and it might be confusing. However, I agree that ultimately you probably want this sort of function definition with custom gradients in any mature differential programming language (like Stan).

avehtari · May 10, 2018, 10:08am

My impression is that the impact of those few users would be much wider than just their own work. User defined gradients in Stan language would make experimenting much faster, would speed-up many models, and eventually some functions with gradients might get implemented in C++.

martinmodrak · May 10, 2018, 10:09am

Hi, this looks very nice and is close to some of the things I’ve been thinking about privately. So good job! And great that you are sharing that with the community.

One thing that bothers me is the automatic passing of data to submodels. It feels way too magical :-) I would rather have something more verbose, but cleaner, e.g.:

data {
   int<lower=0> N;
   int<lower=0> K;
   vector[N] y;
   vector[N, K] x;
}

model coeffs = hier_normal_ncp(K);
model lr = linear_regression(N, y, coeffs.beta);
model hat = lr_predict(N, coeffs.beta, lr.sigma);

The thing I like is that now the submodels have a clean scope and interface and it is clear which changes to the submodel may break upstream models. It is also now explicit which data is shared between multiple submodels and which is separate.

One particular problem I have with the implicit passing of data is that it is IMHO unclear what happens when you instantiate an array of submodels. You might want to have some data separate and some data shared. A practical example I have in mind is from my models of gene regulation, where you might want to formulate separately a single target and then compose those to have arbitrary number of targets e.g.:

data {
   int<lower = 0> N_targets;
   int<lower = 0> N_time;
   vector[N_time] regulator_observed;
   vector[N_targets, N_time] targets_observed;
}

model gaussian_process(vector observed) { ... }
model gene_regulation(vector regulator_true, vector target_observed) {  .... }

model gp = gaussian_process(regulator_observed);
model reg[N_targets] = gene_regulation[];
for(i in 1:N_targets) {
   reg[i] = gene_regulation(gp.true, target_observed[i]); 
}

I understand this may be a bit extreme and fringe use case, but wanted to mention it for consideration.

I would also be a fan of making data declarations mandatory (either in a separate block or with a qualifier sprinkled here and there). I happen to be a big fan of the compiler being a bit of a cop, forcing you to be clear and in exchange discovering bugs on your behalf. I think declaring data explicitly will be friendlier to possible static analyses of the code, letting the compiler be smarter and more helpful for the developer.

My experience is that the time spent writing code is much shorter than time spent debugging, so increasing verbosity (writing time) but getting fewer bugs in return is usually good tradeoff.

wds15 · May 10, 2018, 10:20am

I do concur with Aki here. I mean Stan is being used a lot inside of R packages by now. These R package authors are likely willing to spend some time figuring out gradients provided in a user-friendly manner (no need to go to c++). Thus the number of users using this feature maybe small, but maybe they serve has a hub in a sense. Such a feature would take off a lot of pressure on getting things into stan-math as we could build function libraries entirely written in Stan. I bet we would get immediately a big Gaussian Process kernel library which would be so cool.

… but of course, this all is a matter of resource vs benefit; and I have no clue how hard this feature is to implement.

One more point: Right now user functions are great, but they force everything declared in them to be of type var. So with a more clever parser we should be able to avoid the over-propagation. I know this is detail and not design, but I wanted to mention it in this context.

Topic		Replies	Views
SlicStan Publicity	6	1908	September 10, 2017
Generalizing the Stan language for Stan 3 Developers features	56	4727	September 1, 2017
Prior initialization closer to declaration Developers	14	3141	February 24, 2017
Soliciting syntax ideas for user defined gradients and user defined transformations Developers	9	938	September 30, 2019
First-class functions, higher-order functions and closures General	5	995	April 18, 2018