Generalizing the Stan language for Stan 3

#1

I mentioned in the meeting yesterday that I’ve been thinking about how to redesign the Stan language from the ground up to allow some increased flexibility. Here’s my initial thoughts.

New Stan Language

This is a very tentative proposal for a new Stan language. I have a few goals in doing this:

1. wanting the model to follow the generative story
2. wanting to keep use of variables close to definition
3. wanting to support a non-centered prior or zero-inflated Poisson as a function

The syntax is somewhat motivated by PyMC3 and Edward, yet it retains Stan’s underlying commitment to strong static typing, natural probability and matrix syntax, freedom from restriction to graphical models, and modularity of model code from algorithmic concerns. I think I can make it generate a compatible C++ file, so it really is just a new syntax.

Currently, a Stan program is organized by data, parameters and model:

data {
int<lower = 0> N;
vector[N] x;
vector[N] Y;
}
transformed data {
vector[N] x_std = (x - mean(x)) / sd(x);
}
parameters {
real alpha;
real beta;
real<lower = 0> sigma_sq;
}
transformed parameters {
real<lower = 0> sigma = sqrt(sigma_sq);
}
model {
beta ~ normal(0, 1);
alpha ~ normal(0, 2);
sigma_sq ~ inv_gamma(1, 1);
y ~ normal(alpha + x * beta, sigma);
}
generated quantities {
real<lower = 0> tau = inv(sigma_sq);
}

Here’s what I’m thinking it could look like:

param real alpha ~ normal(0, 10);
param real beta ~ normal(0, 10);
param real(0, ) sigma_sq;
real(0, ) sigma = sqrt(sigma_sq);         // transformed param
real(0, ) tau = inv(sigma_sq);            // generated quantity
data int(0, ) N;
data vector[N] x;
vector[N] x_std = (x - mean(x)) / sd(x);  // transformed data
data int(0,1)[N] y ~ normal(alpha + beta * x_std, sigma);

It’s around 10 lines long rather than 25. It satisfies desiderata (1) and (2) above—follows the generative story and variables used closer to where they’re declared.

What it loses is the clear delineation of the types of variables you get in the original code. If you’d seen this one first, I’m not sure if it’d be easier or harder to read. I’m so used to our old notation!
The old notation was motivated by thinking about the variable declarations in the blocks as more or less like variable declarations to a function. For data, it’s the arguments to the constructor for the C++ — or the y in p(theta | y) using Andrew’s notation for theta parameters and y data. For parameters, it’s the argument theta to log p(theta | y) + const after y has been fixed—the log posterior, in other words. The generated quantities block is used to declare the posterior predictive quantities—that’s really a return type for the generated quantities block. Then, we added transformed data and transformed parameters and things got a little muddier, though those were again output types for data and parameter transforms.

Pulling out the compound declaration/sampling statements and reordering, the result looks like current Stan, with clear data, transformed data, parameter, transformed parameter model, and generated quantities blocks:

data int N;
data vector[N] x;
data int(0,1)[N] y;

vector[N] x_std = (x - mean(x)) / sd(x);

param real alpha;
param real beta;
param real(0, ) sigma_sq;

real(0, ) sigma = sqrt(sigma_sq);

alpha ~ normal(0, 10);
beta ~ normal(0, 2);
sigma_sq ~ inv_gamma(1, 1);
y ~ normal(alpha + x * beta, sigma);

real(0, ) tau = inv(sigma_sq);

If you think of our block headers like data { ... } as introducing that marker on all the contained variables, then we get backward compatibility. Maintaining two different ways to write Stan programs is probably a bad idea in the long run, though.

Only parameter and data variables are marked as such. Locations of other variables may be inferred:

• variable does not depend on any parameter:
• transformed data
• variable depends on some parameter and the target depends on variable:
• transformed parameter
• variable depends on some parameter and the target does not depend on variable:
• generated quantity

Locally scoped blocks are no problem, their location w.r.t. an original Stan program would be inferred similarly to how variable location is inferred. For example, the following would have to go in the model block:

{
real y = 10;
y ~ normal(mu, sigma);
}

Functions can be defined so that they have access to previously defined variables (this requires finding the right block and using closures to define the function).

A non-centered normal prior could be encoded like this, satisfying desideratum (3).

vector[N] beta = non_centered_normal(N, mu, sigma);

vector non_centered_normal(int N, real mu, real sigma) {
param vector[N] beta_std ~ normal(0, 1);
return mu + sigma * beta_std;
}

Technically, this is tricky because it’s introducing a parameter in the scope of a function, but if the parameters just get read off an iterator that gets passed in here through a closure, it should be no problem assuming there’s no branching that could change number of parameters. But if we do this, it also allows conditional parameters to be defined, and we could even roll it into the language in such a way as to avoid messy scope interpretation:

param x when (cond);

Again, just have to make sure that cond doesn’t depend on parameters. Still not sure I like making parameters conditional because there’s the tricky business of when they get used; they’re only available if the runtime cond holds.

SlicStan
#2

Wow.

#3

Trying to avoid the hot take mentality here.

1. I think we might need to do some serious leg work to avoid a Python3 style schism. People have legacy code at this point. I don’t know how feasible it is to avoid that issue.

2. The interesting thing about this proposal to me is that it brings Stan towards looking more like a regular program (though we still have a special state were are storing, i.e.-the log density). That’s really cool. Personally I could see it making my own model programming easier.

3. This is inspiring me to think about whether this couldn’t just be handled directly in C++ w/ a separate transpiling step. That’s OT here.

4. The model blocks have been useful when I go to explain Stan to people who have never heard of static typing. You can break it up a little for them in terms of presentation. I guess this just makes it a little harder to present to new Stan programmers but maybe that’s not a problem.

5. Another way of achieving goal 3 is to have modules that can be imported, and allow the modules to have their own parameter block (which would be merged with whatever main program parameter block there was). That would allow us to keep the current syntax and I think it would make this proposal harder to justify when weighted against the chaos it introduces. This alternative solution would be less elegant and require more code but would avoid any Python3 style schism.

This sketch was really helpful to me for seeing how this could look. Are there more arguments we could make for benefits of the new syntax?

#4

Ug: Without a separate transpiling step.

#5

I think we might need to do some serious leg work to avoid a Python3 style schism

As an active Python 2er, if my distro asked me to change, I would. But it doesn’t, so I don’t.

Personally I could see it making my own model programming easier.

Yes, this. For lack of better comparison this looks like Stan, Hadley edition or something.

1. The model blocks have been useful when I go to explain Stan to people who have never heard of static typing.

Might sound dumb, but what I know about statistics/Bayesian inference I learned hand-in-hand with Stan. I really like the model blocks in terms of explaining what’s going on. That was my only reservation when I saw the first new model (but I really liked the 2nd reordered model).

I’d happily switch to the new syntax now though :D. The boilerplate is annoying to write out for small models.

Functions can be defined so that they have access to previously defined variables (this requires finding the right block and using closures to define the function).

A+++++++++++++

#6

Wouldn’t we be able to apply some of these things to the existing language? If we can use the AST to find whether it’s transformed vs not, then couldn’t we simplify the blocks and just have functions, data, parameters, model, and generated quantities?

#7

My one big opinion is that I don’t like real(0,) replacing real<lower=0>. Although the former is less typing, it’s also less clear what it’s doing.

(I’m strongly in favour of more typing for clearer code because that’s why we have keyboard shortcuts)

Other than that, I liked the blocks and the programming style they force on people. But that could just as easily be me being used to the existing style. I will say that the blocks always madeStan code easier to read than BUGS code.

I also don’t understand what point 3 means.

I don’t see how the 10 line code achieves point 1 (at least any more than the standard syntax does). It that’s probably me not understanding what you mean by following a generative story.

#8

I like very much Bob’s proposal.

I was just about to write that (0, ) is a notation for open interval, but then it doesn’t explain int(0,1), but I could live with that.

Aki

#9

Wait, maybe I do understand.

Is it more generative because it defines the priors, then declares data, then defines the model?

#10

This should be

data vector[N] y ~ normal(alpha + beta * x_std, sigma);

right? Or am I not understanding something about the proposed syntax?

#11

Yeah, I think that’s the idea.

#12

Isn’t the story:

• set priors
• design experiment (ie choose likelihood and priors for likelihood
parameters)
-gather data

In which case the data declarations are in the wrong place (they should be
at the end)

#13

I feel the same way, both about the blocks making Stan code easier to read and also that this may just be me being used to it.

If I understand Bob correctly then the main thing about point 3 is that being able to declare parameters inside functions would be really convenient. In the example Bob gave

vector non_centered_normal(int N, real mu, real sigma) {
param vector[N] beta_std ~ normal(0, 1);
return mu + sigma * beta_std;
}

the parameter beta_std is declared inside the function and nowhere else. So the entire non-centered parameterization can be wrapped up into a single function. The way Stan is currently designed this isn’t possible.

#14

Hmm, I do see your point.

#15

That would be SUPER convenient!

#16

What does it mean to have a conditional parameter that’s only defined on some iterations? Having trouble picturing how HMC interacts with that…

I am definitely in favor of getting rid of the transformed X and generated quantities blocks, though perhaps we could do that without getting rid of all blocks?

I think I mostly like the move to add annotations about what kind of variable each one is individually. It might be nice to keep a block syntax for declaring many at once even long term (that probably looks similar to existing Stan blocks).

I don’t think I like the real(0, ) syntax much. Prefer the old <lower=0> personally as it’s more explicit and doesn’t look like a syntax error.

#17

@seantalts I don’t actually think there would be different parameter spaces
for different iterations in this proposal. The conditions for defining
parameters wouldn’t be allowed to depend on other parameters, so presumably
they wouldn’t change between iterations.

#18

The problem there is incompatiblity. My proposal is backward compatible for models. But it may certainly split people on how they write their models, as ~ and target += is already doing.

Yes, I don’t plan on changing anything about the model concept, inference, I/O etc. It’ll need a new generator and a lot of inference on the inside.

There are multiple issues here. Static typing just means the type doesn’t change. The blocks tell you what a variable is used for and give you a scope of where it’s defined. Now that I think about it, there’s also the issue of where validation would happen for something like transformed data or transformed parameters. That happens now before you get to the model block.

In that ways, it’s like PyMC3 and Edward.

I don’t see how to do that. How would we distinguish data from transformed data? And where do we put something that might wind up being a generated quantity or transformed parameter depending on whether it’s used in the log density or not?

As I responded on the blog post, I don’t treally like either approach. Maybe for the positive case we can just use positive_real (like positive_ordered).

When you generate data from a linear regression, you first decide on the number of predictors, then generate the coefficients and noise scale, then you generate the size of the data, then the data. (The predictors don’t fit in here neatly unless you give them a model.)

Yes, that’s why this is a really bad notation. I don’t like real<0, > either, and real[0, ] is also confusing with arrays. Grr.

Exactly (see above).

Right.

Exactly the point.

They’re in the wrong place in the current setup and the right place in my first example above. Each data declaration is pushed down as far as possible if you think about it in terms of a canonical form.

Jonah got this one:

I should’ve stressed this point more overall:

I don’t think I emphasized the above enough in the original post!

You must really hate modern C++, which still mostly looks like a syntax error to me. I think the consensus is in on this one—no (0, ) notation. Oh well, it’s awfully long as it is. Maybe positive_real will be enough to simplify most code.

Exactly.

#19

If we want to keep the explicitness but shorten it a bit without deprecating anything, then we could allow up and lo as valid abbreviations for upper and lower, for example:

real<lo=0,up=1>

#20

On backward compatibility, what about having the version of the Stan
language used specified at the top of the Stan program code? Specifying
the version could turn on a “strict mode” or enable some backward
incompatible features.