New array syntax might mean Stan language 3.0?

I’m just starting to review @rybern’s awesome work on this and agree with Dan that moving to only accept the new syntax is definitely a 3.0 change. I wonder if we can do a little work to avoid breaking backwards compatibility - Ryan (or others), do you think there’s a way to make stanc3 parse both old and new formats seamlessly? Or perhaps we can expose this new syntax only if you add some kind of special signifier comment at the top of the file, like ### using Stan 3.0 ###?

If not, I think there are a couple of other things we wanted to get into Stan language 3.0. I kept something of a partial list as they came up but I don’t think it’s complete and hope everyone can chime in:

  • Remove everything currently marked deprecated (get_lp, if_else, increment_log_prob, log-suffixed functions, lp_… more?) - see the sketch after this list
  • Unify shadowing rules w.r.t. nullary functions
  • Maybe: remove the data keyword? The idea here was to have the compiler infer data-only-ness automatically, though I think maybe it’s not doing this across function boundaries yet (?)
  • Maybe: remove target increments (and access?) from user-defined functions (Sean: I forget why this was on my list, maybe to make compilation easier?)
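
For anyone skimming, a rough sketch of what the first item means for user code; the replacements below are the ones the current deprecation warnings already suggest:

  increment_log_prob(lp);     // becomes: target += lp;
  real lp = get_lp();         // becomes: real lp = target();
  x = if_else(c, a, b);       // becomes: x = c ? a : b;
  normal_log(y, mu, sigma)    // becomes: normal_lpdf(y | mu, sigma)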

Definitely seems very incomplete; these notes are like 2 years old, haha. Just wanted to raise the topic and give what I have and hear what other folks may have been saving up for Stan 3.0.

I’d also like to make sure when we do release 3.0 that it’s very easy to run a command to at least translate a file from Stan 2 to Stan 3. I think with our compiler framework that shouldn’t be too hard, though we may need to keep separate copies of the front-end around in the code base for that purpose…

Hey Sean, quick clarification -

The current PRs do support the current syntax and don’t deprecate anything. You just can’t currently use the old syntax with multiple declarations or the upcoming tuples, and the auto-formatter spits out the new syntax.

3 Likes

Wooops. Never mind then, let’s postpone this discussion! haha.

1 Like

Actually, I forgot, they do deprecate something - array becomes a reserved word. That’s a smaller issue but maybe still an issue.

2 Likes

Ah okay. Yeah I saw in the tests that a lot of files changed to have the new syntax but I should have checked with you.

I think array as a new keyword is still backwards incompatible, though, right? Previously good Stan programs won’t work anymore. So I think this is still something we should talk about - can we find a way to make the parser still accept programs with variables named array?

Stanc3 already added new reserved words and it wasn’t deemed major-version-bump-worthy then. Is array that common of a name?

I do have an idea but it’s really hairy.

The issue is that if the first word in a statement is array, the parser needs to know whether it’s being used as a variable or a keyword, and the context needed to figure that out can be arbitrarily far ahead in the token stream.

One trick would be to recognize array[ as the token rather than array, but this still collides with e.g. an assignment to an indexed variable called array. (Another small downside of tokenizing array[ is that you can never have array types with a space like array [1,2] int.)

The potential way to solve that collision would be to treat array[x,y,z...] as its own reduction, and then recognize it in indexed-variable contexts as well as type contexts. I think the parser could then distinguish the usages because in the type case you’d always expect a type next, and in the indexed case you’d never expect a type next.
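
To make the collision concrete, here’s a hypothetical two-liner (array is a user variable in the second statement):

  array[f(x)] int idx;   // array as keyword: declares an int array
  array[f(x)] = 5;       // array as variable: assigns to one of its elements

After consuming array[f(x)], both parses are still live, and since the bracketed expression can be arbitrarily long, the disambiguating token (int vs. =) can be arbitrarily far ahead.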

It’s definitely a hack that would add complexity to the parser, and it would be tricky to write and it might not work. Buuut I could try it if reserving the keyword is an issue.

Edit: Another downside of doing this is syntax highlighters would get confused.

1 Like

I was under the impression those should have already been reserved words but just hadn’t been due to implementation details, essentially. Maybe that wasn’t right - which ones are you thinking of? I’m not sure how common array is, though I think I’ve probably used it before, haha.

If most folks think it’s okay to add this to the reserved list (presumably with a good error message that tells you specifically how to upgrade your program, ideally on which lines) then I’m happy to go with that. I just don’t want to generate a bunch of “The NEXT version of the Stan compiler won’t accept your program” support issues, haha. Thinking ahead about how to avoid that seems good, even if we don’t stick to the letter of the semantic versioning guide.

One note is that bumping to Stan 3 doesn’t need to be a major event, though it would be fun if we did make it one. The (only?) downside of adding tons of awesome features in a backwards compatible way is you don’t get to have a huge release day where you announce all of the great changes in your next major version, haha.

This sounds pretty hairy and probably not worth it… Maybe we can make the pretty-printer rename variables called array, and print out the command for doing that whenever we see array used that way? Then the upgrade is one command at least.

The issue there is that the pretty printer needs to parse the program first. Not sure how to get around that without including a whole second parser.

By the way, error messages for keyword identifiers are easy to change (by design), and the current error message for array is: Expected identifier, but found reserved keyword 'array'

Ones that come to mind are lower, upper, offset, and multiplier. Though there is an issue open now for removing them from the reserved keywords.

I think there were some others, but I’m not sure.

1 Like

I have an intuition that we may actually get a bunch of value out of keeping two parsers around in the future for situations like this - offset, multiplier, etc… Would that also have made your life easier adding the new array syntax, and maybe with the issue to remove multiplier and offset as keywords? We could kind of freeze a parser at the last Stan 2 version before offset and multiplier were added and use it as part of a stanc3 upgrade subcommand? Curious what @Bob_Carpenter and others think about that.

I also want to tag some people more on the Stan user side of the spectrum to get thoughts about array being a new keyword - @avehtari @anon75146577 @andrewgelman. The gist is that you can no longer name functions or variables array, and old programs that used that name will give you an error message explaining how to fix it. Maybe we can even provide a command that will “upgrade” your Stan file, if that helps?

2 Likes

I’m confused as to what the proposal is. I do currently get confused about the difference between vectors and 1-d arrays, or matrices and 2-d arrays. I’m never sure whether it’s better to do vector[5] theta or real theta[5] (or maybe it goes the other way - I can never remember).

The proposal looks like the winner of this poll,

array[N] matrix[I, J] <variable_name>;
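
Side by side with the current syntax, that reads (variable names are just illustrative):

  int y[N, K];                  // current
  array[N, K] int y;            // proposed

  matrix[I, J] m[N];            // current
  array[N] matrix[I, J] m;      // proposed

So all the array dimensions move out front, before the element type.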

I think for me the worry is specifically that we would make any models that use “array” as a variable or function name now emit an error message. By strict semantic versioning standards, I think that counts as a backwards-incompatible feature (rather than a bug fix), so it would indicate we should bump to Stan language 3.0. Which I don’t think is a huge deal, but if we do that it would be nice to do a couple of other minor breaking changes at the same time. @Bob_Carpenter do you have thoughts on this?

2 Likes

Absolutely. Anything that breaks backward compatibility should up the major version number on everything. There’s a lot of other deprecated junk we can clean out, too, as @seantalts also mentions.

Doing this will break all the case studies, books, etc. that use arrays, which means pretty much everything.

That would take most of the sting out of the move, at least for power users who use Stan directly (and the brms/rstanarm, etc. users wouldn’t notice). We’ll have to update all of our core doc, too. I can volunteer to do the latter.

If we introduce new R packages for version 3, it won’t break any existing code. I think CmdStan users will be more amenable to running command-line programs to modify their old programs. In fact, we could just build that into the parser, so that a flag would accept Stan 2 by translating it to Stan 3 first. Not ideal, but workable.

As @rok_cesnovar points out, we’ve been pretty slack on the reserved words. Every time we introduce a function, we stop allowing variables to be named after it.

The data keyword is really doing two things now that need to be teased apart. One is that it’s not an autodiff variable (as in integrate_ode) and the other is that it is constant data (as in map_rect).
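
For concreteness, a minimal sketch of the qualifier as it stands today (the function itself is made up):

functions {
  // data on x currently promises both that x is never an autodiff
  // variable (what integrate_ode needs) and that it is constant data
  // (what map_rect needs) - two notions, one keyword
  real avg(data real[] x) {
    return sum(x) / num_elements(x);
  }
}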

They [target increments in user-defined functions] have a side effect on lp__ (in addition to print or exception side effects). Also, I don’t think they’re used much.

Always a pain with bottom-up parsing.

I like the idea of just adding a 2.0 to 3.0 converter and having that be an option when running Stan. But I may not be thinking this through clearly and am not at all wedded to the idea.

Why before offset and multiplier?

2 Likes

The driving organizational principle is that vector, row vector, and matrix types can be used for matrix arithmetic and linear algebra. The rest is just a matter of efficiency.

There’s really no reason to ever use real theta[5]. It’s just there because it’s easier to document and build and it’s more natural for some people. The way things stand, we have five basic data types (int, real, vector, row_vector, matrix) plus arrays of any type.

We could theoretically eliminate the following type without loss of expressive power:

real a[M];  // use vector[M] or row_vector[M]

and also these:

real a[M, N];    // use matrix[M, N] a;
vector[N] a[M];  // use matrix[M, N] a;
row_vector[M] a[N];  // use matrix[M, N] a;

It’d just be harder to doc and users would have to remember what’s involved.

We could also eliminate vector and row vector types and just treat them as edge-case matrices.

vector[M] a;  //  use matrix[M, 1] a;
row_vector[N] a;  // use matrix[1, N] a;

Then all of our matrix operations return matrices. For example, a row vector times a column vector would be a 1 x 1 matrix rather than a scalar. And we’d have to use [1, 1] to pull out its value. But it would reduce type overhead.
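
A sketch of that hypothetical edge-case world:

  matrix[N, 1] v;           // what vector[N] would become
  matrix[1, N] u;           // what row_vector[N] would become
  matrix[1, 1] p = u * v;   // row vector times vector: a 1 x 1 matrix
  real x = p[1, 1];         // index to recover the scalar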

Given that the stanc3 compiler can be compiled to JavaScript, we can easily make a Stan2-to-Stan3 page. Maybe that is simpler for most users than a command-line thing? I don’t know. For example, this demo page I use to check whether a user-reported model compiles - https://rok-cesnovar.github.io/stanc3js-demo/index.html - took me literally 5 minutes to make (a bit more on sloppy design stuff, which I am terrible at).

If we can add comments to the AST (https://github.com/stan-dev/stanc3/pull/603) so that the canonicalizer and auto-formatter can preserve them (the current auto-formatter strips the model of comments), we can easily have a Stan2-to-Stan3 transform page.

2 Likes

One problem that arises for me in practice: I can write a model with just vectors and matrices when I have continuous data, but with discrete data I need to break out the arrays. Here’s an example from a recent model I was fitting:


data {
  int N;
  int K;
  int y[N, K];
}
parameters {
  real mu;
  vector[N] a;
  vector[K] b;
  vector<lower=1>[K] omega;
  real<lower=0> sigma_a;
  real<lower=0> sigma_b;
}
model {
  real E_y;
  real var_y;
  real E_y_mat[N, K];
  real phi_mat[N, K];
  for (n in 1:N) {
    for (k in 1:K) {
      E_y = exp(mu + sigma_a * a[n] + sigma_b * b[k]);
      var_y = E_y * omega[k];
      E_y_mat[n, k] = E_y;
      phi_mat[n, k] = square(E_y) / (var_y - E_y);
    }
  }
  to_array_1d(y) ~ neg_binomial_2(to_array_1d(E_y_mat), to_array_1d(phi_mat));
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  sigma_a ~ exponential(1);
  sigma_b ~ exponential(1);
  omega ~ exponential(1);
}

The data are encoded as y, a matrix of integers. I wanted to use the negative binomial model, so I used to_array_1d to let me do the computation. But then I needed to make the relevant inputs, E_y_mat and phi_mat, into arrays, not matrices. OK, I could’ve made them matrices and used to_vector, but that creates another problem: to_array_1d flattens a 2-d array row by row, while to_vector flattens a matrix column by column. I would’ve needed to throw in transposes to get it to work. That seemed more confusing, so I stuck with arrays throughout.
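
Concretely, for a 2 x 2 case (the orderings are as documented in the functions reference):

  real a[2, 2];
  matrix[2, 2] m;
  // to_array_1d(a) gives {a[1,1], a[1,2], a[2,1], a[2,2]}   (row-major)
  // to_vector(m)   gives [m[1,1], m[2,1], m[1,2], m[2,2]]'  (column-major)
  // so to_vector(m') lines up with to_array_1d(a)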

This happens to me fairly often, that I go back and forth between matrix and array representations because neither does exactly what I want. Matrices and vectors have the convenient math, but they don’t always play well with integer arrays of data.

4 Likes

I think that’s an argument for vectorizing 2D containers more than anything else.

Writing readable code is as hard as writing readable math papers and for the same reason—neither can sacrifice precision for clarity. And like writing papers, it takes practice and feedback, ideally in a mentoring situation with concrete examples people care about. In that vein, here’s how I’d code that model.

data {
  int<lower = 0> N;
  int<lower = 0> K;
  int<lower = 0> y[N, K];
}
parameters {
  real mu;
  vector<multiplier = sigma_a>[N] a;
  vector<multiplier = sigma_b>[K] b;
  vector<lower = 0>[K] omega;
  real<lower = 0> sigma_a;
  real<lower = 0> sigma_b;
}
model {
  for (k in 1:K) {
    vector[N] lambda = exp(mu + b[k] + a);
    vector[N] phi = lambda / omega[k];
    y[ , k] ~ neg_binomial_2(lambda, phi);
  }
  a ~ normal(0, sigma_a);
  b ~ normal(0, sigma_b);
  sigma_a ~ exponential(1);
  sigma_b ~ exponential(1);
  omega ~ exponential(1);
}

This

  • adds constraints on data for error checking and doc,
  • adds multipliers to declarations of a and b to allow natural expression of the non-centered parameterization,*
  • renames E_y to lambda because I find the former notation confusing when the value E_y takes on in a Stan program is only an expectation of y for the draw at hand, not in general,
  • reorganizes terms in lambda so that the two scalars mu + b[k] are added first before being added to the vector a (addition is left associative),
  • replaces the lower bound of 1 on omega with a lower bound of 0 and then removes the subtraction of 1 (should give the same result given the scale invariance of the exponential prior),
  • replaces the resulting square(lambda) ./ (lambda * omega[k]) with lambda / omega[k], which is way more efficient, and
  • rather than building up a matrix as a local variable with lots of fiddly indexing only to tear it down, vectorizes over N and calls neg_binomial_2 a total of K times. There’s only a tiny loss in efficiency (which should be more than made up by not building and tearing down matrices), because there’s no shared computation in neg_binomial_2 when all three arguments are vectors.

The algebra for phi, going from the way @andrewgelman wrote it to the version I have, is

phi = square(lambda) ./ (var_y - lambda)
    = square(lambda) ./ (lambda * omega[k] - lambda)
    = square(lambda) ./ (lambda * (omega[k] - 1))
    = lambda / (omega[k] - 1)

But then I replaced the lower bound of 1 on omega with a lower bound of 0 and dropped the - 1.

I haven’t actually tried it to see if it works.


* If you really do want to do this by hand, it should look like this:
parameters {
  vector[N] a_std;
  vector[K] b_std;
  // ...
}
transformed parameters {
  vector[N] a = a_std * sigma_a;
  vector[K] b = b_std * sigma_b;
}
model {
  a_std ~ std_normal();
  b_std ~ std_normal();
  // ...
}

That way, everything in the rest of the code is much easier to read, you get the naturally scaled versions of a and b in the output, and you get vectorized multiplication by sigma_a and sigma_b plus re-use of the values without multiplying again.

I don’t like the use of just a for what’s really the standard variate that gets multiplied to produce a. I generally like short names for variables, but here I make an exception and append _std to make it clear it’s the standard variate version.

1 Like

Bob:

Thanks! That was super useful. In addition, Ben Bales suggested using neg_binomial_2_log for computational stability, and that helped too. It’s possible that overflows were causing some of the poor convergence, I guess because if the target function is too far from its expectation, overflows can screw up the gradients so that NUTS can’t figure out how to come home.

Below is the model I fit. I had to reorder the parameters block because sigma_a and sigma_b need to be defined before they’re used as scales. I kept the E_y notation because I found it easier to follow. Also I kept omega as it was, because it has an interpretation as overdispersion, a parameter that’s always at least 1. Parameterizing the neg binomial in terms of expectation and overdispersion is the most natural way for me to do it. Indeed, now I’m not quite sure why we don’t do it that way in rstanarm.

data {
  int<lower = 0> N;
  int<lower = 0> K;
  int<lower = 0> y[N, K];
}
parameters {
  real mu;
  real<lower = 0> sigma_a;
  real<lower = 0> sigma_b;
  vector<multiplier = sigma_a>[N] a;
  vector<multiplier = sigma_b>[K] b;
  vector<lower=1>[K] omega;
}
model {
  for (k in 1:K) {
    vector[N] log_E_y = mu + b[k] + a;
    vector[N] phi = exp(log_E_y) / (omega[k] - 1);
    y[ , k] ~ neg_binomial_2_log(log_E_y, phi);
  }
  a ~ normal(0, sigma_a);
  b ~ normal(0, sigma_b);
  sigma_a ~ exponential(1);
  sigma_b ~ exponential(1);
  omega ~ exponential(1);
}

Every time I code one of these models, flame out, and get rescued by the Stan team, I learn so much! Thanks again.

1 Like