Generalizing the Stan language for Stan 3

Maintaining older versions is a lot of overhead, so we’ve never wanted to take that on. I’m also worried about forking the community on versions.

So far, the only backward breaking change since 1.0 has been to no longer allow direct access to lp__ as a variable.

I disagree with you about the canonical form: from a generative perspective, the complete model exists without data, which isn’t consistent with the proposal. To be consistent with this, you’d need to have something more like

param real alpha ~ normal(0, 10);
param real beta ~ normal(0, 10);
param real(0, ) sigma_sq;
real(0, ) sigma = sqrt(sigma_sq); // transformed param
real(0, ) tau = inv(sigma_sq); // generated quantity
int(0, ) N;
vector[N] x;
vector[N] x_std = (x - mean(x)) / sd(x);
vector[N] y ~ normal(alpha + beta * x_std, sigma);
data N, x, y; // Declare which parameters are data.

Now I am not advocating this (I think it’s a bad idea). But the current proposal puts a giant barrier between prior and likelihood modelling, which I don’t think is very good practice. (Your prior isn’t your generative model. Your full model is generative).

Can you expand on the advantage of ordering Stan programs in a more generative way? It is separate from points 2 and 3 (as far as I can tell) and seems more a philosophical thing than a “fixing a hole” thing.

Maybe I should’ve said “a canonical form”—they’re not unique.

Not sure what that means, but it’s a good semantic question. Is the number of data points collected N part of the model or not? That quantity’s not modeled. For Stan (or Edward), the program puts in a placeholder for data that gets set at runtime. For PyMC3, the data’s baked right into the program. For BUGS, everything only gets resolved at runtime.

I didn’t understand your objection here.

This was largely motivated by

  1. users getting confused about what goes where in blocks, so I thought we could figure it out automatically for them.

  2. Andrew wanting to be able to write compound declare/sampling statements for priors, along the lines of

    parameters {
      real alpha ~ normal(0, 1);
      ...
    }
    

I’m not wedded to this, by the way, lest anyone get the wrong impression I want to start building this tomorrow!

The way I’m thinking about it is that a model declaration is generative if it can generate any data that it isn’t given.

Because Stan doesn’t do explicit discrete parameters (yet :p), N will always have to be given as data. And for this type of model, x is usually data. But if y wasn’t declared as data I would want a generative model to generate y. (This would be unspeakably convenient for CGR stuff)
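
At the moment that means writing the generation out by hand in generated quantities, something like this (a sketch reusing the names from the example above):

generated quantities {
  vector[N] y_sim;
  for (n in 1:N)
    y_sim[n] = normal_rng(alpha + beta * x_std[n], sigma);
}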

In this sense, it doesn’t really matter where x and N are defined as data. I’d say that putting them at the top is essentially putting them as “preconditions”: you must have these two things to run the model.

The data declaration for y could go in that block, but if the model is generative it’s not a precondition (the model will run with or without y as data). But then this seems too complicated.

The objection (if it is that) is that separating prior and likelihood is moving against the current thread of our work, where we’re starting to get a bit strident about the idea that these should not be separated. To quote the title of the paper that will hopefully be finished very soon, *The prior can generally only be understood in the context of the likelihood*.

This makes partial sense to me. I like the data, parameters, model blocks. I can live with the generated quantities block, but I definitely think that it should be possible to do transformations within the data and parameters blocks. I have no experience with the functions block, but the idea of being able to declare “odd” parameters (like for non-centering) in there seems like a good thing.

I don’t like this. It reeks of the “power user” thing of wanting to type less / build models faster, rather than make models clearer.

Even when it’s annoying, the current Stan language does mandate good programming practices, whereas I can see this being a mess.

I haven’t seen any news on functional programming… is that planned as well?

I would like to curry my functions, pass functions around, and wrap them into closures. I know this is asking for a lot, but if this is a thread for wishes, then those are mine.


Could you clarify why this is a problem? I thought that with lexical scoping having these single statements wouldn’t be a big deal (it’s not like it’s doing voodoo to pull variables from surprising places).

It places concision above clarity (same comment I had on real(0, ) vs real<lower=0>). It also makes a program harder to read (as there are a number of places where modelling components can be added) and hence harder to debug.

NB: This is a stylistic preference. It doesn’t mean that I’m right. Or to put it differently, I’m not going to die on this hill if you don’t agree with me.

I can sort of see that, but with lexical scoping it still isolates the calculation to a specific chunk of code with well-defined inputs/outputs and calculations. I was trying to elicit more detail because this doesn’t seem like a big deal (nobody needs to die on any hills).

I quite like the idea that the parameters declared in the parameters block are things you put priors on, while the ones declared implicitly (say in the non-centering example above) are parameters you don’t interact with in the model block.
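
To be concrete, the non-centering pattern I have in mind looks something like this in current Stan (a sketch with made-up names; theta_std is the auxiliary parameter):

data {
  int<lower=0> N;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[N] theta_std;   // auxiliary parameter introduced only for non-centering
}
transformed parameters {
  vector[N] theta = mu + tau * theta_std;
}
model {
  theta_std ~ normal(0, 1);   // implies theta ~ normal(mu, tau)
}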

As I said above, I also have a fairly strong “ideological” preference for not separating the likelihood and prior statements, because your priors will often need to be updated as your likelihood changes (e.g. a half-Cauchy on a standard deviation is fine with Gaussian data, but much less fine with binary data), while I think this way of writing models promotes the idea that your priors should be fixed.


So does “conditional parameters” mean that the parameters exist or don’t exist depending on the data present when the program starts running?

Yeah I think so. Currently you can fudge it using zero-length arrays with something like this

data{
  ...
  int<lower=0,upper=1> has_alpha;
}
parameters {
  ...
  real alpha[has_alpha];
}
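
Any use of it then has to be guarded in the model block, something like this (the prior is just illustrative):

model {
  if (has_alpha)
    alpha[1] ~ normal(0, 1);   // only contributes when alpha actually exists
}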

I can definitely see why this is appealing to Andrew (and others?), but I’m in agreement with @anon75146577 here.

That’s what JAGS and BUGS do. They just define the model as a directed acyclic graph and you determine what’s data at runtime.

Right—just need them to be defined before they’re used.

I think this is also what Michael was objecting to when Andrew wanted to write

parameters {
  real mu ~ normal(0, 5);
  ...
}

The only separation required is that things are defined before they are used. In that sense, the priors really do come before the sampling distribution, because the parameters defined by the prior are used as arguments to the sampling distribution. Of course, Stan as it is now just collapses all this into a single log density definition with no ordering requirements, because the parameter values come in from the algorithms.
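
For example, in the current model block the statements can come in any order (generic names here):

model {
  y ~ normal(mu, sigma);   // "likelihood" can come first
  mu ~ normal(0, 5);       // priors after; each line just adds a term to target
  sigma ~ normal(0, 1);
}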

I don’t know what this would mean. The transformed data and transformed parameters blocks are where you transform data and parameters. Did you have an example in mind?

That’s not how I view it at all. It’s generally good practice in programming to declare variables near where they are used. This is trying to solve the problem of users forgetting to include priors and needing to continually go back-and-forth between the parameters block and model block to line things up.

That was my goal all along, for instance with typing. But it falls down in not allowing compound declare/distribute functions to keep declarations near usage.

Yes. Mitzi’s just finishing up refactoring the underlying type system to open the way for new types beyond int, real, vector, row_vector, and matrix. Tuples will be first, but I’d also like to add simple functional types (like typed lambda calculus rather than the untyped form you see in Lisp or the loopy typing you see in something like ML).

I think that’s a legitimate concern for real(0, ). But you also have to keep in mind that adding verbosity can reduce clarity because it makes the code hard to scan. This is why I’m always harping on about formatting.

I didn’t intend to promote that idea! What is emerging is that we’re trying to deal with a conflict between:

  • wanting to keep prior and likelihood target log density manipulations together

  • wanting to keep variable declarations close to their usage

Exactly as Jonah says.

I repeat that I like Bob’s proposal a lot

and add a comment specifically related to this. Currently the declaration
real<lower=0> sigma;
and the prior
sigma ~ normal(0,1);
can be far from each other, which has caused confusion.

It would be easier to see that
real<lower=0> sigma ~ normal(0,1);
has a half-normal prior (alternatively, we could require writing normal(0,1)[0,]).
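
For reference, the truncated form we can already write separately looks like this (redundant given the <lower=0> constraint, but it makes the half-normal explicit):

parameters {
  real<lower=0> sigma;
}
model {
  sigma ~ normal(0, 1) T[0, ];
}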


Conveniently most of my concerns about the proposal mirror what @anon75146577 has already said. In particular, defining the data and parameters in blocks before the model is perfectly generative. Dare I say it’s more in line with measure theory as you’re defining your spaces and then the probabilistic object that lives on those spaces (the joint target represented with a density).

But one thing keeps coming up in arguments for the proposal that bothers me:

real mu ~ normal(0, 5);

is not an assignment if mu is a random variable.

First and foremost, such an assignment need not be complete because later on we could write

mu ~ normal(0, 5);

again. Unlike for deterministic variables, this does not simply reassign a new value to mu but rather modifies its distribution, so arguing that an inline declare-and-assign is easier to parse is a bit naive, because it does not preclude the full definition from appearing arbitrarily far away in the program. One could argue for changing how ~ works to disallow double assignments, but I’m not sure that is even technically feasible, and it would break backwards compatibility.
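
Spelled out, the two statements just compound (a sketch, with the declaration elided):

model {
  mu ~ normal(0, 5);   // adds normal_lpdf(mu | 0, 5) to target, up to a constant
  mu ~ normal(0, 5);   // adds the same term again; the two density factors multiply
}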

But perhaps more importantly, regardless of the details we never assign anything to a random variable. The Stan program never evaluates the distributions and assigns explicit values to those variables (algorithms acting on the Stan program are really responsible for instantiating the random variables with appropriately distributed values). All we are doing is defining the target distribution via a density, which acts on the sample space but does not define a point in the sample space. Remember that random variables are functions, so inline assignments are more akin to lambdas than assignments. And given the possibility of calling ~ multiple times, they’d be akin to some kind of open lambda scope that can continuously be transformed.

Anyways, I think the far more appropriate abstraction is that the data and parameters blocks define the signature for the target density function pi(parameters | data), and then the rest of the blocks are simply defining that function. I think that this is the only way that we can robustly teach what we’re doing in Stan without having too many users inevitably trying to reason about the program as if it were deterministic, and unfortunately it fundamentally requires the blocks as they are currently configured.

Still just shorthand for target +=.

But there’s no way to write target += inline that I can think of.
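
Concretely, the desugaring is just this (taking normal as the example):

mu ~ normal(0, 5);
// is, up to an additive constant in target, the same as
target += normal_lpdf(mu | 0, 5);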

Not so much easier to parse, but easier to keep track of. You want to use variables near where they’re declared. That could be

real alpha;
target += normal_lpdf(alpha | 0, 1);

if you don’t like the sampling statements. Declare-and-sample or declare-and-increment-log-density-target are just ways of getting the use closer to the declaration.

This is the hard part for users to understand. The data comes from the outside at one step. Then the parameters come from the outside from the algorithm itself, not from within the Stan program.

That was exactly my original motivation in setting things up that way. But it’s not particularly apparent to the users from the block structure.

The abstraction in terms of what the program compiles down to remains exactly this—data declarations are arguments to the constructor, parameter declarations are arguments to the log density.

What makes it “loopy” and what are you thinking of using (assuming you’d model it after some existing type system)?

Also just to be clear: +1 to most of Bob’s proposal, around optional blocks with param and data annotations that allow definitions with priors/assignments together at last, and automatically deciding what is “transformed data”, etc.

I also like @betanalpha’s suggestion that we change ~ to only allow a variable to be “assigned” to in this way once - I don’t like the idea of “some kind of open lambda scope that can be continuously transformed.” I realize this is the way it is now and that there is an interesting use case for doing something like that, but I think the way ~ is typically used in math and the way people typically think about it would preclude that use case, right? And people could always just use the alternate form that explicitly adds to target if desired, which seems much more clear in its activity. [Usual disclaimer - I’m new, probably wrong, etc]

I actually hadn’t thought of it this way yet. Does that impact your pedagogy argument? :P It is a nice way to think about it. If this is our main model, maybe we should actually make it look like defining a function signature (perhaps complete with param and data annotations on function arguments)?

Because it’s probably not a good idea! ;-)

I understand the appeal, but notice how that doesn’t apply to function declarations in any language (that I know of at least) aside from lambdas. It’s always signature first and then definition.

If I flipped this around and argued that we should rewrite C++ so that function definitions looked like

input double x;
input double y;
double z = x * x + y * y;
output double sqrt(z);

would everyone still feel the same way?

Yes, and we are continuously working to find better pedagogical presentations for this. But the problem is not that the current syntax is an obstruction, in my opinion; the problem is that users come in with poor teaching and bad intuition. The beauty of the blocks is that they facilitate breaking those bad habits and reteaching the correct understanding of the probabilistic system.

Yes, but not because blocks are bad, but rather because users are coming in with false understandings about what a Stan program should be doing. I am afraid that the redesign will just facilitate users continuing on with those false understandings.

Sure, but I’m not so much interested in the internals as in how the language facilitates good Bayesian modeling. There’s always going to be an overhead there because good Bayesian modeling is hard and made even harder by decades of poor pedagogy, so users will always complain. Until they see the light. :-D

Another thing to keep in mind – the only reason we have a transformed data block is to avoid recomputing data-only calculations every time we evaluate the log target density. It wouldn’t be easy, if even possible, to statically evaluate a program in the proposal to determine which calculations are data only, evaluate them ahead of time, and then return the modified expression graph.
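
For example, the kind of data-only computation that gets hoisted into transformed data today (a sketch):

data {
  int<lower=0> N;
  vector[N] x;
}
transformed data {
  // evaluated once at construction, not on every log density evaluation
  vector[N] x_std = (x - mean(x)) / sd(x);
}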

Transformed parameters is less of an issue, but eliminating it would still require being able to identify every intermediate parameter and allow the services to filter only those intermediates desired by the user, or having some kind of annotation for saving intermediates.

I wasn’t advocating that! The problem is that there are classes of distributions (Markov random fields, determinantal point processes, etc.) that can’t be specified generatively and hence require multiple ~ calls, or multiple target += statements.
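
For example, a rough sketch along the lines of the usual pairwise-difference (ICAR-type) construction:

data {
  int<lower=2> N;
}
parameters {
  vector[N] phi;
}
model {
  // the distribution on phi is assembled from multiple increments, not one generative draw
  target += -0.5 * dot_self(phi[2:N] - phi[1:(N - 1)]);
  sum(phi) ~ normal(0, 0.001 * N);   // soft sum-to-zero constraint
}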

You clearly weren’t paying attention in the last workshop. ;-)

Seriously, though, we can always improve our pedagogy and we are consistently doing so! This could be another opportunity.

I would be happy to consider ways of making the functional nature more clear, although writing an actual signature would be a pain given how many parameters and data we typically have. In some sense the blocks are a better way of defining a function signature when the domains and images are high-dimensional.

I think it’d be like the first dataflow analysis task in the textbook - a conservative version could be a system that just “taints” potential transformed data stuff if its assignment is ever impacted by a parameter or by something already tainted by a parameter.