Generalizing the Stan language for Stan 3

If that could be done to allow for partial evaluation, then we’d be able to drop the transformed data block entirely and at least get that simplification.

Is there an equivalent backpropagation that could be used to check whether variables influence the target or not? Then we could avoid computing hanging branches of the expression graph, which would allow for some clever dynamic model specifications (e.g., specify various sequences, then use if/then with data variables to determine which sequences modify the target, and evaluate only the relevant ones). I imagine that it’s still best to keep generated quantities on its own.

Yes. Mitzi’s just finishing up refactoring the underlying type system to open the way for new types beyond int, real, vector, row_vector, and matrix. Tuples will be first.

Tangential, but this is exciting.

The beauty of the blocks is that they facilitate breaking those bad habits and reteaching the correct understanding of the probabilistic system.

That’s true.

This is very true.

Yeah, that’s mostly not true.

My opinions on some of this stuff have evolved now that I have a better idea of why some of the proposed changes were proposed (i.e., now that I know more, old me was wrong, specifically on point 2), so here’s where I stand (roughly in order).

(Also because, reading back, I sound a lot more negative towards this proposal than I actually am. Opinions on the internet, eh?)

  • I think whatever mechanism is needed to automate things like non-centering would be a massive benefit. Whatever else I end up disliking about the changes, I’ll take it just to have that ability in the language. I understand Mike’s objection to doing this, but I don’t agree with it. It breaks an abstraction by changing the parameter space, but in breaking it, we make life much easier for users and make model code much more readable. You’ve got to make the purity vs. usability tradeoff somewhere.
  • I’ve changed my mind (i.e., I was wrong before) on declarations like real<lower=0> std ~ normal(0,1);. I don’t think my previous objection was based in reality, but on my own idiosyncratic interpretation of things (in particular, mapping coding style onto model building, which probably isn’t the way other people think). People will not treat priors the way I like regardless of how this is expressed in the language, so I don’t see any reason not to make it convenient for people.
  • I like the idea of relaxing the structure somewhat. I personally think the proposal goes too far and it will be very hard to read Stan programs to extract the models (no harder than BUGS programs, but also no easier).
  • I think allowing sampling declarations in the parameters block would go far enough towards the concept of “things should be defined close to where they’re declared” to be useful. In my mind, there’s a legibility cost to moving these things closer together, but I like that tradeoff; each to their own.
  • What do people think of merging the data & transformed data blocks and the parameters & transformed parameters blocks, and allowing the blocks in any order? Again, it puts some definitions near declarations, but it keeps some of the classic Stan structure.

I’m really looking forward to the 2.0 of this proposal. I don’t think anyone will be completely happy with whatever happens (because no one is completely happy with how things stand). But I think there’s a lot of good stuff in 1.0 (unsurprisingly, because @Bob_Carpenter ain’t no slouch)!

I mean the difference between the untyped and simply typed lambda calculus. In the former, you can apply a function to itself, whereas in the latter you can’t.

BUGS enforces this. It also prohibits variable reassignment: every variable gets assigned to exactly once. I think this might be something that would be better handled with a warning in some kind of nanny (lint) mode.

I was the one who designed it that way for exactly that reason in the first place. I’m not sure what pedagogy argument we’re talking about. Teaching programming is hard, especially with a quirky language like Stan.

The problem with this from a practical standpoint:

input double x;
input double y;
double z = x * x + y * y;
output double hypot = sqrt(z);

isn’t that it doesn’t clearly delineate a signature, it’s that it’s longer than the natural code:

double hypot(double x, double y) {
  return sqrt(x * x + y * y);
}

But I think it’s clear what it’s doing.

It used to be the case that people would try to declare their local variables at the top of a block and always return from the last statement in a function definition. But C and other languages are perfectly happy letting you mix definitions and declarations in the bodies of functions.

double hypot(double x, double y) {
  double x_sq = x * x;
  double y_sq = y * y;
  return sqrt(x_sq + y_sq);
}

As I said, I originally designed the blocks the way you describe exactly because I think of the data declarations as arguments to the constructor and the parameter declarations as arguments to the log density function. I’m all for making that clearer.
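
To make that concrete, here’s a rough C++ sketch of that mental model (hypothetical class and member names, not the code Stan actually generates): the data declarations play the role of constructor arguments, and the parameter declarations play the role of arguments to the log density function.

#include <cmath>
#include <vector>

// Sketch only: data block ~ constructor arguments,
// parameters block ~ arguments to the log density function.
class normal_model {
  std::vector<double> y_;  // data: fixed once at construction
public:
  explicit normal_model(const std::vector<double>& y) : y_(y) {}

  // parameters (mu, sigma) come in on every log density evaluation
  double log_prob(double mu, double sigma) const {
    double lp = 0;  // unnormalized log density
    for (double yn : y_)
      lp += -0.5 * std::pow((yn - mu) / sigma, 2) - std::log(sigma);
    return lp;
  }
};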

I think the transformed data, transformed parameters, and generated quantities blocks are causing a lot of confusion among users. Maybe we can solve it with education.

The main problem (and it is a problem) that Andrew brought up is that variable use can be a long way from its declaration (this can also be a problem with C++ function arguments).

It’s actually pretty easy. The provenance of each variable is tracked. So it’d be easy to figure out what is transformed data vs. transformed parameter. Generated quantities are a different story, but we can also see which things aren’t used in the model.

Yes. It’s all possible with relatively simple graph algorithms on the AST.
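
As a rough illustration of the kind of graph algorithm involved (a hedged sketch, not code from the Stan parser): treat each variable or expression as a node with edges to the things it is computed from, then walk backwards from the target. Anything not reached is a hanging branch that never needs to be evaluated for the log density, and the same bookkeeping separates things reachable only from data (transformed data) from things that depend on parameters (transformed parameters).

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical dependency graph: deps[v] = variables v is computed from.
using Graph = std::unordered_map<std::string, std::vector<std::string>>;

// Collect every variable the target depends on, directly or indirectly.
std::unordered_set<std::string> influences_target(const Graph& deps) {
  std::unordered_set<std::string> seen;
  std::vector<std::string> stack{"target"};
  while (!stack.empty()) {
    std::string v = stack.back();
    stack.pop_back();
    if (!seen.insert(v).second) continue;  // already visited
    auto it = deps.find(v);
    if (it == deps.end()) continue;        // leaf: declared data or parameter
    for (const auto& d : it->second)
      stack.push_back(d);
  }
  return seen;  // anything not in this set is a hanging branch
}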

Agreed. Michael’s the outlier on this one. The question is really just how “clean” you want this functional abstraction to be. Remember, C++ lets you do this:

double foo(double x = 3, int y = 10) {
  ...
}

The problem with merging is that we need a way to specify among the data variables which is input and which is computed; same for parameters. Barring that, we’ve talked about removing transformed parameters and just putting them in the model directly.

Allowing the blocks in any order is hard because of scope. As is, things have to be defined before they’re used. This wasn’t true with BUGS because it was just defining the graphical model.

I really was just throwing this out to see what ideas people liked or didn’t like. I wasn’t expecting such a strong reaction all around. Thanks for all the feedback.

And don’t worry, any changes to the language are only going to happen with a lot of discussion and warning.

My own reactions are pretty much in line with everyone else’s, which is why they’re so conflicted. I do not want to go back to something unstructured like BUGS, yet I see our users getting deeply confused by the current block structure. I also find the current structure overly verbose, but then I don’t like the terser one as much (perhaps because I’m so used to our current programs).

My main problem in reading BUGS models wasn’t the lack of structured blocks, it was that I couldn’t tell the types of the variables and I couldn’t tell what was data and what was a parameter (I know the latter is determined dynamically after seeing the data).

But don’t you have to do this anyway to implement the proposal? Or is this in some way harder?

The blocks are named, so couldn’t you reorder them internally? (I’m thinking of things like putting the functions block at the end sometimes, not crazy shuffling of the blocks!)

No, the data variables are declared as such. With Stan now, the problem is how we make the break between variable declarations and statements. Right now, the first statement that is not a declaration marks the end, and no declarations are allowed after that.

Maybe we could just take variables that are never defined to be the data that needs to be read in? It all seems a little loose to me, but I haven’t thought about it much.

We could do that as a pre-processing stage, but I don’t see the point. As is, the functions need to come first so that they’re defined when they’re used.

A lot of this is just an artifact left over from having a one-pass parser (C/C++ are terrible in terms of ordering mattering; I got so spoiled by Java).

What I do want to do is allow more general function definitions using closures. They’d implicitly have access to all variables in scope when they were defined. (Static lexical scoping in CS terms.)
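
For comparison, here’s the analogous behavior in C++ (just an analogy, not proposed Stan syntax): a lambda that captures its enclosing scope can see the variables that were in scope where it was defined.

#include <iostream>
#include <vector>

int main() {
  std::vector<double> sigma = {1.0, 2.0, 4.0};
  // The closure implicitly has access to sigma because sigma was in scope
  // where the lambda was defined (static lexical scoping).
  auto scaled = [&sigma](int i, double z) { return sigma[i] * z; };
  std::cout << scaled(2, 0.5) << "\n";  // prints 2
  return 0;
}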

I might be in the minority, but one of the things that I’ve really come to love about the more abstract mathematical notation for a function, f : X -> Y, is that it immediately defines the signature first and foremost. I really, really like that the blocks reinforce this function-signature-first approach.

Additionally, it is my opinion that any benefit from “use close to where you declare” is not worth the potential confusion of new users mistaking real x ~ normal(0, 1); for a declare-and-assign. It’s almost baiting new users to make that mistake.

In any case, I don’t think there will be much debate about removing transformed data and transformed parameters if we can do enough static analysis in the parser!

This is an important thing to ponder.

When I started using Stan I assumed this was the case. If we’re fixing things we don’t like about the language, I think we should look into whether the parser can throw an error (or a warning) if ~ is used twice for one variable. Is this one of those things that would be impossible to implement?

We’ll always have target+= for the cases where you need to do delicate things, but I don’t really understand the logic for taking a symbol away from its mathematical context. (Incidentally, the syntax is “baiting new users to make that mistake”, not anything around where it’s used)

I completely agree this is the major concern. I have a hard time judging what can and can’t be corrected by teaching. I, of course, am not going to get confused that the compound sampling statement is a declare and define.

On the other hand, I really thought you’d like that you can put the modeled data after the parameters—seems more generative to me, but then again, I can just see right through these abstractions to what’s really going on.

Category-theoretic Stan would be such good publication bait. Of course, then even I wouldn’t really understand it.

We’ve talked about this and other lint-based approaches before. There’s always a trade-off with warnings, and right now I’m not sure we’re on the right side of it because we have too many false positives.

The challenge is that the problem is undecidable due to the halting problem. We can’t even evaluate statically whether a statement will be reached based on the input data, so we can’t tell if it’ll be reached twice. It’s the exact same problem to decide if a variable has zero priors, exactly one prior, or multiple priors.

For example, with loop bounds,

for (n in 1:f(x))
  y[n] ~ ...
y[N] ~ ...

we need to know if f(x) is the size of y less one, so that each variable gets exactly one distribution.

if (g(x))
  y ~ ...

Same thing, only now we need to know if g(x) evaluates to true.

So the trick’s going to be finding a condition under which we can throw a warning. Clearly, if the same variable shows up on the left of ~ twice outside of any loop or conditional, with no indexes, we could flag it.
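
A hedged sketch of that heuristic (the struct and field names here are invented for illustration, not the real AST types): walk the sampling statements, skip anything inside a loop or conditional or with an indexed left-hand side, and warn when the same bare variable shows up on the left of ~ more than once.

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Invented stand-in for a sampling statement pulled off the AST.
struct SamplingStmt {
  std::string lhs_variable;        // variable on the left of ~
  bool inside_loop_or_conditional;
  bool lhs_is_indexed;             // e.g. y[n] rather than y
};

// Warn when a bare variable gets ~'d more than once at the top level.
void lint_duplicate_priors(const std::vector<SamplingStmt>& stmts) {
  std::unordered_map<std::string, int> count;
  for (const auto& s : stmts)
    if (!s.inside_loop_or_conditional && !s.lhs_is_indexed)
      ++count[s.lhs_variable];
  for (const auto& kv : count)
    if (kv.second > 1)
      std::cout << "warning: " << kv.first << " appears on the left of ~ "
                << kv.second << " times\n";
}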

Now at run time we could start keeping track, but that’d induce some performance overhead.

I think worst case / most simply: because our code has no side-effects, we could definitely do a single evaluation of the model block and see how many times a variable gets ~'d to before we start sampling.

That’s a way to go, but I don’t think it’s going to be simple, though it may still be the simplest thing to do.

It’s still going to be a heuristic, because there’s no guarantee the same log density statements get evaluated each iteration.

And we still need to do things like fish variables out of expressions, where we can’t really tell if they’re on the left or right of the ~. For instance, I can write

y ~ normal(mu, 1);

or equivalently

target += -0.5 * (y - mu)^2;

With the target increment form, we can’t tell if y or mu is on the LHS, as mu - y gives equivalent results.

What’s the relevance of not having side effects? Technically, all the math lib stuff has side effects on the global autodiff stack, but it only persists through a single log density evaluation.

I’ve read some literature that’s reluctant to execute code as part of analysis because it could contain side-effects. But with our code we are free from that class of worry, though as you mention the autodiff facilities wouldn’t work with partial or out-of-order evaluation (but that seems fine for this and some other analyses). You’re right though that it will always be a heuristic in a Turing-complete language.

At some point I’d love to just dive in and start looking at the AST for these expressions. Is there a human readable intermediate representation I could print out?

Another option would be to do what many languages do for “immutable variables” and treat ~ as something akin to const in C++: something you must assign at declaration time and can’t reassign. Obviously this isn’t backwards compatible, though.
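
For reference, the C++ behavior being compared to (ordinary C++, nothing Stan-specific): a const variable has to be initialized where it’s declared and can’t be reassigned afterwards.

int main() {
  const double sigma = 2.5;  // must be initialized at declaration
  // sigma = 3.0;            // error: assignment of read-only variable
  return 0;
}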

Creating such a thing might be a good start. It’s fairly well organized in the lang/ast directory.

If you look at the expression graph, you can change the evaluation order wherever there are no dependencies; it just has to respect the topological ordering of the graph. There’s a huge literature on static analysis for sparse Jacobians and Hessians.
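
A minimal sketch of that constraint, with invented names rather than anything from the Stan code base: Kahn’s algorithm yields an order in which every node comes after everything it depends on, and any such order is a legitimate way to evaluate the expression graph.

#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

// deps[v] = nodes that v depends on; returns nodes in a dependency-respecting
// evaluation order (each node appears after everything it depends on).
std::vector<std::string> evaluation_order(
    const std::unordered_map<std::string, std::vector<std::string>>& deps) {
  std::unordered_map<std::string, int> remaining;   // unmet dependency counts
  std::unordered_map<std::string, std::vector<std::string>> dependents;
  for (const auto& kv : deps) {
    remaining[kv.first] += 0;                       // make sure the node exists
    for (const auto& d : kv.second) {
      ++remaining[kv.first];                        // one more unmet dependency
      dependents[d].push_back(kv.first);            // d is needed by kv.first
      remaining[d] += 0;                            // make sure leaves exist
    }
  }
  std::queue<std::string> ready;
  for (const auto& kv : remaining)
    if (kv.second == 0) ready.push(kv.first);
  std::vector<std::string> order;
  while (!ready.empty()) {
    std::string v = ready.front();
    ready.pop();
    order.push_back(v);
    for (const auto& w : dependents[v])
      if (--remaining[w] == 0) ready.push(w);
  }
  return order;  // comes back short if the graph has a cycle
}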
