I’m re-sorting this in order of importance (to me, of course).
Therein lies the main tension we’re wrestling with, in my opinion.
It’s powerful, but its power comes from imposing limits. And when you write bigger programs, you run up against those limits.
Computer programs wind up being made of gazillions of little blocks. What we’re doing now is like forcing all the I/O to be fed through a single big `main()` function and imposing a no-I/O policy on the subunits.
I’m thinking about all of this very operationally. There’s a data context (a mapping from variables to values; sorry, I’m pre-repeating myself, as you’ll see later) and a parameter context (ditto), and the data declarations read from the data context while the parameter declarations read from the parameter context. That’s how it’s implemented under the hood, with readers in the constructor for data and in the log density function for the parameters.
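As a minimal sketch of that picture (the comments are my gloss on which context each declaration reads from):

```stan
data {
  real y;    // read from the data context once, when the model object is constructed
}
parameters {
  real mu;   // read from the parameter context at every log density evaluation
}
model {
  y ~ normal(mu, 1);  // increments the target log density
}
```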
That’s why I’m particularly concerned about separating out the declarations of the various data and parameter variables so that it’s no longer so clear what the function is. The way things are now, the signature of the data → parameters → log density function is laid out very clearly (modulo the implicit unconstraining transform and log Jacobian).
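In type terms, the signature I have in mind is roughly

$$
\textrm{lp} : \textrm{data} \rightarrow \textrm{parameters} \rightarrow \mathbb{R},
\qquad
\textrm{lp}(y)(\theta) = \log p(y, \theta) + \textrm{const},
$$

with the unconstraining transform and its log Jacobian applied implicitly on the parameter side.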
But I don’t see how that follows.
I’ve been looking at other probabilistic programming languages recently, and I find they present the languages operationally, as if `pyro.sample` or `pyro.observe` were simple Python functions. But then they perform variational inference, which clearly isn’t doing anything like naive sampling. You can find this thinking encapsulated in a section title in the Pyro tutorials, “Inference in Pyro: From Stochastic Functions to Marginal Distributions.”
I have the same objections you (@betanalpha) do to this confusion. In Stan, we just say the program is directly calculating a log density up to a constant. Viewed that way, it looks less like a probabilistic programming language and more like a function. And as I keep emphasizing, that’s still how I think of a Stan program: as a function from data to parameters to a log density.
The original motivation for using `~` came from BUGS, which used it to declare a graphical model. I’m not sure where the notation originally came from in stats, where people use it to lay out the components of generative models. I borrowed it from BUGS with a different interpretation, as an increment to the log density.
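For concreteness, the two statements below have the same effect on the target up to an additive constant (the `~` form drops constant terms):

```stan
x ~ normal(5, 2);                 // sampling-statement notation
target += normal_lpdf(x | 5, 2);  // explicit increment of the log density
```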
For the new compound declare-distribute syntax, the motivation for me comes not from BUGS, but from the compound declare-define statements in languages like C++. It’s just a convenient way of combining the declaration and use of a variable. Philosophically, I’m not trying to distinguish between
```stan
real x;
...
x ~ normal(5, 2);
```
and
```stan
real x ~ normal(5, 2);
```
other than that in the latter I don’t have to scan so far to find the use of the variable.
That’s what the spec will lay out: how it consistently computes a log density function.
Closures behave something like this: they grab their current context and produce a function. They’re essentially mappings from contexts (variable-to-value maps) to arguments to values.
In logical languages, you usually think of expressions or statements or other syntactic units as having free variables, which things like the quantifiers for-all and there-exists then bind off. That intermediate structure with free variables can be thought of denotationally as a function from contexts (mappings from variables to values) to values. That’s essentially what a closure does, too.
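As a sketch in the usual notation, writing $\rho$ for a context:

$$
\llbracket e \rrbracket : (\mathrm{Var} \to \mathrm{Value}) \to \mathrm{Value},
\qquad
\llbracket \lambda x.\, e \rrbracket(\rho) = v \mapsto \llbracket e \rrbracket(\rho[x \mapsto v]).
$$

An expression with free variables denotes a function from contexts to values; a closure just packages up the ambient $\rho$, leaving a plain function from arguments to values.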
I’m not sure what compositionality you’re talking about here. For the Stan 3 design, I think the issue is that the compositionality is one of variable contexts and target value, not of the joint density in any statistical sense. In order to compose program fragments, those fragments must be interpreted as relations between the contexts before and after the statement executes. This can all remain compositional when statements declare new variables; that’s the usual thing in programming languages. Stan’s unusual in blocking off all the declarations at the top, and that’s just because I was lazy when writing the parser, not out of any philosophical commitment to declarations-first.
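One way to make that precise, as a sketch (with the target accumulator treated as part of the context):

$$
\llbracket s \rrbracket \subseteq \mathrm{Ctx} \times \mathrm{Ctx},
\qquad
\llbracket s_1;\, s_2 \rrbracket = \llbracket s_1 \rrbracket \mathbin{;} \llbracket s_2 \rrbracket,
\qquad
\llbracket \texttt{real } x; \rrbracket = \{ (\rho,\ \rho[x \mapsto v]) : v \in \mathbb{R} \}.
$$

Sequencing is relational composition in diagrammatic order, and a declaration is the relation that extends the incoming context with a binding for the new variable.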