We (Matthijs Vákár, Maria Gorinova, Mitzi Morris, Sean Talts, and Bob Carpenter) had a series of meetings last week about new directions for the Stan language based on Maria’s thesis on SlicStan (also a StanCon 2018 and POPL probabilistic programming languages workshop presentation). As a refresher, what’s nice about Maria’s approach in SlicStan is that it lets users write down generative models in the order in which the variables arise and then infers the block structure based only on which variables are declared to be data. SlicStan also allows parameters to be introduced in the bodies of functions, which enables things like non-centered reparameterizations to be encapsulated in a single function rather than cut-and-pasted across multiple blocks.
There were several drawbacks to the original SlicStan, including that functions had to be unrolled statically and there were no conditionals or while loops.
There was also no naming convention for parameters introduced inside of functions—they awkwardly had features of both global and local scope.
We met intensively to try to figure out how to do some of this stuff more cleanly. I talked about using something like macros with naming, which the programming language specialists shot down as ridiculously amateurish (they were more polite than that).
Matthijs was also struggling with the fact that these submodels or functions should provide both namespace-like and scope-like properties. Sean suggested pulling out this notion of submodel that can introduce parameters and separating it from ordinary functions.
Matthijs suggested an object-oriented design. Mitzi and I respectfully noted that most of our users don’t know objects from functions and would be very confused at manipulating constructors explicitly.
Throughout, Matthijs was stressing compositionality, which he roughly described as saying that the semantics of a program should be a function and that the bodies of submodels should behave just like models themselves. Basically, we can take the interpretation of a Stan program (or statement) S, written [[S]], to be a function on environments (mappings from variables to values) such that denotations compose for sequencing,
[[S_1; S_2]] = [[S_2]] \circ [[S_1]]
(Sorry, no proper denotation symbol built into LaTeX.)
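To make this concrete, here’s a minimal sketch of what such a denotation might look like (my gloss here, not anything we pinned down in the meetings). If the environment \eta carries the accumulated log density in a distinguished variable target, then a sampling statement just increments it:
[[x ~ normal(mu, sigma)]](\eta) = \eta[target \mapsto \eta(target) + \log \mathsf{Normal}(\eta(x) \mid \eta(mu), \eta(sigma))]
The sequencing equation then says that running S_1; S_2 means applying S_1’s update to the environment and then S_2’s.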
I wanted arguments to what was essentially going to be a constructor, but I didn’t want any methods. Then it dawned on all of us, seemingly at once, that we could just move the pieces around on the board (literally the board, but the moving was done with chalk and eraser).
To satisfy all this, we came up with a separate submodel design that has aspects of object-oriented design that should be relatively transparent and natural to users. I ran it by Andrew and Aki and they didn’t seem to object. (Nor did they whoop for joy, as I was somehow expecting. Ironically, really hard design problems that eventually get solved wind up looking too easy or even obvious in retrospect. Anyway, the users will thank us because the artifacts will all fit together smoothly.)
Example model
Now onto examples. I’m going to use a linear regression in which the coefficients get a hierarchical prior (it’s non-standard, but an easy example to illustrate what this is all about).
\mu^{\beta} \sim \mathsf{Normal}(0, 5)
\sigma^{\beta} \sim \mathsf{Normal}(0, 5)
\beta \sim \mathsf{Normal}(\mu^{\beta}, \sigma^{\beta})
\sigma^y \sim \mathsf{Normal}(0, 10)
y \sim \mathsf{Normal}(x \cdot \beta, \sigma^y)
Blocked, unfolded
The first version explicitly declares variables to be parameters, data, or generated quantities.
param real mu_beta ~ normal(0, 5);
param real<lower = 0> sigma_beta ~ normal(0, 5);
data int<lower = 0> K;
param vector[K] beta_z ~ normal(0, 1);
vector[K] beta = mu_beta + sigma_beta * beta_z;
data int<lower = 0> N;
data matrix[N, K] x;
param real<lower = 0> sigma_y ~ normal(0, 10);
data vector[N] y ~ normal(x * beta, sigma_y);
data int<lower = 0> N_tilde;
data matrix[N_tilde, K] x_tilde;
gq vector[N_tilde] y_tilde = normal_rng(x_tilde * beta, sigma_y);
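For orientation, here’s a rough sketch of how this program might desugar into today’s Stan, with each declaration sorted into its block (this mapping is my reconstruction of the intent, not a worked-out part of the design):
data {
  int<lower = 0> K;
  int<lower = 0> N;
  matrix[N, K] x;
  vector[N] y;
  int<lower = 0> N_tilde;
  matrix[N_tilde, K] x_tilde;
}
parameters {
  real mu_beta;
  real<lower = 0> sigma_beta;
  vector[K] beta_z;
  real<lower = 0> sigma_y;
}
transformed parameters {
  vector[K] beta = mu_beta + sigma_beta * beta_z;
}
model {
  mu_beta ~ normal(0, 5);
  sigma_beta ~ normal(0, 5);
  beta_z ~ normal(0, 1);
  sigma_y ~ normal(0, 10);
  y ~ normal(x * beta, sigma_y);
}
generated quantities {
  // vectorized RNG returns an array, hence the conversion
  vector[N_tilde] y_tilde = to_vector(normal_rng(x_tilde * beta, sigma_y));
}
Note how the non-centered parameterization winds up cut across the parameters, transformed parameters, and model blocks, which is exactly the cut-and-paste problem submodels are meant to eliminate.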
Unblocked, unfolded
The beauty of SlicStan is that it can infer the block level for variables by marking only the data variables (alternatively, it could’ve marked only the parameter variables). Variables that are assigned must be transformed data, transformed parameters, or generated quantities.
We’re going to go one step further and follow the grand BUGS tradition of letting the data tell us which variables are data. This isn’t necessary, but it will open up the possibility of using the same Stan program to run a model forward (from parameters to data) for simulation or in reverse (from data to parameters) for inference. We just erase the optional block qualifiers.
real mu_beta ~ normal(0, 5);
real<lower = 0> sigma_beta ~ normal(0, 5);
int<lower = 0> K;
vector[K] beta_z ~ normal(0, 1);
vector[K] beta = mu_beta + sigma_beta * beta_z;
int<lower = 0> N;
matrix[N, K] x;
real<lower = 0> sigma_y ~ normal(0, 10);
vector[N] y ~ normal(x * beta, sigma_y);
int<lower = 0> N_tilde;
matrix[N_tilde, K] x_tilde;
generated quantity vector[N_tilde] y_tilde = normal_rng(x_tilde * beta, sigma_y);
If K, N, x, y, N_tilde, and x_tilde are supplied as data at runtime, inference marks mu_beta, sigma_beta, beta_z, and sigma_y as parameters and beta as a transformed parameter.
Unblocked, folded with submodels
Submodels will be a structure for introducing pieces of a model, including new parameters. They may have declared arguments that are passed into them (rather than declared locally). For example, the hierarchical normal prior with non-centered parameterization will look like this:
model hier_normal_ncp() {
real mu ~ normal(0, 5);
real<lower = 0> sigma ~ normal(0, 5);
int<lower = 0> K;
vector[K] beta_z ~ normal(0, 1);
vector[K] beta = mu + sigma * beta_z;
}
The rest of the code involves a submodel for the regression itself, which takes the regression coefficients as an argument (and uses their size to determine the size of the data matrix x):
model linear_regression(vector beta) {
int<lower = 0> N;
matrix[N, size(beta)] x;
real<lower = 0> sigma ~ normal(0, 10);
vector[N] y ~ normal(x * beta, sigma);
}
Prediction takes two arguments for the coefficients and the error scale. Both are needed to make calibrated predictions.
model lr_predict(vector beta, real sigma) {
int<lower = 0> N;
matrix[N, size(beta)] x;
vector[N] y = normal_rng(x * beta, sigma);
}
Then they’re just instantiated to provide a model.
model coeffs = hier_normal_ncp();
model lr = linear_regression(coeffs.beta);
model hat = lr_predict(coeffs.beta, lr.sigma);
Notice how the structure defined for the submodels (like coeffs) allows meaningful naming in the regression (coeffs.beta). We’ll go with standard object-oriented syntax. What’s really happening here is that each model is an object and the . is just a member variable accessor. All the variables declared in the submodel structure are member variables, just like all the variables in a top-level Stan program.
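To see the compositionality at work, here’s my guess at how the compiler might flatten these instantiations back into the unfolded program, prefixing each submodel’s variables with its instance name (the exact naming scheme is something we haven’t settled):
real coeffs.mu ~ normal(0, 5);
real<lower = 0> coeffs.sigma ~ normal(0, 5);
int<lower = 0> coeffs.K;
vector[coeffs.K] coeffs.beta_z ~ normal(0, 1);
vector[coeffs.K] coeffs.beta = coeffs.mu + coeffs.sigma * coeffs.beta_z;
int<lower = 0> lr.N;
matrix[lr.N, size(coeffs.beta)] lr.x;
real<lower = 0> lr.sigma ~ normal(0, 10);
vector[lr.N] lr.y ~ normal(lr.x * coeffs.beta, lr.sigma);
int<lower = 0> hat.N;
matrix[hat.N, size(coeffs.beta)] hat.x;
vector[hat.N] hat.y = normal_rng(hat.x * coeffs.beta, lr.sigma);
This mirrors the unblocked, unfolded version above, with each submodel’s variables qualified by its instance name (e.g., hat.x plays the role of x_tilde).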
We’ve achieved the compositionality that Matthijs was after with an implicit form of his OO design.
To do: inferring GQs
One thing we didn’t figure out is how to infer that a distribution can be implemented with an RNG in the generated quantities block. This is going to be an obstacle to reversing models that involve discrete data because we can’t generate them in Stan because we don’t have discrete parameters. But if we could recognize they could be generated with RNGs as in the generated quantities, we’d be home free.
What we do know is that the parameters \alpha can be generated as generated quantities if the posterior factors as
p(\alpha, \theta \mid y) = p(\theta \mid y) \cdot p(\alpha \mid \theta).
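The prediction above is an instance of this pattern with \alpha = \tilde{y}, because
p(\tilde{y}, \theta \mid y) = p(\theta \mid y) \cdot p(\tilde{y} \mid \theta),
which is why y_tilde can be moved to generated quantities and drawn with normal_rng after \theta = (\beta, \sigma^y) has been sampled.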
But we’re not sure how to do that given the Turing completeness of Stan’s imperative language for statements.
Limited graphical sublanguage
For graphical models, we can infer exactly this kind of conditional independence. That gives us even more motivation for rethinking graphical model inference.
Architecture
We’re definitely going to build a proper intermediate AST this time. My trying to roll together a C++ well-typed AST (with variant types) was me being more theoretical than practical. Yes, it works, but it’s not easy to work with. Instead, Sean’s convinced us to go to a more s-expression like intermediate representation (think Lisp with the function name being the root node in a tree).
The sky’s the limit while we’re working on paper. Those whacky programming language mavens want to work in F# or OCaml. I have to say that Maria’s thesis code looks awfully clean (it’s in F# and there’s a Lex/Yacc variant that makes the grammars look like pure BNF). We’ve found Spirit Qi to be rather rough going in its demands on C++ and on programmers. But not working in C++ is going to present an obstacle to releasing in R or Python.
We also spent some time talking about redesigning the model concept for better I/O. But that’s another post.