Prior initialization closer to declaration

seantalts · February 17, 2017, 10:17pm

We talked on Thursday about a potentially new way to add prior information inside the parameters block (to be closer to their declaration), and Bob brought up a restricted initialization syntax looking something like real x ~ normal(0, 1);. I think Andrew and Michael both didn’t like this but didn’t go much into why. Anyone know what the reasoning against that syntax is?

betanalpha · February 19, 2017, 1:22am

My argument is one of abstraction/encapsulation. Ultimately the Stan modeling language specifies a posterior density function with respect to some Lebesgue measure, pi(parameters | data). In order to specify such a density we need to define the parameter space, the data space and a point in that space (i.e. the values obtained in a measurement), and then the functional relationship between the two. The beautiful aspect of the language right now is that each of these specifications match up to programming blocks, (data and transformed data; parameters; transformed parameters and model). Embracing this perspective (define two inputs and the one output) has been very successful when teaching courses.

Specifying priors in the parameters block completely breaks this abstraction and makes it much less clear what the language is doing.

There are a slew of technical issues (often no canonical prior/likelihood decomposition, complexity of code that can be executed outside of model block, etc) but I find breaking the really nice abstraction enough of a reason to avoid anything like this.

Bob_Carpenter · February 20, 2017, 6:31am

I see the argument for both sides.

It’s nice to have uses of variables close to their declarations. It requires less vertical scanning in the code, as Andrew pointed out. This is a big deal in my opinion. But it also led me to ask why we wouldn’t put the modeled data sampling statements in the data blocks (other than that they depend on yet-to-be-declared parameters).

I don’t like multiple ways to do the exact same thing. In this case, the sampling statement’s just going to have the exact same effect as if it’d been in the model block.

I also don’t like the idea of just allowing arbitrary sampling statements in the parameters block. I was suggesting something like

real y ~ normal(0, 1);

which lets you write arbitrary independent priors (you can pack computations into a function if necessary) and also some non-independent ones because later parameters may depend on earlier ones.

What Andrew wrote down was more general:

real y;  y ~ normal(0, 1);

With this, there’s nothing to stop a user from writing their whole model in the parameters block.

So I think Michael’s argument is more that we want to keep the density defining statements (on the constrained scale) together in the model block. Part of the density is defined by the constraints on the parameter declarations.

This is related to the issue of whether to allow sampling statements in the transformed parameter block. My thinking is that it’d be a natural place to put a Jacobian. Michael’s reaction was the same as to this proposal, and we eventually decided not to allow it.

syclik · February 20, 2017, 2:13pm

I also see the argument for both sides. I’m on the side of not allowing it.

I completely agree. The language is flexible enough to do things in multiple ways, but we should reduce the complexity and decouple the purpose of code in different places. I really like that the language has purpose and scope and that we declare data, parameters, and define a log joint probability distribution function up to an additive constant. I like that these purposes line up with the different blocks in the language.

seantalts · February 20, 2017, 6:17pm

Do most users have the same measure-theoretic mental model when they’re using Stan? Is there potentially some place for a higher-level language or library on top of Stan that lets users think in terms of priors and a bayesian modeling process? Seems like maybe Andrew and Michael have different mental models they prefer and that each one could deserve its own mode of expression… In that case maybe Stan is the mathematically elegant, more general measure-theoretic thing and someone could build some kind of social-science-friendly PGM (or something) modeling library on top of it? I’m coming at this from a pretty naive outsider’s perspective so I might be totally off, sorry about that. My general thinking here is that modeling problems as layers of languages at different levels of abstraction is a pretty nice paradigm, something I inherited from Dave Ferrucci.

andrewgelman · February 21, 2017, 1:25am

Hi, Sean, I don’t think my mental model is different than Michael, it’s just a tradeoff. Allowing priors near the declarations can make models more compact and easier to read and encourage the use of priors (that’s why I like the idea), but it can also possibly confuse people (that’s what Michael doesn’t like). Nothing to do with mathematical elegance or measure theory or social science.
A

betanalpha · February 21, 2017, 1:36am

Right – no appeal to measure theory here, just the definitions of Bayesian inference! The fact that we’re defining a probability density over an implicit Lebesgue measure means that we’re hiding all kinds of detail from the user already for the sake of comprehension (if shallow comprehension).

In case it helps, I think there’s another way of posing my concerns. Often times probabilistic programming is defined as a programming language where random variables are first class objects. Now this definition is already a mess because random variable is a loaded term, but what the definition is trying to get at is that the type system is able to differentiate between parameters and not parameters (which we colloquially call data). In Stan we don’t actually do this with explicit types but rather the scoping of the parameters block, which I believe is the most intuitive and pedagogical approach. I think that moving sampling statements, or really any statements, into the parameters block is really a move towards defining parameter with a distinct type as other probabilistic programming languages do. I’m less opposed to making the jump completely (although I’d still prefer the status quo) but I think that making a partial jump just confuses things.

andrewgelman · February 21, 2017, 2:32am

Yes, hopefully we can help the discussion move forward with some examples.
A

andrewgelman · February 22, 2017, 8:01pm

Hi, I think I prefer this:

vector[4] beta; beta ~ normal(0,1);

rather than this:

vector[4] beta ~ normal(0,1);

My problem with the second formulation is that I might want something different; for example,

vector[4] beta; beta[1] ~ normal(0,10); beta[2:4] ~ normal(0,1);

I think this kind of thing would be very useful for experienced users such as myself and also for novices. For example, a novice might write:

vector[4] beta;

And then another novice could immediately see the problem and say, “Hey, where’s your prior on beta?”

Or, a novice would write:

real beta; beta ~ uniform(2,4);

And it would be super-clear that they’re using this uniform prior without constraining the parameter.

So, lots of advantages.

A

sakrejda · February 22, 2017, 11:31pm

I would prefer to keep the language as is because the advantages I’m seeing described could also be handled by a text editor plugin. For example an editor could highlight where your parameters appear in the model block and flag parameters that never appear in the model block. I know we don’t have a DAG so we can’t say much, but we can definitely deal with highlighting missing priors from the AST constructed by the stan compiler.

Additionally I think the current organization even if a little artificial makes models easier to read—I don’t have to try and figure out if the user has scattered their model over the parameters/transformed parameters/model block. I think we underestimate in general how hard it is for new users to grasp the model when it’s scattered like that.

Krzysztof

Bob_Carpenter · February 23, 2017, 6:59am

You can tell if a parameter is used somewhere in the model, but it’s harder to tell if it gets a prior. There are target += statements where it can show up in arbitrary function calls.

Bob_Carpenter · February 23, 2017, 7:23am

We’re definitely all on the same page as to how the programs get interpreted statistically and computationally.

This post from Michael and talking to Sean today reminded me of my perspective when first laying this out, which were largely derived from a combination of looking at BUGS and MacKay’s deifnition of HMC in his book, which is super clear in terms of what it needs to operate (all the algorithms in that book are very easy to read, which is one of the main reasons I like it so much).

From the HMC perspective, what you need is a log density function to define the potential energy.

BUGS gives you a language that can be translated to log densities. I just interpreted the sampling statements in BUGS as giving you log density increments. I hadn’t really thought at all beyond graphical models at that point.

What I never liked about BUGS models is that it was a guessing game to figure out what the types of variables were (oh, this one shows up on the left of a Poisson, so it must be an integer; oh, there’s an index, so it must be an array). I’ve always believed in strong static typing for languages. I thought we’d get a ton of pushback on this, but the main complaint is that there are three 1D array structures (real[], vector, and row_vector).

But I only wanted the user to have to think about shape. Variables should behave the same way everywhere, just be declared in different places. As Michael just pointed out, that’s very different in some of the other languages where the type of an object depends on its usage (data or parameter) and in BUGS where the language never specifies what the variable is going to be used for—that’s only figured out dynamically when the data is read in.

So what I needed to do was somehow define that log density function. Throughout MCMC, the data remains constant and the parameters vary. The parameters vary, but they’re fed in by the sampler (or optimizer) from the outside. The services the model needed to provide were those of an object storing data as member variables and providing a method to compute the log density.

From that object-oriented perspective (which is exactly how it’s coded in C++, by the way), the data block variable declarations give you the member variable declarations, which are the same as the arguments to the construct. The entire model class is immutable—once you construct it with data, nothing in it changes and it’s stateless.

The variable declarations in the parameters block give you the arguments to the log density function and the model block gives you the body of the log density function. The return value, the log density that gets accumulated, is implicit.

That’s all I really did in the first version and it just ran a multivariate normal with fixed covariance and mean. Then I did some transforms manually and had to learn about Jacobians. At that point, it became pretty clear how to do the transforms to the unconstrained scale to simplify our sampling and optimization algorithms.

Once we had data and parameters, it was easy to define derived quantities that depended only on data (transformed data) or on parameters (transformed parameters). My thinking about those came straight out of Andrew and Jennifer’s variable classification in their regression book. In fact, you can still see some of their terminology floating around the AST and grammars under the hood.

Generated quantities then seemed the natural way to do posterior predictive checks in a way that could be computed once per iteration rather than once per leapfrog step and with double types rather than autodiff types.

That’s really all there is. Everything else is just basic imperative programming language constructs, which sort of fell out when we realized we weren’t using the graphical structure anywhere. We can write whatever we want in the model block as long as it’s equal to the log posterior up to an additive constant that doesn’t depend on parameters. So you can unfold some conjugate priors and code up posteriors directly for more efficiency in some cases. That only occurred to me much much later. I started thinking it was just like BUGS, only it translated the sampling statements to log density increments rather than edges in a graphical model.

syclik · February 23, 2017, 1:32pm

+100

Mind posting that on a blog? It’s what I know, but can’t explain well and the way you wrote it was very clear.

andrewgelman · February 24, 2017, 1:25am

Hey, that’s a fine blog post!

Bob_Carpenter · February 24, 2017, 10:15pm

Scheduled for Tuesday at 3 PM.

Topic		Replies	Views
Generalizing the Stan language for Stan 3 Developers features	56	5114	September 1, 2017
Stan++/Stan3 Preliminary Design Developers	97	5739	June 12, 2018
Order of statments in model block Modeling	2	738	January 18, 2018
Newbie questions on how Stan weights the likelihood by the prior + implications for visualization Modeling brms	17	1021	June 23, 2023
'Pedantic Mode' is ready for your testing and feedback General language	29	2944	May 22, 2020

Prior initialization closer to declaration

Related topics