Dear Stan collaborators with programming language interests:
I had an experience in class the other day which gave me some insight into the whole issue of Stan being more “Pythonic” or “Rthonic.”
I’d always thought of “Pythonic”/“Rthonic” as just being a stylistic thing, with some users wanting a certain look and feel with their code. But now I realize it’s more than that.
What happened was that I had a homework problem where I gave them a simple statistical model and asked them to compute the posterior mode and curvature, and then construct and draw from the relevant normal approximation. When we were discussing the solutions, one of the students said that it was really easy: he just wrote the function in Python, then used Pytorch to compute gradients, then used some gradient-based optimizer.
Of course you can also use Stan to write an objective function, compute gradients, and do gradient-based optimization! But this wasn’t obvious to the students, because Stan appears to be a single bundled package.
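For a toy model, the mode-plus-curvature workflow the student described might look like the sketch below. This is not the actual assignment: the model, data, step size, and iteration count are all made up for illustration, and a hand-coded gradient stands in for what the student got from PyTorch’s autodiff.

```python
import math
import random

# Toy model: y_i ~ Normal(mu, 1), with prior mu ~ Normal(0, 10).
y = [1.2, 0.8, 1.5, 0.9, 1.1]

def grad_log_post(mu):
    # d/dmu of the log posterior: likelihood term plus prior term.
    return sum(yi - mu for yi in y) - mu / 100.0

# 1. Posterior mode by plain gradient ascent.
mu = 0.0
for _ in range(1000):
    mu += 0.01 * grad_log_post(mu)

# 2. Curvature: second derivative of the log posterior (constant here,
#    since the log posterior is quadratic in mu).
curvature = -(len(y) + 1.0 / 100.0)

# 3. Normal (Laplace) approximation: mean = mode, sd = sqrt(-1/curvature).
sd = math.sqrt(-1.0 / curvature)
draws = [random.gauss(mu, sd) for _ in range(4000)]
```

The same three steps work with any autodiff library in place of the hand-coded gradient, which is exactly the student’s point.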
So it struck me that what would really make Stan more Pythonic or Rthonic is modularity, for a Stan call to be less bundled.
In particular, what if we think like this:
A Stan program is instructions to compute a target and its gradient. Input data and parameters, output target and gradient vector.
Stan also comes with an optimizer: its inputs are data, initial values for the parameters, tuning parameters for the optimizer, and a number of chains; it outputs a solution for each chain along with some diagnostics.
Also NUTS etc.
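In code, that separation might look like this minimal sketch: the model is nothing but a function from data and parameters to (target, gradient), and the optimizer is a separate routine that consumes only that interface and knows nothing about the model’s internals. All names here are invented for illustration, not an actual Stan API.

```python
def log_prob_grad(data, params):
    # The "Stan program" role: input data and parameters,
    # output the target (log density) and its gradient.
    mu = params["mu"]
    target = -0.5 * sum((yi - mu) ** 2 for yi in data["y"])
    grad = {"mu": sum(yi - mu for yi in data["y"])}
    return target, grad

def optimize(model, data, inits, step=0.01, iters=500):
    # A generic gradient-ascent "optimizer": it only ever calls the
    # (target, gradient) interface, so it works with any model.
    params = dict(inits)
    for _ in range(iters):
        _, grad = model(data, params)
        for name in params:
            params[name] += step * grad[name]
    return params

data = {"y": [0.5, 1.5, 1.0]}
mode = optimize(log_prob_grad, data, inits={"mu": 0.0})
```

Swapping in NUTS instead of `optimize` changes only the second piece; the model function is untouched.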
I have a feeling that at this point, Bob will (correctly) reply that this is what Stan already is, so what’s my problem? And to this I’d reply that, somehow, this is not what Stan looks like or how we use it.
Here’s an example of how it could be useful to set things up:
I’m working in R or Python and I write a Stan program. I do this because it’s convenient, typically because the target function I’m working with has some statistical interpretation and Stan’s modeling language is a good way to express what I want to write. (This will be even more so in the future with ragged arrays and other features that are in the pipeline.)
Then I do something like foo <- stan_compile("foo.stan"), or the equivalent in Python.
I can now play around, for example:
a <- list(…)
b <- list(…)
foo(data=a, parameters=b) returns the target function and its gradient (and also does whatever foo.stan might do in its run, such as print statements).
hessian(foo, data=a, parameters=b) returns the second derivative matrix (which can be computed in Stan or else in R using numerical differentiation)
stan_optimize(foo, data=a, inits=…, tuning_parameters=…, num_chains=4)
stan_sample(foo, data=a, inits=…, tuning_parameters=…, num_chains=4)
summary(stan_sample(…)) would return basic inferences and convergence diagnostics etc.
All these would work in Python too, or for that matter from the command line.
The point is that, once we’ve done foo <- stan_compile("foo.stan"), or the Python equivalent, foo() just looks like an R function (or the Python equivalent).
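A toy mock shows how this could feel in practice. Everything here is hypothetical: `stan_compile` is faked in pure Python and just returns a hard-coded model function, and `hessian` takes finite differences of the returned gradient, in the spirit of the numerical-differentiation option mentioned above.

```python
def stan_compile(model_code):
    # Stand-in for the real compiler: returns a function mapping
    # (data, parameters) to the target and its gradient vector.
    def foo(data, parameters):
        mu, y = parameters["mu"], data["y"]
        target = -0.5 * sum((yi - mu) ** 2 for yi in y)
        grad = [sum(yi - mu for yi in y)]
        return target, grad
    return foo

def hessian(model, data, parameters, eps=1e-5):
    # Second-derivative matrix by central finite differences
    # applied to the gradient returned by the model function.
    names = sorted(parameters)
    H = []
    for name in names:
        up = dict(parameters); up[name] += eps
        dn = dict(parameters); dn[name] -= eps
        _, g_up = model(data, up)
        _, g_dn = model(data, dn)
        H.append([(a - b) / (2 * eps) for a, b in zip(g_up, g_dn)])
    return H

foo = stan_compile("foo.stan")  # the mock ignores the file contents
a = {"y": [1.0, 2.0, 3.0]}
b = {"mu": 2.0}
target, grad = foo(data=a, parameters=b)
H = hessian(foo, data=a, parameters=b)
```

Once compiled, `foo` really is just an ordinary function of the host language, which is the whole point.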
I’m wondering: would it make sense to build CmdStanR, CmdStanPy, etc., that way?
I’m sure I have some details wrong on this, but I think the basic idea here is super-important. I expect this is what Mitzi, Sean, Bob, etc., have been thinking about already, so it’s probably just a communication issue.
But I think it’s a very important communication issue so I’d like to figure this out. I think it will also be important for our workflow book, and also for people trying to see how Stan fits in the ecosystem with Pytorch and various autodiff libraries.