GPU Stan


I’m curious whether there is a future in which Stan is able to compute its lp in a GPU-accelerated fashion. GPUs are a no-go for smaller models, but in, for example, large-ish time series models there is enough parallelism, along at least one dimension, to benefit. For example, when fitting a very long nonlinear AR process, the AD lp gradient calculation would be dominated by the map and scan over the time dimension of the model, and this could be parallelized with a GPU. Is this something that could be accommodated within Stan, or am I better off attempting this by writing the code by hand?
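To make the "map over the time dimension" point concrete, here is a toy sketch (not Stan code, and the nonlinear AR(1) model is made up for illustration): the per-step likelihood terms depend only on the lagged value, so the sequential loop can be replaced by one vectorized, GPU-friendly map over all time steps.

```python
import numpy as np

# Hypothetical nonlinear AR(1): y[t] ~ normal(tanh(phi * y[t-1]), sigma).
# Given the lagged values, each time step's log-likelihood term is
# independent, so the "map" over t parallelizes.

def lp_loop(y, phi, sigma):
    """Sequential evaluation: one likelihood term per iteration."""
    lp = 0.0
    for t in range(1, len(y)):
        mu = np.tanh(phi * y[t - 1])
        lp += -0.5 * ((y[t] - mu) / sigma) ** 2 - np.log(sigma)
    return lp

def lp_vectorized(y, phi, sigma):
    """Same lp, but all conditional means computed in one shot."""
    mu = np.tanh(phi * y[:-1])
    z = (y[1:] - mu) / sigma
    return np.sum(-0.5 * z ** 2 - np.log(sigma))
```

On a GPU array library the vectorized form is the one that would actually get accelerated; the loop form is what the naive AD sweep looks like.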

(I’ve seen some discussions here on using OpenMP/MPI, but those APIs are significantly different from the GPU APIs.)



One of the big problems with GPU speedups for us is that we are almost certainly going to need more than single precision. And with our architecture it’s hard to preload and reuse data. So we’re first looking at low-data-flow, high-compute-cycle operations like Cholesky decomposition (quadratic data, cubic time).
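A rough illustration of both points (this is NumPy on the CPU, not Stan code): Cholesky touches O(n²) data but does O(n³) arithmetic, which is why it’s a good GPU candidate, and the factorization error in float32 is typically orders of magnitude worse than in float64 for a moderately ill-conditioned matrix.

```python
import numpy as np

# Build a symmetric positive-definite, moderately ill-conditioned matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
S = A @ A.T + 1e-2 * np.eye(200)

# Factor in double and in single precision.
L64 = np.linalg.cholesky(S)                     # float64
L32 = np.linalg.cholesky(S.astype(np.float32))  # float32

# Reconstruction error of L @ L.T against the original matrix.
err64 = np.max(np.abs(L64 @ L64.T - S))
err32 = np.max(np.abs(L32.astype(np.float64) @ L32.T.astype(np.float64) - S))
# err32 is typically several orders of magnitude larger than err64.
```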

See also the branch with working code and the corresponding pull request from Steve Bronder. The pull request isn’t ready to go, but we’re very, very keen to start GPU-enabling some of our big matrix operations.

We’re attacking the parallelization of higher-level algorithms, like running multiple ODE solvers in parallel, using OpenMP for multi-threading and MPI for multi-process. Sebastian Weber has examples of this running that give you nearly embarrassingly parallel speedups. And these ODE problems are some of our harder problems.

Then, underneath the ODE solvers, we can start exploiting GPUs again, but these tend not to be great GPU candidates, as they involve repeated application of relatively small matrix operations.

The trick in all of this is getting the automatic differentiation synced up. And I think we can do this all asynchronously with our autodiff design and only have to block for required input values (in other words, very much the same kind of high-level design as TensorFlow uses, as it’s the obvious way to cut a differentiable function apart [you see this going back decades in the autodiff literature]).
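The "block only for required input values" idea can be sketched with plain futures (a toy, nothing to do with Stan's actual implementation; the subgraph functions are hypothetical): two independent pieces of the function run asynchronously, and the combining node waits on their results only at the point it actually consumes them.

```python
from concurrent.futures import ThreadPoolExecutor

# Two independent subgraphs of a larger differentiable function.
def subgraph_a(x):
    return x * x

def subgraph_b(y):
    return 3.0 * y

# Launch both asynchronously; the sum blocks on the futures only
# when it needs their values, not before.
with ThreadPoolExecutor() as pool:
    fa = pool.submit(subgraph_a, 2.0)
    fb = pool.submit(subgraph_b, 5.0)
    result = fa.result() + fb.result()
```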


in other words, very much the same kind of high-level design as TensorFlow uses as it’s the obvious way to cut a differentiable function apart

What about STALINGRAD? Wouldn’t it also require an optimizing compiler for the graph, with a “load balancer” to distribute the calculation efficiently? Optionally, one could have an integrated nested Laplace approximation module and a neural network that learns when to use INLA versus “regular” NUTS.
Before anyone’s blood pressure hits the max: this is not criticism. It may be manageable by taking already existing libraries and combining them efficiently.



I don’t know anything about the Pearlmutter and Siskind implementation.

If you do reverse-mode autodiff, the condition is that all the parents of a node have to have passed gradients down before it can pass gradients down. This is solved in most systems by keeping a stack of the expression graph, which, if you build it up in evaluation order, is naturally topologically sorted. If you don’t do that, you can keep passing things down, but it becomes very inefficient, as you have potentially exponentially many paths to pursue in an unfolded graph. Like I said, there’s an enormous autodiff literature on this and on sparsity detection, both of which are critical (speaking of INLA!).
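The tape-as-topological-sort point fits in a few lines. This is a minimal sketch, not Stan’s actual implementation: nodes are pushed onto a tape as they are created, so the tape is sorted for free, and the reverse sweep visits each node exactly once, after all of its parents, even when a subexpression is shared.

```python
class Var:
    """One node in the expression graph; created nodes go onto a tape."""
    tape = []

    def __init__(self, val, parents=()):
        self.val = val
        self.adj = 0.0              # adjoint, accumulated in the reverse sweep
        self.parents = parents      # (operand, local partial derivative) pairs
        Var.tape.append(self)

    def __mul__(self, other):
        return Var(self.val * other.val,
                   [(self, other.val), (other, self.val)])

    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

def grad(output):
    """Reverse sweep: the tape's evaluation order is a topological order,
    so each node's adjoint is complete before it propagates to its
    operands — no exponential re-traversal of shared subexpressions."""
    output.adj = 1.0
    for node in reversed(Var.tape):
        for parent, partial in node.parents:
            parent.adj += node.adj * partial
```

For f(x, y) = x·y + x, x appears twice in the graph, yet the sweep handles it with a single pass: df/dx = y + 1 and df/dy = x.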

The only motivation we’d have for using INLA in applied work would be if full Bayes via MCMC is too slow. Same motivation as for variational inference, expectation propagation, and any other approximation. What we tend to do instead is work with people who claim only INLA can fit their models and then show them how to do it with Stan. That was the root motivation of the CAR case study Mitzi just did; she started with Max Joseph’s CAR case study and a working INLA model that failed to run in BUGS/JAGS. I think she’s going to build out some of the comparisons with INLA in a later case study.