Stan on the GPU

As to your big question, the title of my post might give it away:

http://andrewgelman.com/2017/03/15/ensemble-methods-doomed-fail-high-dimensions/

This “100-fold speedup” thing is slippery. 100-fold speedups are already achievable using @wds15’s OpenMP parallelization of multiple ODE solver calls. That’s just parallelizing the likelihood in a Bayesian model.
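To make that concrete, here’s a toy sketch of the pattern in Python rather than Stan’s actual C++/OpenMP code: each subject’s ODE solve is independent given the parameters, so the per-subject likelihood terms can be computed in parallel. The model, data, and function names here are all made up for illustration.

```python
# Sketch only (not Stan's implementation): parallelizing a likelihood
# built from independent per-subject ODE solves.
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from scipy.integrate import solve_ivp
from scipy.stats import norm

def subject_log_lik(args):
    theta, t_obs, y_obs = args  # per-subject decay rate and observations
    # Solve a toy exponential-decay ODE for this subject
    sol = solve_ivp(lambda t, y: -theta * y, (0.0, t_obs[-1]), [10.0],
                    t_eval=t_obs)
    # Gaussian measurement error around the ODE solution
    return norm.logpdf(y_obs, loc=sol.y[0], scale=0.5).sum()

def log_lik(thetas, data):
    # Embarrassingly parallel over subjects: one ODE solve per worker
    with ProcessPoolExecutor() as pool:
        parts = pool.map(subject_log_lik,
                         [(th, t, y) for th, (t, y) in zip(thetas, data)])
    return sum(parts)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.5, 5.0, 8)
    data = [(t, 10.0 * np.exp(-0.7 * t) + rng.normal(0, 0.5, t.size))
            for _ in range(4)]
    print(log_lik([0.7] * 4, data))
```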

If we need a lot of posterior draws (either because we want very high precision or we have very slow mixing), then after adaptation, we can generate them in an embarrassingly parallel fashion.
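Here’s a toy illustration of what I mean, with a hand-rolled HMC on a standard normal in Python (not Stan’s sampler): once the step size and mass matrix are frozen after adaptation, additional chains are independent and can be farmed out to separate processes.

```python
# Sketch only: post-adaptation chains are independent, so run them in parallel.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

STEP, STEPS = 0.3, 20  # pretend these came from a shared adaptation phase

def run_chain(seed, n_draws=1000):
    rng = np.random.default_rng(seed)
    q = rng.normal()  # pretend adaptation left us in the typical set
    draws = np.empty(n_draws)
    for i in range(n_draws):
        p = rng.normal()
        q_new, p_new = q, p
        for _ in range(STEPS):  # leapfrog for U(q) = q^2 / 2
            p_new -= 0.5 * STEP * q_new
            q_new += STEP * p_new
            p_new -= 0.5 * STEP * q_new
        # Metropolis accept/reject on the Hamiltonian H = q^2/2 + p^2/2
        if rng.random() < np.exp(0.5 * (q**2 + p**2 - q_new**2 - p_new**2)):
            q = q_new
        draws[i] = q
    return draws

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:  # chains run in parallel
        chains = list(pool.map(run_chain, range(8)))
    print(np.mean(chains), np.var(chains))
```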

There are some adaptation steps that I think we can parallelize.

Then if you look at something like Riemannian HMC (RHMC), there are Hessian computations that are embarrassingly parallel up to the number of parameters. These dominate RHMC’s compute time. And RHMC mixes amazingly well for hard problems, so parallelizing them has some chance of making it tractable.
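For a rough picture of where that parallelism lives, here’s a hypothetical Python sketch (a toy log density, not Stan’s autodiff): each Hessian column is one gradient evaluation, computed here by central finite differences, and the columns don’t depend on each other, so they parallelize up to the number of parameters.

```python
# Sketch only: Hessian columns are independent gradient evaluations.
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def grad_log_p(x):
    # Gradient of a toy log density: a correlated Gaussian
    prec = np.array([[2.0, -1.0], [-1.0, 2.0]])
    return -prec @ x

def hessian_column(x, h, j):
    # Column j via central differences of the gradient
    e = np.zeros_like(x)
    e[j] = h
    return (grad_log_p(x + e) - grad_log_p(x - e)) / (2 * h)

def hessian(x, h=1e-5):
    with ProcessPoolExecutor() as pool:  # one column per worker
        cols = pool.map(partial(hessian_column, x, h), range(x.size))
        return np.column_stack(list(cols))

if __name__ == "__main__":
    print(hessian(np.array([0.3, -0.2])))
```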

All the data (in the statistical sense) can sit on the GPU. But many of the operands are parameters, which vary from draw to draw during MCMC.
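The pattern, sketched in Python assuming CuPy is available (again, not Stan’s GPU code): copy the big data matrix to the device once, and then per draw only the small parameter vector crosses the bus.

```python
# Sketch only: data stays resident on the GPU; parameters move each draw.
import numpy as np
import cupy as cp

X_host = np.random.default_rng(0).normal(size=(100_000, 50))
y_host = np.random.default_rng(1).normal(size=100_000)
X = cp.asarray(X_host)  # one-time transfer; stays on the device
y = cp.asarray(y_host)

def log_lik(beta_host, sigma=1.0):
    beta = cp.asarray(beta_host)  # per-draw transfer: just 50 numbers
    resid = y - X @ beta          # all the heavy work happens on the GPU
    return float(-0.5 * cp.sum(resid**2) / sigma**2)

for _ in range(3):  # stand-in for the MCMC loop
    print(log_lik(np.zeros(50)))
```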

We’re not 100% sure how much precision we need for accurate Hamiltonian simulations. We do know that it will vary by model. This could be explored without GPUs, and I think it’d make a great project for someone. Matt Hoffman did a little bit of this in a post either here or on our old mailing list.
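If someone wanted a starting point, one cheap experiment (my sketch, not Matt’s) is to run the same leapfrog trajectory in single and double precision and watch how far the Hamiltonian drifts:

```python
# Sketch only: compare Hamiltonian error under float32 vs. float64.
import numpy as np

def hamiltonian_error(dtype, steps=1000, eps=0.01):
    q = np.asarray(1.0, dtype=dtype)
    p = np.asarray(0.5, dtype=dtype)
    eps = dtype(eps)
    h0 = 0.5 * (q**2 + p**2)  # H for the toy potential U(q) = q^2 / 2
    for _ in range(steps):    # leapfrog, all arithmetic in `dtype`
        p = p - dtype(0.5) * eps * q
        q = q + eps * p
        p = p - dtype(0.5) * eps * q
    return abs(0.5 * (q**2 + p**2) - h0)

for dt in (np.float32, np.float64):
    print(dt.__name__, hamiltonian_error(dt))
```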