Stan on GPU: looking for model+dataset examples for empirical evaluation of speedups


As some of you might already know, there is an ongoing effort/project to parallelize parts Stan on the GPU for faster inference (stanmathcl).

We are currently in the process of

  • empirically evaluating speedups we can achieve with what has already been parallelized (Cholesky, multiplication…) and
  • trying to identify other common computational bottlenecks, so we can prioritize what to parallelize next.

So, we turn to the Stan community for Bayesian modelling scenarios that are often encountered in statistical practice and where you just wish inference could be a few times faster.

All suggestions and examples are welcome, except those where the main issue is very poor mixing (nothing we can do about that with this research) and where the dataset is to big to fit in GPU memory (not focusing on these kind of problems right now). And we’d love it if you can provide Stan model code and a dataset.


I would like to suggest to start here: Small-area spatiotemporal analysis of pedestrian and bicyclist injuries in New York City. @mitzimorris has set up all the “machinery” for a subset of the whole dataset here: Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data . You have to look at the “Bigger data: from 56 counties in Scotland to 709 census tracts in Brooklyn” section.
I am looking forward to good news!


I have a scenario of hierarchical modeling that usually takes a few days of runtime. I would be happy to provide a testing dataset if this is a scenario that could be helpful.


You might want to specify what’s “too big to fit into GPU”


Most GPUs have 8-12GB of DRAM. So the problems can be large, but not massive


Maybe too involved to get into, but here’s a data generation and fitting example from a recent paper I worked on discussing interventions in dynamic systems with a psychological focus… parallelizing over subjects in the Kalman filter would be a big improvement I guess.

####Individual differences in treatment effects in dynamic system

nlatent=2 #number of latent processes
nmanifest=1 #number of manifest variables
tpoints=30 #number of measurement occasions
ntdpred=1 #number of time dependent predictors
nsubjects=30 #number of subjects
TDPREDMEANS[floor(tpoints/2)]=1 #time dependent predictor (intervention) after 50% of observations

  n.latent=nlatent, n.manifest=nmanifest, n.TDpred=ntdpred,
  LAMBDA=matrix(c(1, 0), nrow=nmanifest, ncol=nlatent),
  DRIFT=matrix(c(-.2, 0, 1, -.2), nrow=nlatent, ncol=nlatent),
  DIFFUSION=matrix(c(.1, 0, 0, 0.00001), nrow=nlatent, ncol=nlatent),
  MANIFESTVAR=matrix(c(.1), nrow=nmanifest, ncol=nmanifest),
  TDPREDEFFECT=matrix(c(0, .2), nrow=nlatent, ncol=ntdpred),
  CINT=matrix(c(0, 0), nrow=nlatent, ncol=1),
  MANIFESTMEANS=matrix(c(0), nrow=nmanifest, ncol=1))

dat=aaply(1:nsubjects, 1, function(x){ #generate data w random parameter in DRIFT
  tempm$DRIFT[2, 2] = -exp(rnorm(1, -2, .5) + stdage * .5)
  cbind(ctGenerate(ctmodelobj=tempm, n.subjects=1, burnin=50, logdtsd=.3), stdage)

#convert to long format used by Bayesian ctsem 
datlong=ctWideToLong(datawide = dat, Tpoints = tpoints, n.manifest = nmanifest,
  n.TDpred = ntdpred,n.TIpred = 1,manifestNames = c("Y1"),
  TDpredNames = c("TD1"), TIpredNames=c("stdage"))
datlong=ctDeintervalise(datlong) #convert intervals to abs time

#create and fit model
fitm=ctModel(Tpoints=tpoints, type="stanct", n.latent=nlatent, n.manifest=nmanifest, 
  n.TDpred=ntdpred, n.TIpred=1, TIpredNames = "stdage", 
  LAMBDA=matrix(c(1, 0), nrow=nmanifest, ncol=nlatent),
  DRIFT=matrix(c("drift11", 0, 1, "drift22"), nrow=nlatent, ncol=nlatent),
  DIFFUSION=matrix(c("diffusion11", 0, 0, 0.001), nrow=nlatent, ncol=nlatent),
  MANIFESTVAR=matrix(c("merror11"), nrow=nmanifest, ncol=nmanifest),
  MANIFESTMEANS=matrix(c("mmeans_Y1"), nrow=nmanifest, ncol=nmanifest),
  TDPREDEFFECT=matrix(c(0, "tdpredeffect21"), nrow=nlatent, ncol=ntdpred))

#only the persistence and strength of the predictor effect varies across individuals
fitm$pars$indvarying[-c(8,18)] = FALSE 
#and thus standardised age can only affect those parameters
fitm$pars$stdage_effect[-c(8,18)] = FALSE 

stanmodel=ctStanFit(datlong, fitm, fit=FALSE)
stanfit = stan(model_code = stanmodel$stanmodeltext, data = stanmodel$data, iter=50, chains=3, cores=3)

ctsemfit = stanmodel
ctsemfit$stanfit = stanfit
class(ctsemfit) = 'ctStanFit'

ctKalman(ctsemfit, subjects = 1:3, timestep=.01, plot=TRUE,
  kalmanvec='etasmooth', errorvec='etasmoothcov', legendcontrol = list(x = "topleft"))


Stan uses standard cholesky decomposition. Eigen also provides a pivoted version which is more stable for ill-conditioned matrices. Maybe another comparison we can do is to check GPU version vs eigen pivoted version.


I previously did this here. Though the dropbox file no longer exists I pasted the results into the conversation

EDIT: Also this was for the old implementation using ViennaCL, not our current implementation


Great. Missed earlier conversation.


A simple GP like that shown in section 18.3 in the manual but with a large value for N. Maybe show the runtime for a few different values for N?


A hierarchical model like the one shown in section 9.13 of the manual but with a large value for K and a large value for J.


I have a complex economic model of auto insurance consumer choices (i.e. which plans people choose over time) and their accident/loss distributions (estimated jointly). I can’t share the real data since it is very much proprietary but can share a simulated DGP + model code. Currently the model takes several days for as little as 20,000 observations (ideally I’d be able to do at least 100k), and I suspect there is a lot of room for computational improvement. The model is for an academic publication and my co-author and I would be very interested in collaborating on speed improvements.


I should say that part of what makes the model slow is that we have to compute the “menu of options” (i.e. what consumers would expect to pay) for every possible insurance plan, which boils down to lots of matrix multiplication and simple (but numerous) integration - i.e. things that could benefit greatly from parallelization.


Have you considered MPI?

If you split things into independent computations then MPI is great. Of course, you can also combine the techniques… which has never been done in stan and would be interesting to see.


Currently we only have the Cholesky Decomposition on the GPU, but multiplication is probably one of the next places we will work on.

I agree with @wds15, if you can split the problem up it sounds like you could get a quicker win with MPI.


You can use MPI with Stan? (I didn’t know what that was and needed to Google it so forgive me if the question is silly)


MPI is coming, so the question is not silly at all. See here:

However, the first implementation can speed up things dramatically, but it is not truly user friendly since you have to cram your data into the format we expect it for easy distribution via MPI.


Thanks for all the comments and suggestions!

@fabio : We wanted to add a spatial covariance model and I was already looking at that case study. It is close to the top of our list and we will look at it very soon!

@mike-lawrence: We are already looking at a GP and we’ll add that hierarchical model to our list. @Gang if you could provide another hierarchical model and data, that would be great (we prefer real-world problems over just generating toy datasets).

@Charles_Driver : Yes, the example is a bit involved. Not at the top of our priorities (also looks like something that would benefit from MPI), but will definitely look into it at some point.

@yizhang @Stevo15025 : Stan doesn’t use pivoted version of Cholesky, so it wasn’t a priority to implement it on the GPU. However, as part of our work we plan on adding GPU-parallelized versions of some less standard methods, including (most) matrix decompositions that are in eigen but not used by Stan.

@shoshievass : Data, even simulated, and a model would be appreciated. We might also use it at some point, as @wds15 mentioned, to look at combining MPI & GPU. @Stevo15025, I think matrix multiplication is already in stanmathcl as are a few other minor things, but @rok_cesnovar would know precisely what.


Awesome! What sort of timeline are you guys thinking of? I can probably share a version of the model + simulated data reasonably soon if you guys or @wds15 would be willing to see if MPI and/or GPU-parallelization could help + how 0-:)


Really exciting - thank you for the response!