Scalability of bayesian glmm



I am currently learning stan from a book called “Bayesian Models for Astrophysical Data” and I am totally hooked. However when I ran stan on a simple linear mixed model

y_{i,j} = \beta_0 + \mathbf{x}_{i,j}^T\beta + \alpha_i + \epsilon_{i,j}

with 4500 observations and a total of 3 fixed effect and 20 random effect coefficients (+ 2 coefficients from a prior on each group of coefficients) it took quite some time (~4.7mins). I used the following mcmc config

fit = pystan.stan(model_code=stan_code, data=model_data, iter=10000, chains=3, thin=10, warmup=6000, n_jobs=3)

In one of the applications where I want to use the model I have 1.1m observations and probably at least 10k parameters (because there are 2 hierarchies with 3000 and 1000 categories each). Is this doable at all in stan?



I think it’s feasible but it might require a lot of patience (or additional computational resources). :-)

There are actually a number of things you can do to optimise the performance of your current set-up. You probably don’t need to thin the chains and you can try with iter = 1000, warmup = 600. The number of iterations you need will depend on how precise you want the estimate of your parameters (or quantities of interest to be). The precision again depends on N_eff.

There are efficient and inefficient implementations of linear mixed models. The Stan manual gives a lot of possible improvements. If you are comfortable with R, you can also try the Rstanarm or brms packages which have pretty efficient implementations.

Sometimes, you can simplify linear mixed models by summarising your data per group in your lowest level in the hierarchy. I think there are some sections in the manual on sufficient statistics.

(Weakly) informative priors can also improve the speed (and efficiency in terms of N_eff).


You have to be careful with parameterizations of these models to make them fit. There’s a lot of discussion in our manual—don’t know about in that book.

You also probably don’t need 1000 draws in 4 chains for your inferential purposes—usually these models converge and mix quickly enough you can get away with fewer warmup and sampling iterations.

As @stijn points out, there are very careful implementations that are part of rstanarm for this that condition the predictor matrices and use non-centered parameterizations for hierarchical components.


thanks Bob. So I am using rstanarm now for a linear mixed model (>700k observations, 4 fixed, 4 random effects). When I run it with the default rstan sampling parameters (i.e. chains = 4, iter = 2000, warmup = floor(iter/2)) on 3 cores, it tells me

Gradient evaluation took 0.651 seconds
1000 transitions using 10 leapfrog steps per transition would take 6510 seconds.
Adjust your expectations accordingly!

You suggested to reduce the number of draws or number of chains. Is there any heuristic/rule of thumb or is it so hardware/dataset/model specific that I just have to experiment until I get something which is acceptable?

Edit 1: When it tells me for 1000 transitions it takes x seconds, and I specified iter=500 in the stan_lmer rstanarm function, then it should take x/2 seconds to complete that chain. is that correct?

Edit 2: I stopped the last run and put the following parameters in stan_lmer (rstanarm function):
chains=3, cores=3, iter=500

Now it tells me

Gradient evaluation took 4.06 seconds
1000 transitions using 10 leapfrog steps per transition would take 40600 seconds.
Adjust your expectations accordingly!

which means 11 hours for 1000 iterations. I ran it for 500 iterations, 12 hours are over and it’s still running :/



Those timing hints are super misleading and I wish we’d get rid of them. It says that 1000 transitions at 10 leapfrog steps/transition. But you don’t know how many leapfrog steps you need—more as the models get more complicated.

Depends on what inferences you want to do. n_eff = 100 is usually plenty, dependin on what you’ll do.

Are you sure you’re not running out of memory?


No, I am running this on a box with 256GB memory, I would be surprised if that’s not enough.

I would be interested if/how HMC can be scaled up. At the moment, on this dataset lmer from lme4 finished the fit in less than 12 hours (havent stopped the time) but rstanarm has not completed after 48 hours


Well, there’s always approximate algorithms, ranging from very careful precise ones like INLA to rougher, black-box approaches like the ADVI built into Stan.

We’re actually working on three ways to speed up these GLM-type calculations.

  1. Running multiple cores, say on a cluster, and running with a GPU. As of CmdStan 2.18 (Rstan and PyStan coming soon), you can run multiple cores and break up the likelihood calculation. It’s nearly embarassingly parallel up to 100 cores over multiple machines with a good network. This could be a 100-fold speedup with 128 cores.

  2. Running on GPU. There’s a big matrix multiply in the middle here that can be parallelized very efficiently. This might be a 10-fold speedup.

  3. Compounds GLM functions. Adding analytic gradients to the basic GLMs can speed things up another factor of 2 or 3.

The neat thing is all of these are independent. So in a year or so when they’re all in, we should be about 1000 times faster and beefy hardware. That takes
daylong calculations down to 1.5 minutes or so.