Scalability of Bayesian GLMMs

techniques
performance

#1

I am currently learning Stan from the book “Bayesian Models for Astrophysical Data” and I am totally hooked. However, when I ran Stan on a simple linear mixed model

y_{i,j} = \beta_0 + \mathbf{x}_{i,j}^T\beta + \alpha_i + \epsilon_{i,j}

with 4500 observations and a total of 3 fixed-effect and 20 random-effect coefficients (plus 2 parameters from the prior on each group of coefficients), it took quite some time (~4.7 minutes). I used the following MCMC configuration:

fit = pystan.stan(model_code=stan_code, data=model_data, iter=10000, chains=3, thin=10, warmup=6000, n_jobs=3)

In one of the applications where I want to use this model I have 1.1 million observations and probably at least 10,000 parameters (because there are two grouping hierarchies, with 3000 and 1000 categories respectively). Is this doable at all in Stan?

Thanks!


#2

I think it’s feasible, but it might require a lot of patience (or additional computational resources). :-)

There are actually a number of things you can do to optimise the performance of your current set-up. You probably don’t need to thin the chains, and you can try iter = 1000, warmup = 600. The number of iterations you need depends on how precise you want the estimates of your parameters (or other quantities of interest) to be, and that precision in turn depends on the effective sample size (N_eff) rather than the raw number of draws.
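
A minimal sketch of such a lighter configuration, reusing the stan_code and model_data from your original call (everything else about the model stays the same):

fit = pystan.stan(model_code=stan_code, data=model_data,
                  iter=1000,    # total iterations per chain
                  warmup=600,   # discarded as warmup
                  chains=3,
                  n_jobs=3)     # no thin argument: keep every post-warmup draw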

There are efficient and inefficient ways to implement linear mixed models in Stan. The Stan manual describes a number of possible improvements, for example the non-centered parameterisation of the group-level effects, which often samples much better when groups have few observations. If you are comfortable with R, you can also try the rstanarm or brms packages, which already have pretty efficient implementations.
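
To sketch the non-centered idea for a model like yours (this is a generic version with illustrative variable names and 3 fixed effects hard-coded, not your actual stan_code):

stan_code_ncp = """
data {
  int<lower=1> N;                 // observations
  int<lower=1> J;                 // groups
  int<lower=1, upper=J> g[N];     // group index per observation
  matrix[N, 3] X;                 // fixed-effect covariates
  vector[N] y;
}
parameters {
  real beta0;
  vector[3] beta;
  vector[J] alpha_raw;            // standardised group effects
  real<lower=0> sigma_alpha;      // between-group sd
  real<lower=0> sigma;            // residual sd
}
transformed parameters {
  // non-centered parameterisation: implies alpha ~ normal(0, sigma_alpha)
  vector[J] alpha = sigma_alpha * alpha_raw;
}
model {
  alpha_raw ~ normal(0, 1);
  y ~ normal(beta0 + X * beta + alpha[g], sigma);
}
"""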

Sometimes you can also simplify a linear mixed model by summarising the data per group at the lowest level of the hierarchy; I think there are some sections in the manual on exploiting sufficient statistics.
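
As an illustration of the idea only (it assumes a binary outcome, which is not your case, and the data frame and column names are made up): collapsing Bernoulli rows into one binomial count per group replaces one likelihood term per observation with one per group.

import pandas as pd

# df is a hypothetical data frame with one row per observation,
# a grouping column "group_id" and a 0/1 outcome "y"
grouped = (df.groupby("group_id")
             .agg(successes=("y", "sum"), trials=("y", "size"))
             .reset_index())
# In the Stan model, successes[j] ~ binomial(trials[j], theta[j])
# then replaces N separate bernoulli terms.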

(Weakly) informative priors can also improve the speed (and efficiency in terms of N_eff).
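
For example, in the sketch above the coefficients and scale parameters could be given weakly informative priors by adding a few lines to the model block (the scales are placeholders and should reflect the scale of your data):

extra_priors = """
  beta0 ~ normal(0, 5);            // fixed-effect intercept
  beta ~ normal(0, 5);             // fixed-effect slopes
  sigma_alpha ~ normal(0, 1);      // half-normal, given the <lower=0> constraint
  sigma ~ normal(0, 1);            // half-normal residual sd
"""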