Additive random effect models with large data sets

But I guess this trick only works in the case of a normally distributed outcome? Are there any speed-ups for binary data?

edit: to confirm, with rstanarm v2.18.9 warmup took 0.11s with N=2.6 million observations and 6 predictors


No speed-ups that big. There is a GitHub branch of rstanarm which uses the bernoulli_logit_glm compound function, which is about 4 times faster. Eventually there will be a GPU version of that, which will provide an additional speed-up if you have a GPU card. There are potential fast approximations which could be combined with importance sampling, but that would require some thinking and work.
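For reference, the compound function fuses the linear predictor and the likelihood into one call, so Stan skips the intermediate `alpha + x * beta` vector and its autodiff graph. A minimal sketch of such a model (not the actual rstanarm branch code) might look like:

```stan
data {
  int<lower=0> N;                  // number of observations
  int<lower=0> K;                  // number of predictors
  matrix[N, K] x;                  // predictor matrix
  int<lower=0, upper=1> y[N];      // binary outcome
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  alpha ~ normal(0, 5);
  beta ~ normal(0, 2);
  // One fused call instead of y ~ bernoulli_logit(alpha + x * beta)
  y ~ bernoulli_logit_glm(x, alpha, beta);
}
```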



That sounds cool. Right now, with ~1.2 million observations, the binomial model takes about 63 hours to run. If I standardize my outcome and use a Gaussian model instead, it takes ~114 seconds (warmup + sampling). The coefficients can be transformed to the approximate log-odds scale by the formula \beta / (\mu (1 - \mu)), where \mu is the case prevalence. I am wondering, however, whether this is also a feasible approach for model selection/validation using loo?

edit: I should add this is what is often done in genomics, see e.g. and
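A minimal numeric sketch of the trick described above (not the poster's actual genomics pipeline): simulate a binary outcome from a logistic model with a small effect, fit ordinary least squares to the 0/1 outcome, and recover the log-odds coefficient with the \beta / (\mu (1 - \mu)) transformation. The simulation settings (sample size, prevalence, effect size) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)

# True logistic model with a small effect -- the regime where the
# linear-model approximation is known to hold.
a_true, b_true = np.log(0.2 / 0.8), 0.10   # intercept gives ~20% prevalence
p = 1.0 / (1.0 + np.exp(-(a_true + b_true * x)))
y = rng.binomial(1, p).astype(float)

# Fit ordinary least squares directly to the binary outcome.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b_lin = coef[1]

# Transform the linear coefficient back to the log-odds scale.
mu = y.mean()                       # observed case prevalence
b_logit = b_lin / (mu * (1.0 - mu))

print(b_logit)  # close to the true log-odds coefficient of 0.10
```

With a small effect like this, the recovered coefficient sits within sampling noise of the true value; for larger effects the first-order approximation behind the formula degrades, which matches the caveat later in the thread.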

Cool! This transformation is likely to be safe for that many observations.

Yes. We’ll soon have the loo package supporting subsampling LOO, which will provide a significant speed-up for 1.2 million observations. We’ve tested it by comparing models with 1 million observations.

Popping my head into this discussion. Could you explain a little bit about standardizing a binary outcome and running it as a linear model? That speed advantage is unbelievable!

Sorry for the late reply. I should maybe have added that this is only well described for the case of small effect sizes (i.e. odds ratios between 1 and 1.3), so it is not something that will work for all problems. More details are also available in