Additive random effect models with large data sets

But I guess this trick only works in the case of a normally distributed outcome? Are there any speed-ups for binary data?

edit: to confirm, with rstanarm v2.18.9 warmup took 0.11s with N=2.6 million observations and 6 predictors

Correct.

Nothing that big. There is a GitHub branch of rstanarm that uses the bernoulli_logit_glm compound function, which is 4 times faster. Eventually there will be a GPU version of that, which will provide an additional speedup if you have a GPU card. There are also potential fast approximations that could be combined with importance sampling, but that would require some thinking and work.
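
In case it helps to see what that compound function looks like, here is a minimal sketch (not the rstanarm branch itself) of a logistic regression written directly in rstan with bernoulli_logit_glm; it assumes a Stan version that includes the GLM functions (2.18+), and the data are simulated only to make the call runnable:

```r
library(rstan)

stan_code <- "
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  int<lower=0, upper=1> y[N];
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  alpha ~ normal(0, 2.5);
  beta ~ normal(0, 2.5);
  // Compound GLM likelihood: fuses the linear predictor and the
  // Bernoulli-logit log density into a single, faster call.
  y ~ bernoulli_logit_glm(x, alpha, beta);
}
"

# Small simulated data set just to make the sketch runnable.
set.seed(1)
N <- 1000; K <- 6
x <- matrix(rnorm(N * K), N, K)
y <- rbinom(N, 1, plogis(0.2 + drop(x %*% rnorm(K, 0, 0.3))))

fit <- stan(model_code = stan_code,
            data = list(N = N, K = K, x = x, y = y),
            chains = 2, iter = 1000)
```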

Excellent!


That sounds cool. Right now, with ~1.2 million observations, the binomial model takes about 63 hours to run. If I standardize my outcome and use a Gaussian model instead, it takes ~114 seconds for warmup plus sampling. The coefficients can be transformed into approximate log-odds with the formula \beta / (\mu (1 - \mu)), where \mu is the case prevalence. I am wondering, however, whether this is also a feasible approach for model selection/validation using loo?

edit: I should add that this is what is often done in genomics; see e.g. BOLT-LMM (https://data.broadinstitute.org/alkesgroup/BOLT-LMM/) and "Transformation of Summary Statistics from Linear Mixed Model Association on All-or-None Traits to Odds Ratio" (PMC).
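
For concreteness, a minimal sketch of that transformation (assuming the outcome is left on the 0/1 scale; if you standardize y first, multiply the coefficient back by sd(y) before dividing). Data, model, and settings here are illustrative only:

```r
library(rstanarm)

# Simulated binary outcome with small effects, just to make the sketch runnable.
set.seed(1)
n <- 5000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-1.5 + 0.15 * dat$x1 + 0.10 * dat$x2))

# Gaussian (linear probability) fit in place of the logistic model.
fit_gauss <- stan_glm(y ~ x1 + x2, family = gaussian(), data = dat,
                      chains = 2, iter = 1000, refresh = 0)

mu <- mean(dat$y)                                    # case prevalence
approx_log_odds <- coef(fit_gauss)[-1] / (mu * (1 - mu))
exp(approx_log_odds)                                 # approximate odds ratios
```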

Cool! This transformation is likely to be safe for that many observations.

Yes. The loo package will soon support sub-sampling LOO, which will provide a significant speed-up with 1.2 million observations. We've tested it on a comparison of models with 1 million observations.
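
As a rough sketch of what that will look like, using the loo_subsample() interface with the llfun(data_i, draws) convention from the loo large-data vignette (exact argument names may differ), and continuing from the Gaussian fit and data sketched in the earlier post:

```r
library(loo)

# Posterior draws from the Gaussian fit above
# (columns "(Intercept)", "x1", "x2", "sigma").
posterior_draws <- as.matrix(fit_gauss)

# Per-observation log-likelihood: data_i is a one-row data frame,
# draws is the matrix of posterior draws.
llfun_gauss <- function(data_i, draws) {
  mu_i <- draws[, "(Intercept)"] + draws[, "x1"] * data_i$x1 +
    draws[, "x2"] * data_i$x2
  dnorm(data_i$y, mean = mu_i, sd = draws[, "sigma"], log = TRUE)
}

# Sub-sampled LOO: the full log-likelihood is evaluated for only
# `observations` points rather than all N, which is where the
# speed-up for millions of rows comes from.
loo_ss <- loo_subsample(llfun_gauss, data = dat, draws = posterior_draws,
                        observations = 400)
print(loo_ss)
```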

Popping my head into this discussion. Could you explain a bit about standardizing a binary outcome and running it as a linear model? That speed advantage is unbelievable!

Sorry for the late reply. I should maybe have added that this approximation is only well characterized for small effect sizes (i.e. odds ratios between roughly 1 and 1.3), so it is not something that will work for all problems. More details are also available in https://projecteuclid.org/euclid.aoas/1365527203
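
As a quick sanity check of that limitation (simulated data, illustrative only), here is a comparison of the transformed linear-model slope with the logistic-regression estimate at a small and a larger effect size:

```r
set.seed(1)
check_approx <- function(true_log_or, n = 2e5) {
  x <- rnorm(n)
  y <- rbinom(n, 1, plogis(-2 + true_log_or * x))
  mu <- mean(y)                                           # case prevalence
  beta_lm  <- coef(lm(y ~ x))["x"] / (mu * (1 - mu))      # transformed LM slope
  beta_glm <- coef(glm(y ~ x, family = binomial))["x"]    # logistic slope
  c(transformed_lm = unname(beta_lm), logistic = unname(beta_glm))
}

check_approx(log(1.2))  # small effect: the two estimates should agree closely
check_approx(log(3))    # larger effect: the transformed value drifts away
```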