Stan_glmer speed for large sample sizes

I am using stan_glmer (rstanarm package) with a dataset of 35,000 people. The model has a number of fixed and random effects, with a mixture of continuous variables, factors, and dummy variables. For example:

mod <- stan_glmer(
  outcome ~ var1 + var2 + var3 + var4 + var5 + var6 + var7 + var8 +
    (1 | var20) + (1 | var21) + (1 | var22) + (1 | var23) + (1 | var24) +
    (1 | var25) + (1 | var26) + (1 | var27) + (1 | var28),
  data = dat, family = binomial(link = "logit"),
  prior_intercept = normal(0, 1), prior = normal(0, 1),
  cores = 5
)

I am running it on a fairly powerful server. However, it is taking a very long time to finish. In fact, I had to cancel the process after 4 hours.

Aside from increasing the number of cores, are there any methods of making this run faster? The dataset is longitudinal and it will get even larger by November.


Before focusing on making the computation faster, you should verify that the results you’re getting from your model are reasonable.

Frequently a badly specified model is slow.

It could be that there are identifiability problems or other issues in the model that are making your sampling slow.

Since the big model is too slow, you should start with a smaller model and build up from there. Presumably there is a small model you can play with that will run fast, and then you can add terms until things get slow. This should give you more insight into which part of the model is causing the slowdown.
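A minimal sketch of that build-up process, reusing the variable names from your model (the specific subsets of predictors are just placeholders):

    # 1. Fixed effects only -- fast baseline, short chains for diagnosis
    mod0 <- stan_glm(outcome ~ var1 + var2 + var3 + var4,
                     data = dat, family = binomial(link = "logit"),
                     prior_intercept = normal(0, 1), prior = normal(0, 1),
                     chains = 2, iter = 500, cores = 2)

    # 2. Add one batch of varying intercepts and compare the timing
    mod1 <- stan_glmer(outcome ~ var1 + var2 + var3 + var4 + (1 | var20),
                       data = dat, family = binomial(link = "logit"),
                       prior_intercept = normal(0, 1), prior = normal(0, 1),
                       chains = 2, iter = 500, cores = 2)

    # 3. Keep adding (1 | var21), (1 | var22), ... one at a time,
    #    and note which addition makes the sampling slow down sharply.

Wrapping each call in system.time() makes it easy to see where the cost jumps.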

35,000 is a lot of observations. I wouldn’t be surprised if in the end the model takes a few hours to run.


N = 35,000 is not so big, and it should not take hours if there is only one batch of varying intercepts. But with multiple batches, sure, maybe.

Here are some suggestions, in no particular order:

  1. As Ben says, simulate fake data from the model and try fitting your model to the fake data.

  2. Simplify the model. First fit the model with no varying intercepts. Then add one batch of varying intercepts, then the next, etc.

  3. Run for 200 iterations, not 2000. Eventually you can run for 2000 iterations, but no point in doing that while you’re still trying to figure out what’s going on.

  4. Put priors on the group-level variance parameters.

  5. Consider some interactions of the group-level predictors. It seems strange to have an additive model with 14 terms and no interactions.

  6. Fit the model on a subset of your data.
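For suggestion 1, here is one way to simulate fake data from a simplified version of the model (one batch of varying intercepts; all parameter values and sizes below are arbitrary choices for illustration):

    set.seed(1)
    n_groups <- 50
    n <- 5000
    fake <- data.frame(
      var1  = rnorm(n),
      var20 = sample(factor(1:n_groups), n, replace = TRUE)
    )
    alpha <- rnorm(n_groups, 0, 0.5)            # true group intercepts, sd = 0.5
    eta <- -1 + 0.8 * fake$var1 + alpha[fake$var20]
    fake$outcome <- rbinom(n, 1, plogis(eta))

    fit_fake <- stan_glmer(outcome ~ var1 + (1 | var20),
                           data = fake, family = binomial(link = "logit"),
                           prior_intercept = normal(0, 1), prior = normal(0, 1),
                           chains = 2, iter = 500, cores = 2)
    # Check whether the posterior recovers the coefficient 0.8 for var1
    # and a group-level sd near 0.5.

If the model can’t recover known parameters from clean fake data, no amount of computation will fix the real fit.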
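Suggestions 3, 4, and 6 can be combined in a quick test run. This sketch subsets the data, shortens the chains, and puts an explicit prior on the group-level variances via rstanarm’s decov() (the subset size and prior values are just examples; since the data are longitudinal, you may want to sample whole people rather than rows so each person’s repeated measurements stay together):

    # Random subset of ~5000 rows for fast experimentation
    set.seed(123)
    dat_small <- dat[sample(nrow(dat), 5000), ]

    mod_test <- stan_glmer(
      outcome ~ var1 + var2 + (1 | var20) + (1 | var21),
      data = dat_small, family = binomial(link = "logit"),
      prior_intercept = normal(0, 1), prior = normal(0, 1),
      # decov() sets the prior on the group-level (co)variances;
      # with intercept-only terms this amounts to a prior on each sd
      prior_covariance = decov(shape = 1, scale = 1),
      chains = 2, iter = 200, cores = 2
    )

Once the small runs look reasonable, scale back up to the full data and 2000 iterations.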