Stan_glmer speed for large sample sizes

I am using stan_glmer (from the rstanarm package) with a dataset of 35,000 people. The model has a number of fixed and random effects, with a mixture of continuous, factor, and dummy variables. For example:

mod <- stan_glmer(
  outcome ~ var1 + var2 + var3 + var4 + var5 + var6 + var7 + var8 +
    var9 + var10 + var11 + var12 + var13 + var19 +
    (1 | var20) + (1 | var21) + (1 | var22) + (1 | var23) + (1 | var24) +
    (1 | var25) + (1 | var26) + (1 | var27) + (1 | var28),
  data = dat,
  family = binomial(link = "logit"),
  prior_intercept = normal(0, 1),
  prior = normal(0, 1),
  cores = 5
)

I am running it on a fairly powerful server. However, it is taking a very long time to finish. In fact, I had to cancel the process after 4 hours.

Aside from increasing the number of cores, are there any ways to make this run faster? The dataset is longitudinal and will get even larger by November.

Thanks


Before focusing on making the computation faster, you should verify that the results you’re getting from your model are reasonable.

Frequently a badly specified model is slow.

It could be that there are identifiability problems or other things going on that are making your sampling slow.

Since the big model is too slow, start with a smaller model and build up from there. Presumably there is a small model you can play with that runs fast, and you can then add terms until things slow down. That should give you more insight into which part of the model is the bottleneck.
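For example, here is a minimal sketch of that build-up, assuming the dat data frame and variable names from the original post (short chains and few iterations, just for exploration):

library(rstanarm)

# Add one batch of varying intercepts at a time and watch when sampling slows down.
forms <- list(
  outcome ~ var1 + var2 + var3 + (1 | var20),
  outcome ~ var1 + var2 + var3 + (1 | var20) + (1 | var21),
  outcome ~ var1 + var2 + var3 + (1 | var20) + (1 | var21) + (1 | var22)
)

fits <- lapply(forms, function(f) {
  stan_glmer(f, data = dat, family = binomial(link = "logit"),
             prior_intercept = normal(0, 1), prior = normal(0, 1),
             chains = 2, iter = 500, cores = 2)
})

# Compare elapsed sampling times, e.g.
# lapply(fits, function(fit) rstan::get_elapsed_time(fit$stanfit))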

35,000 is a lot of observations. I wouldn’t be surprised if in the end the model takes a few hours to run.


N = 35,000 is not so big, and it should not take hours if there is only one batch of varying intercepts. But with multiple batches, sure, maybe.

Here are some suggestions, in no particular order:

  1. As Ben says, simulate fake data from the model and try fitting your model to the fake data.

  2. Simplify the model. First fit the model with no varying intercepts. Then add one batch of varying intercepts, then the next, etc.

  3. Run for 200 iterations, not 2000. Eventually you can run for 2000 iterations, but no point in doing that while you’re still trying to figure out what’s going on.

  4. Put priors on the group-level variance parameters.

  5. Consider some interactions of the group-level predictors. It seems strange to have an additive model with 14 terms and no interactions.

  6. Fit the model on a subset of your data. (A short sketch below illustrates this along with suggestions 3 and 4.)
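Here is a minimal sketch of suggestions 3, 4, and 6, assuming the dat data frame and variable names from the original post; decov() is rstanarm's prior on the covariance of the group-level effects, which for intercept-only terms acts on the group-level standard deviations:

library(rstanarm)

# 6. Iterate on a random subset of the data first.
set.seed(123)
dat_small <- dat[sample(nrow(dat), 5000), ]

# 3./4. Few iterations while exploring, and an explicit prior on the
# group-level variance parameters via prior_covariance.
mod_small <- stan_glmer(
  outcome ~ var1 + var2 + var3 + (1 | var20) + (1 | var21),
  data = dat_small,
  family = binomial(link = "logit"),
  prior_intercept = normal(0, 1),
  prior = normal(0, 1),
  prior_covariance = decov(regularization = 2, shape = 1, scale = 1),
  chains = 4, iter = 200, cores = 4
)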


I have made progress since I originally posted this, so I wanted to circle back and explain what I found. This is largely for any beginners like me who are searching around for solutions. That said, please let me know if I am mistaken or if there is additional information that would be helpful.

To recap: I was having difficulty working out why it took a prohibitively long time to run multi-level models on large datasets (~50k observations). I could run a model overnight and it still wouldn’t finish. When it did finish, I always got the divergent-transitions warning (the bane of my existence).

I tried all the common suggestions for diagnosing these problems, to no avail. I found many tips on how to address the divergent transitions: I increased adapt_delta and the number of iterations, I made sure to rescale all my data, and I stared endlessly at pairs plots. Nothing helped. I could see the divergent transitions in the pairs plots, but I didn’t know how to translate that into a concrete remedy, other than removing some of the variables. There were fewer specific suggestions on how to fix the slowness, but I started by taking smaller samples and fitting a more parsimonious model. I found that the model would often be just fine with smaller samples (although often with divergent transitions). After adding more observations, without changing the model specification, it suddenly became as slow as molasses.
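(For later searchers, here is roughly what those knobs look like, sketched against a fitted stanreg object like mod from the first post; they were not the fix in my case.)

# Higher adapt_delta and more iterations: slower, sometimes fewer divergences.
mod_retry <- update(mod, adapt_delta = 0.99, iter = 4000)

# Rescale continuous predictors before fitting (var1 stands in for one of the
# continuous variables).
dat$var1 <- as.numeric(scale(dat$var1))

# Inspect divergent transitions for a few parameters at a time.
pairs(mod_retry, pars = c("(Intercept)", "var1", "var2"))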

After a while of tinkering and peeking into every nook and cranny of the web for a fix, I stumbled onto the solution for both the speed and the divergent transitions: add more levels to the grouping variables. Hallelujah, that did the trick!

Sometimes this can be difficult. If you are using multi-level modeling to estimate heterogeneous treatment effects (like I was), you may be using predictors like gender and race that don’t have many levels. I either concatenated a few of the variables together or cut something like age into more buckets (see the sketch below). I am not saying this is necessarily optimal, especially the concatenation, but boy did it work. Models that used to run all night and end up with divergent transitions were given a dose of pure nitro. They finished in an hour or two. And. No. Divergent. Transitions! Victory at last.
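Concretely, this is the kind of thing I mean (a rough sketch; gender, race, and age are hypothetical column names standing in for my actual variables):

# Concatenate two low-level factors into one grouping factor with more levels.
dat$gender_race <- interaction(dat$gender, dat$race, drop = TRUE)

# Cut a continuous variable into buckets so it can serve as a grouping factor.
dat$age_group <- cut(dat$age, breaks = c(18, 25, 35, 45, 55, 65, Inf),
                     right = FALSE)

# These then enter the formula as, e.g., (1 | gender_race) + (1 | age_group).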

I was generally aware that you need enough levels for your grouping factors, but I didn’t know that too few could cause all these issues. What also confused me is that increasing the sample size and reducing the complexity of, say, an OLS model is typically associated with fewer problems. In this case, I suppose I needed more complexity.

If I am way off base, let me know. I just wanted to continue the accumulation of knowledge.


Hi, this is an interesting topic, and the above post reminds me of lots of things.

  1. It’s too bad that increasing the number of iterations was one of the common suggestions you found. Generally we don’t recommend it: it’s everyone’s first idea, and it can make sense with Gibbs or Metropolis, but it’s typically not a good idea for HMC. If you’re having convergence problems at 2,000 iterations, trying 20,000 will probably be a waste of time.

  2. Increasing adapt_delta can make sense, but I don’t think it should be one of the first things you try.

  3. Our main recommendation is to use stronger priors, because computational instability arises when both the priors and the likelihood are weak. As you write, with only 2 or 3 levels of a factor you can get instability in estimating the group-level variance, and that instability can be cured with a strong prior (see the sketch after this list).

  4. Sometimes, of course, you don’t have strong prior information available. In that case, I still recommend a strong prior, but if you’d like, you can call it a “soft constraint” rather than a “prior”: the point is that it (softly) constrains the set of models you’re fitting, and usually you have some sense of reasonable bounds. It would be good for us to write a case study with an example of this.

  5. We generally don’t recommend using a multilevel model for a factor with only 2 levels. For example, instead of a grouping variable called sex with 2 levels, we’d include a variable called male (so that female is the default) as an ordinary regression coefficient.

  6. Your idea of combining factors is a bit of a kluge but it doesn’t seem like a bad idea to me.
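To make points 3, 4, and 5 concrete, here is a rough sketch in rstanarm, assuming the variables from the earlier posts; male is a hypothetical recoding of a 2-level sex factor, and the decov() scale is only an illustration of a tighter prior on the group-level standard deviations, not a recommendation:

library(rstanarm)

# 5. A 2-level factor enters as a dummy coefficient, not as a grouping factor.
dat$male <- as.integer(dat$sex == "male")

mod2 <- stan_glmer(
  outcome ~ male + var1 + var2 + (1 | var20) + (1 | var21),
  data = dat,
  family = binomial(link = "logit"),
  prior_intercept = normal(0, 1),
  prior = normal(0, 1),
  # 3./4. A tighter scale in decov() acts as a soft constraint on the
  # group-level standard deviations.
  prior_covariance = decov(regularization = 2, shape = 1, scale = 0.5),
  cores = 4
)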
