Since I have made progress since I originally posted this, I wanted to circle back and explain what I found, largely for any beginners like me who are searching around for solutions. That said, please let me know if I am mistaken or if there is additional information that would be helpful.
To recap: I was having difficulties determining why it took a prohibitively long time to run multi-level models with large datasets (~50k rows). I could let a model run overnight and it still wouldn’t finish. When it did finish, I always got the divergent-transition warning (the bane of my existence).
I tried all the common suggestions for diagnosing these problems, to no avail. I found many tips on how to address the divergent transitions: I increased adapt_delta, increased the iterations, made sure to rescale all my data, and stared endlessly at pair plots. Nothing helped. I could spot the divergent transitions in the pair plots, but I didn’t know how to translate that into a concrete remedy, other than removing some of the variables.

There were fewer specific suggestions on how to fix the slowness, so I started by taking smaller samples and fitting a more parsimonious model. I found that the model would often run just fine on smaller samples (although often still with divergent transitions). After adding more to the sample, without changing the model specification, it suddenly became as slow as molasses.
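A quick aside for other beginners on the rescaling part: by "rescale" I mean putting predictors on a roughly common scale (e.g. z-scores), so the sampler doesn't have to crawl across wildly different magnitudes. Whatever package you're modeling in, the idea is the same; the function and the data below are just illustrative:

```python
from statistics import mean, stdev

def standardize(xs):
    """Center and scale a predictor to mean 0, SD 1 (z-scores)."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

# Toy predictor on its raw scale (e.g. age in years)
ages = [23, 35, 47, 59, 71]
z = standardize(ages)  # mean ~0, SD ~1
```

Doing this to every continuous predictor (and often the outcome) before fitting is cheap insurance, even when it doesn't solve the divergences by itself.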
After a while of tinkering and peeking into every nook and cranny of the web for a fix, I stumbled onto a solution for both the speed and the divergent transitions: add more levels to the grouping variables. Hallelujah, that did the trick!
Sometimes this is easier said than done. If you are using multi-level modeling to estimate heterogeneous treatment effects (like I was), your grouping variables may be things like gender and race that only have a few levels. I either concatenated a few of the variables together into one combined factor or cut something like age into more buckets. I am not saying this is necessarily optimal, especially the concatenation, but boy did it work. Models that used to run all night and still end up with divergent transitions were given a dose of pure nitro. They finished in around an hour or two. And. No. Divergent. Transitions! Victory at last.
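To make that concrete, here is a minimal sketch of what I mean by concatenating grouping variables and cutting age into buckets. The field names, bucket width, and records are all made up for illustration; adapt them to your own data:

```python
# Hypothetical respondent records (illustrative only)
rows = [
    {"gender": "F", "race": "Black", "age": 34},
    {"gender": "M", "race": "White", "age": 67},
    {"gender": "F", "race": "Asian", "age": 22},
]

def age_bucket(age, width=10):
    """Cut a raw age into a decade-wide bucket label, e.g. 34 -> '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

for r in rows:
    # One combined grouping factor with many levels,
    # instead of separate gender/race factors with only a few each.
    r["group"] = f'{r["gender"]}_{r["race"]}_{age_bucket(r["age"])}'
```

You would then use the new `group` column as the grouping factor in the multi-level model. The trade-off is that you lose the ability to separate the variance components for gender and race, which is part of why I hedge on whether concatenation is optimal.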
I was generally aware that you need enough levels in your grouping variables, but I didn’t know that having too few could cause all of these issues. What also confused me is that with, say, an OLS model, increasing the sample size and reducing the complexity is usually associated with far fewer problems. In this case, I apparently needed more complexity.
If I am way off base, let me know. I just wanted to continue the accumulation of knowledge.