Hi all,
I’m fitting a large hierarchical nonlinear model on ~18k data points across a good number of groups: about 12k parameters, with p_loo = 1879. It seems to be doing a great job: the fits look excellent and the parameter estimates seem reasonable. I’m currently testing it against simulated data, but based on the fits and the posterior predictive distribution, I’m not feeling concerned. I typically get 0 or 1 divergences over 6000 iterations (with adapt_delta = 0.90), so that isn’t flagging anything either. Also, I only really care about my population-level fixed effects, so some bias at the level of individual data points or individual random effects matters little to me, as long as it doesn’t greatly affect the fixed-effects estimates.
However, I’m saturating max_treedepth: 5999 of 6000 iterations hit the cap. The model also takes about 4 or 5 days to fit. That’s doable, but it’s quite annoying for model building to realise a week later that I need an extra variable. I tried raising max_treedepth to 15, and sampling became much, much slower: after 24 hours, I hadn’t even reached 100 iterations. I don’t even know how to start investigating what’s happening at the higher treedepth, because it takes so long to sample anything at all.
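For reference, this is roughly how I’m counting the saturated iterations (a minimal sketch; `fit` is a placeholder for my cmdstanr fit object, and I’m assuming the default cap of 10 in the comparison):

```r
library(posterior)

# Pull the per-iteration sampler diagnostics (treedepth__, divergent__, ...)
diag <- fit$sampler_diagnostics()

# Proportion of post-warmup iterations that hit the treedepth cap
td <- extract_variable_matrix(diag, "treedepth__")
mean(td >= 10)

# And, for comparison, the handful of divergences
dv <- extract_variable_matrix(diag, "divergent__")
sum(dv)
```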
Preliminary questions:
- Is saturating the treedepth something I should even be concerned about? I understand it’s a matter of efficiency, but sampling for a month hardly seems like an efficient use of compute either.
- Are there other ways I should be attacking this problem besides increasing the treedepth? Reparameterisation (e.g. non-centring; see the sketch after this list)? Tightening priors? Other control parameters?
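To make the reparameterisation question concrete, here’s a minimal sketch of the kind of non-centred parameterisation I have in mind (a toy one-level normal hierarchy, not my actual model), written the cmdstanr way:

```r
library(cmdstanr)

# Toy model: group offsets are expressed as standardised z-scores, so the
# funnel between the offsets and their scale tau is removed a priori.
stan_file <- write_stan_file("
data {
  int<lower=1> N;
  int<lower=1> J;
  int<lower=1, upper=J> group[N];   // group index per observation
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] z;                      // standardised group offsets
}
transformed parameters {
  vector[J] theta = mu + tau * z;   // implies theta ~ normal(mu, tau)
}
model {
  z ~ std_normal();                 // non-centred: geometry no longer depends on tau
  mu ~ normal(0, 5);
  tau ~ normal(0, 1);
  y ~ normal(theta[group], 1);
}
")
mod_nc <- cmdstan_model(stan_file)
```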
I read through the discussion here (Setting Max Treedepth in difficult high-dimensional models), which seems relevant. @dlakelan was artificially reducing the max_treedepth and seeing significant speed-ups, with the option of later running the full model at a higher treedepth, or even settling on a compromise value that optimises effective samples per unit of computing time. I suspect my model shares some of the same characteristics: several very tightly identified parameters with small variance, for which leaving the typical set takes us to badlands where probability goes to die. The parameters are also quite highly correlated in some cases, which probably complicates the geometry further. I’m running with init = 0 to be more confident that the sampler starts within the typical set, as I had issues with an earlier (smaller) model where one chain would become completely lost out in the wilderness with 100% divergent transitions, while the others had none.
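If it helps to be concrete, this is roughly the experiment I’m imagining, following that thread (a sketch only; `mod` and `data_list` are placeholders for my compiled model and data):

```r
# Deliberately cap the treedepth below the default of 10 and see what it
# costs in effective sample size versus what it saves in wall time.
fit_capped <- mod$sample(
  data = data_list,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 500,
  init = 0,                # start at zero, as in my current runs
  adapt_delta = 0.90,
  max_treedepth = 8        # artificially low, per the linked discussion
)

fit_capped$time()          # compare wall time against the full run
fit_capped$summary()       # eyeball ess_bulk before committing to long runs
```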
The discussion above is from ~3 years ago, and there was some talk of saving warmup results for reuse, or a “run for 8 hours” mode. It’s also near the upper limit of my understanding of how to reason about these kinds of issues, so maybe I’m misreading something there. Does artificially lowering the max_treedepth, or just leaving it where it is, come with any other serious downsides that I should be considering? And, per the point above, are my saturated treedepths really so problematic if everything else looks fine?
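For what it’s worth, my (possibly naive) understanding of the warmup-reuse idea is something like the following sketch. I’m assuming the `inv_metric`, `step_size`, and `adapt_engaged` arguments behave as documented, and I don’t know whether this all works in v2.24, so please correct me:

```r
# Run warmup once (again, `mod` and `data_list` are placeholders)...
fit_warm <- mod$sample(
  data = data_list,
  chains = 4,
  iter_warmup = 1000,
  iter_sampling = 25       # just a few draws; we mainly want the adaptation
)

# ...pull out the adapted quantities...
inv_metric <- fit_warm$inv_metric(matrix = FALSE)[[1]]        # diagonal metric from chain 1
step_size  <- mean(fit_warm$metadata()$step_size_adaptation)  # adapted step sizes

# ...and feed them into later runs, skipping adaptation entirely.
# (Ideally the inits would also come from the end of warmup.)
fit_reuse <- mod$sample(
  data = data_list,
  chains = 4,
  iter_warmup = 0,
  adapt_engaged = FALSE,
  inv_metric = inv_metric,
  step_size = step_size,
  iter_sampling = 1000
)
```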
Thanks so much in advance for any help!
By the way, I’m using cmdstanr with CmdStan v2.24.1.