A question/suggestion:
How is treedepth intended to be handled during warmup? I always had the impression that it doesn’t make sense to allow high treedepths before at least some basic warmup has taken place. I think this is one of the reasons why models sometimes take a very long time for the first iterations before speeding up massively during later warmup stages. Maybe there could be a rule that warmup starts with a low treedepth limit which then increases.
A few other people have pointed this out as well (just to say it’s a good idea, not to say already-done). I like it too. The difficulty has been figuring out how to turn the treedepth back up. If you start adapting before the chains have really settled out it’s bad gnus, and some models really do need the big treedepth (I think the default of 10 is good).
I think at one point I tried looking at when the distribution of magnitudes of the momentums settled out. In HMC you get samples from p(p, q) and you just throw the p samples away. Because you know the p(p) distribution exactly (it’s your proposal), you can do a test. That’s the logic at least. Hope it’s correct.
I vaguely remember these distributions settled out too early, but I don’t remember how I tested this (I never put a decision rule in warmup, for instance), so that might be worth looking at again.
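For concreteness, a check along those lines could look roughly like this. It’s only a sketch (nothing like this exists in Stan): it assumes a unit metric, so the momenta should be N(0, I) and the squared norms χ² with dim degrees of freedom, and you’d feed it something like the end-of-trajectory momenta collected over recent warmup iterations:

```python
import numpy as np
from scipy import stats

def momenta_settled(momenta, alpha=0.05):
    """KS-test the squared momentum norms against their known distribution.

    momenta: array of shape (n_draws, dim). With a unit metric each momentum
    should be N(0, I), so ||p||^2 ~ chi2(dim). Returns True if we can't reject
    that the momenta already look like draws from the proposal."""
    n_draws, dim = momenta.shape
    sq_norms = np.sum(momenta ** 2, axis=1)
    p_value = stats.kstest(sq_norms, stats.chi2(df=dim).cdf).pvalue
    return p_value > alpha

# Toy check: exact draws from the proposal should pass most of the time.
rng = np.random.default_rng(0)
print(momenta_settled(rng.standard_normal((200, 10))))
```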
I think we are probably talking about two different problems here. I think you’re talking about the phase where the chain has to reach the typical set. What I had in mind is the phase after that, where it tries to find out how large the typical set is. Thus my following thoughts might only make sense for the latter. I hadn’t thought about the former. My impression was that even after the typical set is reached the treedepth should be kept small at first.
Just brainstorming about it, some quick ideas / theories / thoughts:
Having too low a treedepth has a similar effect to having too low a variance estimate for a parameter or too low a stepsize. So it probably doesn’t make sense to increase treedepth as long as any of the others is still (noticeably) increasing. One way might be to increase the treedepth by a fixed amount after every warmup window in which the variance and stepsize don’t increase much, as in the sketch below.
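(Just a sketch of that rule; the 10% threshold, the +2 increment, and the cap of 10 are numbers I made up.)

```python
import numpy as np

def update_max_treedepth(max_treedepth, prev_metric, new_metric,
                         prev_stepsize, new_stepsize,
                         increment=2, cap=10, rel_tol=0.10):
    """Hypothetical rule: only raise the treedepth cap once the rest of the
    adaptation has stalled.

    prev_metric / new_metric are the diagonal variance estimates from the last
    two warmup windows (numpy arrays). If neither the variances nor the step
    size grew by more than rel_tol (relative), assume the cheap improvements
    are exhausted and allow deeper trees; otherwise keep the current cap."""
    var_growth = np.max(new_metric / prev_metric) - 1.0
    step_growth = new_stepsize / prev_stepsize - 1.0
    if var_growth < rel_tol and step_growth < rel_tol:
        return min(max_treedepth + increment, cap)
    return max_treedepth
```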
Hmm, I dunno, I do think the lower treedepth thing is most appropriate for very early warmup (where you’re just trying to get close to a solution). Once you’re vaguely there, MCMC away.
The high treedepths help the MCMC make up for the bad adaptation. The bad adaptation means you’ll probably end up taking more leapfrogs per draw than in sampling.
I think the gamble would be that the momentum resamplings are more important in early adaptation than getting U-turns, which is probably something that varies model to model. Will be curious how things turn out whatever you decide to do.
What I thought is that there sometimes is the situation where the sampler has already found the typical set, but it’s much larger than the sampler knows. One problem that can occur with the current warmup implementation (I think at some point I tested to verify this actually happens) is that it can only increase the variance estimate by a certain amount each adaptation phase (how much depends on the treedepth and other things). So depending on how bad the estimated variance is (and other influencing factors), it can take several adaptation phases to get the correct estimate. The problem is that as long as the variance estimate is much too low it runs very deep trees until it has increased the value sufficiently, which takes a long time. Running these deep trees is not necessary to increase the value; running a higher number of adaptation phases with shorter trees is more efficient. (There’s a toy illustration of this effect at the end of this post.)
Of course this is just one particular issue that can occur. I haven’t thought too much about this recently; I just wanted to point out that this particular effect exists. There are other issues that might make it necessary to accept this particular one. For example, something I once encountered was that the estimated variance actually decreased after each adaptation phase when the treedepth limit was set too low, even though the true variance of the posterior was much larger (I think because there were correlations that slowed exploration).
So in summary: There seems to be at least one reason (which I tried to explain) why it could make sense to restrict treedepth for a while even after the typical set has been found.
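To make the mechanism from a couple of paragraphs up concrete, here’s a toy version of it. It’s not HMC, just a 1-D random-walk sampler whose step scale is tied to the current variance estimate, with the estimate updated once per window (all the numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
true_sd = 50.0   # true posterior scale
est_sd = 1.0     # badly initialized "metric"
window = 50      # draws per adaptation window
x = 0.0

def logp(x):     # 1-D Gaussian target
    return -0.5 * (x / true_sd) ** 2

for w in range(8):
    draws = np.empty(window)
    for i in range(window):
        prop = x + est_sd * rng.standard_normal()   # step scale tied to current estimate
        if np.log(rng.uniform()) < logp(prop) - logp(x):
            x = prop
        draws[i] = x
    est_sd = draws.std()   # window-based update, like the windowed metric adaptation
    print(f"window {w}: estimated sd = {est_sd:6.2f}   (true = {true_sd})")
```

Each window can only grow the scale estimate by a limited factor (roughly the ground a short, badly tuned chain can cover), so it takes several windows to approach the true scale of 50. That’s the sense in which several cheap windows can do the job of one expensive one.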

The high treedepths help the MCMC make up for the bad adaptation. The bad adaptation means you’ll probably end up taking more leapfrogs per draw than in sampling.
Yes, but if you can choose between better adaptation and higher treedepth, then better adaptation is the better option. My point was that in some situations you can improve the adaptation without using a high treedepth, and then only turn to higher treedepth once you can’t get further along that path.

Will be curious how things turn out whatever you decide to do.
I’m not sure what you mean. I don’t have a serious project on adaptation. I’m just forwarding some thoughts I had to you, since you seem to be working on that topic.

Running these deep trees is not necessary to increase the value; running a higher number of adaptation phases with shorter trees is more efficient.
Yeah this makes sense. @avehtari talks about this as models that are rate limited by the momentum resampling more than the HMC trajectories.

I’m just forwarding some thoughts I had to you, since you seem to be working on that topic.
Oh oh I see! @Lu.Zhang is working on adaptation stuff now (pinging her so she sees this).
Hi Ben,
Thanks for pinging me! A very interesting discussion on the treedepth thing. I haven’t had much time to follow up on the project since Andrew brought it up in our last meeting, and neither has Andrew. But I will keep an eye on the related topics. Thank you so much for your help!
Best,
Lu Zhang

@avehtari talks about this as models that are rate limited by the momentum resampling more than the HMC trajectories.
The way I see it the limitation is kinda the number of adaptation phases. Sometimes you just need a certain number of those, and getting through them with low treedepth is cheaper than with high treedepth. High treedepth can reduce the number of adaptation phases required, but it’s more expensive than running more phases with shorter trees. In such a situation the current warmup routine can take a long time, because it has both a high treedepth and also ramps up the number of samples per warmup phase very quickly, very early.
Of course at some point treedepth has to be increased, it just shouldn’t be too early.
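Rough back-of-the-envelope version of that trade-off, treating a depth-d tree as costing about 2^d gradient evaluations per draw in the worst case (all the counts here are made up, just to show the arithmetic):

```python
# Worst case, a depth-d NUTS tree costs about 2**d gradient evaluations per draw.
def warmup_cost(n_phases, draws_per_phase, treedepth):
    return n_phases * draws_per_phase * 2 ** treedepth

# Fewer phases with deep trees vs. more phases with short trees:
print(warmup_cost(n_phases=3, draws_per_phase=50, treedepth=10))  # 153600 gradient evals
print(warmup_cost(n_phases=6, draws_per_phase=50, treedepth=6))   #  19200 gradient evals
```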

Of course at some point treedepth has to be increased, it just shouldn’t be too early.
Do you have any experiments doing this? You can limit max tree depth and adjust the windows to see how it works compared to not limiting tree depth.
It would be harder to evaluate something that varied, but as long as it did it per block, it’s possible to pull out the currently adapted metric (inverse mass matrix) and stepsize, so it’d be possible to restart at the new position.
In phase I, we need long enough chains to not devolve to a random walk. Otherwise convergence is going to be quadratic rather than nearly linear.
In phase II, we need good enough mixing to evaluate covariance.
Phase III is just step size, so covariance estimates are fixed at that point.
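For reference, with the defaults (init_buffer=75, window=25, term_buffer=50, 1000 warmup iterations) that phase I / II / III layout works out to slow windows of 25, 50, 100, 200, and 500 draws. Here’s a sketch that reproduces the schedule (not a line-for-line port of the Stan code):

```python
def slow_windows(num_warmup=1000, init_buffer=75, term_buffer=50, base_window=25):
    """Sketch of the phase II window layout: windows double in size, and the
    last one is stretched to the end of the slow region (phase I = init_buffer,
    phase III = term_buffer)."""
    slow_end = num_warmup - term_buffer
    windows, start, size = [], init_buffer, base_window
    while start < slow_end:
        end = start + size
        if end + 2 * size > slow_end:   # next window wouldn't fit: absorb the rest
            end = slow_end
        windows.append((start, end))
        start, size = end, 2 * size
    return windows

print(slow_windows())  # [(75, 100), (100, 150), (150, 250), (250, 450), (450, 950)]
```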

The way I see it the limitation is kinda the number of adaptation phases.
Well hopefully the stuff at the top of the thread addresses this a bit. Instead of doubling to add more warmup draws, warmup draws are added in fixed size hunks and the previous draws aren’t immediately thrown away.
So then there’d need to be a criterion for increasing treedepth added on top.

Do you have any experiments doing this? You can limit max tree depth and adjust the windows to see how it works compared to not limiting tree depth.
I’m 99% sure that I did experimentally confirm this behaviour, and that limiting treedepth did indeed help. I only tested a few particular cases, though, which were created specifically to provoke this issue. I didn’t do broad testing on real-world models.
In general, I brought up the point about treedepth not because I have any substantial insight on it, but because I had the impression that it is worth considering and I hadn’t seen it mentioned in this thread so far.

In phase I, we need long enough chains to not devolve to a random walk. Otherwise convergence is going to be quadratic rather than nearly linear.
In phase II, we need good enough mixing to evaluate covariance.
Phase III is just step size, so covariance estimates are fixed at that point.
The idea was that this could fit between phases I and II, so it would start after convergence and before anything else. It would basically just give phase II a better starting point (a better initial estimate of the metric). Since treedepth would be limited, it wouldn’t take much computation time. The question is whether it would bring other drawbacks, like sometimes actually making the metric worse.

Well hopefully the stuff at the top of the thread addresses this a bit.
Yes, one of the reasons why I’m quite excited about the campfire project. The treedepth thing would be much less important then, hopefully. But I suspect it would still make a difference.