New adaptive warmup proposal (looking for feedback)!

Here are some (unverified) thoughts I had when reading through the opening post:

Fixed-size windows are great. This probably solves the problem that the window size sometimes grows too fast (running large windows with a poorly adapted sampler takes very long and is unnecessary; it would be better to first adapt with short windows before running longer ones).

I think this adaptation routine will have problems with multimodal posteriors with strongly separated modes, as well as with phase changes (not sure about the terminology with respect to phase changes). It will probably tend to adapt to a local mode/phase and stop adaptation prematurely, or forget important previous adaptation samples, if the frequency of mode/phase switching is so low that switches don't always occur within each warmup window. If the posterior is suspected to exhibit this problem, then (very!) large windows must be chosen to ensure global adaptation.

I think merging warmup between chains is probably best kept optional, as differences in warmup between chains can be useful for assessing the reliability of the inference.

About using effective sample size to decide on window length and on how much warmup is enough, and its interaction with the mass matrix:

I think I read that nEff is mostly about the sample size with respect to calculating means or medians from the samples. What if the user is interested in the more extreme parts of the distribution, such as when computing credible intervals, or in the correlations between parameters?

I suspect that the computation of nEff depends on the assumption that the correlation structure stays the same throughout the sample. This is not the case if the sample spans several adaptation windows. Thus, I think a modified version of nEff is needed, which computes the ratio nEff/n for each window separately and then uses the lowest of these ratios to estimate a "worst case nEff" for the entire multi-window period. Alternatively, nEff could be calculated for each window separately and summed when windows are merged.
The samples from different windows should be weighted (or thinned) according to each window's nEff/n ratio when calculating the metric (not forgetting that the computation of the metric employs regularisation); otherwise the "good" parts get overwhelmed by the bad ones and the metric ends up worse than the nEff calculation would suggest. A sketch of this idea is below.
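To make this concrete, here is a rough sketch of what I mean (purely illustrative; I'm assuming ArviZ's `ess` for the per-window nEff, and all function names are made up):

```python
import numpy as np
import arviz as az  # assumed available for the per-window ESS computation


def worst_case_ess(windows):
    """windows: list of (chains, draws) arrays for one parameter,
    one array per adaptation window.

    Compute nEff/n separately per window and use the smallest ratio
    as a conservative ("worst case") nEff for the merged sample."""
    ratios = [float(az.ess(w)) / w.size for w in windows]
    total_n = sum(w.size for w in windows)
    return min(ratios) * total_n


def summed_ess(windows):
    """Alternative: compute nEff per window and sum when windows are merged."""
    return sum(float(az.ess(w)) for w in windows)


def weighted_metric_variance(windows):
    """Weight each window's draws by its effective sample size when
    estimating a (diagonal) metric, so poorly mixing windows don't
    dominate.  Regularisation, as used in the real metric computation,
    is omitted here."""
    weights = np.array([float(az.ess(w)) for w in windows])  # effective draws
    weights = weights / weights.sum()
    means = np.array([w.mean() for w in windows])
    second_moments = np.array([(w ** 2).mean() for w in windows])
    overall_mean = np.sum(weights * means)
    return np.sum(weights * second_moments) - overall_mean ** 2
```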

I'm not sure how the calculation of Rhat within the warmup phase would be affected by this warmup routine.

1 Like

Thanks @Raoul-Kima for your thoughts. We have had similar thoughts on many of these points; below are comments where I differ.

We are aware of the potential problems and fixes, but just haven't included them here yet. We have a diagnostic for multimodality, and if it is detected we can warn the user and let the user decide what to do. Strongly separated modes are problematic beyond this adaptation, so making this adaptation conditional on the multimodality diagnostic doesn't make anything worse. Right now we recommend testing only with posteriors whose modes are not strongly separated. We are also aware of the challenges even when modes are not strongly separated. We are happy to see examples of problematic cases.

I'm not aware of anyone regularly looking at the differences in warmup between chains; usually people look only at differences between chains after warmup. It is not clear that merely sharing information about the scale of the posterior or the local curvature of the posterior would make the chains resemble each other so much during warmup that, in case of problems, we would falsely assume mixing is good. We are happy to see examples of problematic cases.

We are using both Bulk-ESS and Tail-ESS ([1903.08008] Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC), although currently we are examining only lp__ during warmup, as computing Rhats and ESSs for thousands or millions of parameters can take a lot of time. We could add an option so that the user can decide which variables are analysed.
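For reference, a minimal sketch of the kind of check we mean, using ArviZ on the lp__ draws only (the extraction of the draws is not shown; this is illustrative, not the actual implementation):

```python
import arviz as az


def warmup_diagnostics(lp):
    """lp: (chains, draws) array of lp__ values from the current warmup
    window(s).  Returns the three quantities we monitor."""
    bulk_ess = float(az.ess(lp, method="bulk"))   # Bulk-ESS
    tail_ess = float(az.ess(lp, method="tail"))   # Tail-ESS
    rhat = float(az.rhat(lp, method="rank"))      # rank-normalized split-Rhat
    return rhat, bulk_ess, tail_ess
```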

It’s just a warmup where we hope to get some speed-up, it’s not yet the actual sampling and more esoteric diagnostics can be left for that part.

We are aware of this and could take it into account. The good thing is that this is still just warmup, and a small reduction in efficiency in favor of simplicity might not be a problem.

The bad ones should not be included at all in the approach.

I'm not sure what "this" refers to: our original suggestion, or what you wrote in the last paragraph?

1 Like

@Raoul-Kima Oh crap, my brain has been nuked by the US elections and I forgot to respond.

@avehtari thanks for grabbing this.

Yeah, but we don’t really expect NUTS to work well with multimodal posteriors or things where curvature is changing anyway. In these cases I don’t really expect there to be a right Euclidean metric to pick!

I’m with Aki on this stuff. It’s not that the bad stuff never happened before in warmup. And it’s not like we were really checking before. Now we’re checking for it and trying to compensate.

It’s entirely possible that we hit models that are more fragile because of this, but that’s not what I’ve seen. In my small amount of experience, the accel_spline example in the original post adapts more reliably with this new stuff than the old stuff (I was gonna give details but it’s been so long since I’ve done the experiment I’m scared I’ll get them wrong – but I think this is true for the campfire implementation – still working on the MPI version).

2 Likes

I think a lot of what we’re doing here is automating what we recommend people do for robustness – checking Rhats and Neffs.

These are annoying to check especially with lots of parameters so I’m pretty stoked about that, at the very least.

4 Likes

Thanks for the answers. Sounds great. Below I'll only answer points where I have something to add.

Yeah, differences after warmup, caused by warmup. That's what I meant. I don't have a good example at the moment, which supports your view that it's not very important.

I don’t think there is a clear line between good and bad, as every adaptation window will be slightly different. But maybe it’s clearer than I thought and/or not a problem in practice.

Not really anything specific. I just wanted to say that I didn't spend any time thinking about how Rhat might work with the multi-window thing; I only thought about nEff. I just mentioned it because I thought similar considerations probably apply there.

Sounds great!

nEff (or ESS) is not independent of Rhat, as its computation uses the subcomponents of Rhat.
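For anyone following along, here is roughly how the pieces connect (using the notation of the paper linked above, with $M$ chains of $N$ draws each; $\bar{\theta}_{\cdot m}$ and $s_m^2$ are the within-chain mean and variance):

$$
W = \frac{1}{M}\sum_{m=1}^{M} s_m^2, \qquad
B = \frac{N}{M-1}\sum_{m=1}^{M}\left(\bar{\theta}_{\cdot m}-\bar{\theta}_{\cdot\cdot}\right)^2, \qquad
\widehat{\mathrm{var}}^{+} = \frac{N-1}{N}\,W + \frac{1}{N}\,B,
$$

$$
\widehat{R} = \sqrt{\frac{\widehat{\mathrm{var}}^{+}}{W}}, \qquad
\hat{\rho}_t = 1 - \frac{W - \frac{1}{M}\sum_{m=1}^{M} s_m^2\,\hat{\rho}_{t,m}}{\widehat{\mathrm{var}}^{+}}, \qquad
\mathrm{ESS} = \frac{MN}{1 + 2\sum_{t=1}^{T}\hat{\rho}_t},
$$

so the ESS is built from the same $W$ and $\widehat{\mathrm{var}}^{+}$ that appear in $\widehat{R}$.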

So how is everyone feeling about campfire these days? The recent post inquiring about general recommendations for setting warmup duration for very-long-compute-time models reminded me about campfire and its seeming suitability for that case, if we haven't encountered any major issues with it yet.

Still in evaluation. @yizhang did the legwork to get most of it implemented with MPI in cmdstan a couple months ago, but I don’t think either of us have messed with it in a while: Cross-chain warmup adaptation using MPI

2 Likes

My plan was to add it to Torsten and try some PMX applications, but now with schools closed the kids take priority over code.

8 Likes

A question/suggestion:
How is treedepth intended to be handled during warmup? I always had the impression that it doesn't make sense to allow high treedepths before at least some basic warmup has taken place. I think this is one of the reasons why models sometimes take very long for the first iterations before speeding up massively during later warmup stages. Maybe there could be a rule that warmup starts with a low treedepth limit which then increases.

1 Like

A few other people have pointed this out as well (which is to say it's a good idea, not that it's already done). I like it too. The difficulty has been figuring out how to turn the treedepth back up. If you start adapting before the chains have really settled out it's bad news, and some models really do need the big treedepth (I think the default of 10 is good).

I think at one point I tried looking at when the distribution of magnitudes of the momenta settled out. In HMC you get samples from p(p, q) and you just throw the p samples away. Because you know the p(p) distribution exactly (it's your proposal), you can do a test. That's the logic at least. Hope it's correct.

I vaguely remember these distributions settling out too early, but I don't remember how I tested this (I never put a decision rule in warmup, for instance), so that might be worth looking at again.
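If anyone wants to poke at this again, here is a minimal sketch of the kind of check I mean (assuming a unit metric, so the squared norm of the momentum at the selected point of each trajectory should be chi-squared with d degrees of freedom; the function and variable names are made up):

```python
import numpy as np
from scipy import stats


def momenta_look_settled(momenta, alpha=0.01):
    """momenta: (draws, dim) array of the momenta at the accepted points
    of the trajectories collected so far in warmup.

    With a unit metric the proposal is p ~ N(0, I), so ||p||^2 should
    follow a chi-squared distribution with `dim` degrees of freedom once
    the chain has equilibrated in the joint (q, p) space.  A KS test
    against that reference gives a crude "has it settled?" signal."""
    dim = momenta.shape[1]
    squared_norms = np.sum(momenta ** 2, axis=1)
    _, p_value = stats.kstest(squared_norms, "chi2", args=(dim,))
    return p_value > alpha  # True: no evidence against the reference distribution
```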

2 Likes

I think we are probably talking about two different problems here. I think you’re talking about the phase where the chain has to reach the typical set. What I had in mind is the phase after that, where it tries to find out how large the typical set is. Thus my following thoughts might only make sense for the latter. I hadn’t thought about the former. My impression was that even after the typical set is reached the treedepth should be kept small at first.

Just brainstorming about it, some quick ideas / theories / thoughts:

Having too low a treedepth has a similar effect to having too low a variance estimate for a parameter or too low a stepsize. It probably doesn't make sense to increase the treedepth as long as either of the others is still (noticeably) increasing. So one way might be to increase the treedepth by a fixed amount after every warmup window in which the variance estimates and stepsize don't increase much (sketched below).
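Something like this, as a toy sketch (all names and thresholds are made up; "scales" stands for the per-parameter variance estimates):

```python
import numpy as np


def next_treedepth_limit(current_limit, old_scales, new_scales,
                         old_stepsize, new_stepsize,
                         tol=0.05, increment=1, cap=10):
    """Only raise the treedepth limit once the estimated scales and the
    stepsize have stopped growing noticeably between warmup windows."""
    scales_still_growing = np.any(
        np.asarray(new_scales) > (1.0 + tol) * np.asarray(old_scales))
    stepsize_still_growing = new_stepsize > (1.0 + tol) * old_stepsize
    if scales_still_growing or stepsize_still_growing:
        return current_limit                    # keep trees short while adapting
    return min(current_limit + increment, cap)  # otherwise allow deeper trees
```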

Hmm I dunno I do think the lower treedepth thing is most appropriate for very early warmup (where you’re just trying to get close to a solution). Once you’re vaguely there, MCMC away.

The high treedepths help the MCMC make up for the bad adaptation. The bad adaptation means you’ll probably end up taking more leapfrogs per draw than in sampling.

I think the gamble would be that the momentum resamplings are more important in early adaptation than getting U-turns, which is probably something that varies model to model. Will be curious how things turn out whatever you decide to do.

What I thought of is the situation where the sampler has already found the typical set, but the typical set is much larger than the sampler thinks. I think one problem that can occur with the current warmup implementation (I think at some point I tested to verify this actually happens) is that it can only increase the variance estimate by a certain amount in each adaptation phase (how much depends on the treedepth and other things). So depending on how bad the estimated variance is (and other influencing factors), it can take several adaptation phases to get the correct estimate. The problem is that as long as the variance estimate is much too low, the sampler runs very deep trees until it has increased the estimate sufficiently, which takes a long time. Running these deep trees is not necessary to increase the estimate; running a higher number of adaptation phases with shorter trees is more efficient.
Of course this is just one particular issue that can occur. I haven't thought much about this recently; I just wanted to point out that this particular effect exists. There are other issues that might make it necessary to accept this one. For example, something I also once encountered was that the estimated variance actually decreased after each adaptation phase when the treedepth limit was set too low, even though the true variance of the posterior was much larger (I think it was because there were correlations that slowed exploration).

So in summary: There seems to be at least one reason (which I tried to explain) why it could make sense to restrict treedepth for a while even after the typical set has been found.

Yes, but if you can decide between better adaptation and higher treedepth, then better adaptation is the better option. My point was that in some situations you can improve the adaptation without using a high treedepth, and then only use a higher treedepth once you can't get further along that path anymore.

I’m not sure what you mean. I don’t have a serious project on adaptation. I’m just forwarding some thoughts I had to you, since you seem to be working on that topic.

Yeah this makes sense. @avehtari talks about this as models that are rate limited by the momentum resampling more than the HMC trajectories.

Oh, oh, I see! @Lu.Zhang is working on adaptation stuff now (pinging her so she sees this).

1 Like

Hi Ben,

Thanks for pinging me! A very interesting discussion on the treedepth thing. I haven't had much time to follow up on the project since Andrew brought it up in our last meeting, and neither has Andrew. But I will keep an eye on the related topics. Thank you so much for your help!

Best,

Lu Zhang

1 Like

The way I see it, the limitation is kind of the number of adaptation phases. Sometimes you just need a certain number of those, and getting that done with a low treedepth is cheaper than with a high treedepth. A high treedepth can reduce the number of adaptation phases required, but is more expensive than running more phases with shorter trees. In such a situation the current warmup routine can take a long time, because it has both a high treedepth and also ramps up the number of samples per warmup phase very quickly and very early.
Of course at some point the treedepth has to be increased; it just shouldn't be too early.

1 Like

Do you have any experiments doing this? You can limit max treedepth and adjust the windows to see how it works compared to not limiting treedepth.
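For example, something along these lines from cmdstanpy (the model and data file names are placeholders; note this caps the treedepth for the whole run, not just warmup, so it's only a rough proxy for a warmup-only cap):

```python
from cmdstanpy import CmdStanModel

# Placeholder model/data names, just to illustrate the comparison.
model = CmdStanModel(stan_file="accel_spline.stan")

common = dict(data="accel_spline_data.json", chains=4, seed=1234,
              iter_warmup=1000, iter_sampling=1000)

# Baseline: default treedepth limit (10).
fit_default = model.sample(**common)

# Capped treedepth for the whole run.
fit_capped = model.sample(max_treedepth=5, **common)

# Compare adapted step sizes (and metrics, wall time, ...) between the runs.
print(fit_default.step_size, fit_capped.step_size)
```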

It would be harder to evaluate something that varied, but as long as it did so per block, it's possible to pull out the currently adapted metric (inverse mass matrix) and stepsize, so it'd be possible to restart at the new position.

In phase I, we need long enough chains to not devolve to a random walk. Otherwise convergence is going to be quadratic rather than nearly linear.

In phase II, we need good enough mixing to evaluate covariance.

Phase III is just step size, so covariance estimates are fixed at that point.

Well hopefully the stuff at the top of the thread addresses this a bit. Instead of doubling to add more warmup draws, warmup draws are added in fixed size hunks and the previous draws aren’t immediately thrown away.
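Roughly the difference between the two schedules, with made-up sizes just to illustrate:

```python
def doubling_windows(initial=25, total=1000):
    """Current Stan phase-II style: each metric window doubles in size."""
    windows, size = [], initial
    while sum(windows) + size <= total:
        windows.append(size)
        size *= 2
    return windows


def fixed_windows(size=100, total=1000):
    """Campfire-style: fixed-size hunks; earlier hunks are kept around and
    merged when the mixing diagnostics say they are still usable."""
    return [size] * (total // size)


print(doubling_windows())  # [25, 50, 100, 200, 400]
print(fixed_windows())     # [100, 100, ..., 100] (10 windows)
```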

So then there'd need to be a criterion for increasing treedepth added on top.

I'm 99% sure that I did experimentally confirm this behaviour, and that limiting treedepth did indeed help. I only tested a few particular cases though, cases that were specifically created to provoke this issue. I didn't do broad testing on real-world models.

In general, I brought up the point about treedepth not because I have any substantial insight into it, but because I had the impression that it is worth considering and I hadn't seen it mentioned in this thread so far.

The idea was that this could fit between phases I and II, so it would start after convergence and before anything else. It would just serve to give phase II a better starting point, basically (a better starting estimate of the metric). Since treedepth would be limited, it wouldn't take much computation time. The question is whether it would bring any other drawbacks, like actually making the metric worse sometimes.

Yes, that's one of the reasons why I'm quite excited about the campfire project. The treedepth thing would hopefully be much less important then, but I suspect it would still make a difference.