I’ve seen a few places say that we shouldn’t / can’t rely on the warmup phase of NUTS for our posteriors, but I don’t understand why this is so.
My understanding of NUTS is that the adaptation phase is running HMC under some settings, then change those settings, and repeat until it is done - in which case, the model should converge during warmup, but at different speeds. When I look at traceplots, it still looks like it converges during warmup.
Is it because the warmup phase is deliberately pushing the step size to a large enough point that it has many divergent transitions, and therefore the sampler isn’t exploring the probability space properly, even though it looks to have got to more or less the right area (i.e. it knows where the hill is, but not what the hill actually looks like)?
In most cases step 1 will happen relatively quickly, and depending on the complexity of your posterior as well as the choices of initialisation values, step 2 can take somewhat longer.
This Step 2 can be seen visually in this below image (taken from Michael Betancourts excellent intro to HMC):
Here, the left of the image shows the iterations of a sampler first finding the typical set, and then exploring it. The right of the image shows the bias of the estimate as the number of iterations increases.
Naturally, to reduce bias in our inferences we discard the early iterations where the sampler was still identifying the typical set. The current default of discarding 50% of samples is more of a safe and conservative choice to be as confident as possible that no biasing iterations remain (this has been the default across programs such as JAGS and BUGS, where these iterations were called ‘burn-in’).
If it appears that the sampler is converging quickly, then you can choose a shorter warmup length (e.g., only discarding the first 25% of samples). But I would strongly recommend performing a sensitivity analysis and checking that your inferences and estimates are consistent with a longer warmup - just to be safe.
Just a quick note that if you try to set your warmup as 150 iterations or less, Stan will show a warning - since these are the default iterations used for the tuning of HMC parameters.
Hopefully that answers your question!
@betanalpha would also be able to provide a much rigorous perspective here, so I’m just tagging in case there are any key details that I’ve missed
@andrjohns did you mean that (2) happens relatively quickly and (1) takes longer? The dual averaging for the step size settles pretty quickly, but the windowed adaptation for the metric is slow (and then the step size needs to be re-updated every time the metric updates). Both the dual averaging and the metric adaptation can only adapt to the posterior itself once the typical set has been found.
@David2 I’m pretty sure there’s no guarantee of detailed balance when warmup is still happening, so even if standard convergence diagnostics look ok (and Stan’s folded r-hats should be sensitive here), my guess is that you’d meet with some skepticism if you were to try to use warmup samples for inference.
Thanks both for the great responses. Let’s say that I ask Stan to do 1000 warmup iterations and 1000 sampling iterations:
Tuning the HMC parameters uses 150 iterations.
Finding the typical value of the posterior uses 250 iterations.
Actual sampling from the posterior for 600 iterations, but labelled as warmup.
Actual sampling for my 1000 pre-specified iterations.
Stan will then try to give me only the 1000 I originally asked for, even though 1600 are ‘good’ samples.
I suppose my question is really in two parts:
When Stan is tuning the HMC parameters, is it making any attempt at convergence, or is it testing one set of HMC parameters, then going back to the initial value and seeing how it gets on?
If I am happy that I have found convergence before my warmup period is finished, is there any good reason to not use those 600 warmup samples? Put another way, is Stan doing something different for those 600 warmup samples compared to the 1000 actual samples?
If you ask Stan for 1000 warmup iterations, it will spend all 1000 of those iterations tuning the step size, and most of them (I think iterations 76-950 by default) tuning the mass matrix.
When warmup finishes, all of the tuning stops and the HMC parameters are fixed for the entire sampling phase. It is at this moment that proofs of detailed balance begin to apply.
The convergence of a Markov chain actually proceeds in three basic phases – first the Markov chain has to find the typical set, then it has to run long enough to expand into the typical set, then finally it starts to refine the exploration. For more discussion and figures see Section 3.2.1 of Markov Chain Monte Carlo in Practice.
Historically “burn-in” has referred to that first phase, and excluding the “burn-in” iterations effectively removes the bias from including points far outside of the typical set. One of the benefits of Hamiltonian Monte Carlo is that the momentum resampling allows Markov chains to fall into the typical really quickly and so in theory only a small number of iterations should need to be discarded.
Stan’s “warmup”, however, does more than just find the typical set. In fact by default Stan’s warmup allocates only 100 iterations to finding the typical set! The rest of the time it dedicated to adapting the sampler configuration, in particular the numerical Hamiltonian integrator step size and the inverse metric that defines the kinetic energy. This adaptation is informed by exploration of the typical set, in particular Markov chain Monte Carlo estimates built up from Markov chains that have found the typical set. In particular the adaptation needs to run long enough for Markov chain Monte Carlo estimators to become somewhat accurately and inform useful updates to the sampler configuration. This takes time and the bulk of Stan’s warmup is dedicated to this adaptation.
Technically these latter warmup iterations do contain information about the typical set. The problem is that we don’t know how to combine them together to inform consistent posterior expectation value estimates. More formally during adaptation the sampler adaptation is continuously evolving which means that the iterations are generated from different Markov chains. Almost all of our Markov chain Monte Carlo theory, however, comes from working with a Markov chain generated by a fixed transition*.
[ * A technical note: Stan’s sampler doesn’t technically satisfy detailed balance, which is a sufficient but not necessary condition for preserving the target distribution. An adapting Markov transition that does satisfy detailed balance, however, would do so at each iteration regardless of its configuration. The problem is that it’s not the same detailed balance condition that is being satisfied across iterations. Definitely a pedantic point! ]
There is some theory regarding Markov chain Monte Carlo estimators from adapting Markov transitions but it is very limited and in practice the techniques have proven to be far too fragile to be of much benefit.
Anyways @andrjohns and @jsocolar have already said the bulk of warmup cost goes into adapting the sampler to the specific target distribution which is inherently expensive as it requires decent quantifications of the typical set. Finding the typical set is usually quite fast. That said for particularly nasty problems the sampler might not find the typical set within the initial 100 iteration window in which case the adaptation would be biased by behavior external to the typical set. This is often the cause for poor adaptation and can sometimes be resolved by tweaking the warmup configuration or the initialization procedure.