Post-warmup acceptance approaches zero

Hi all,

I’m a fairly experienced Stan user, but I’ve come across a new issue that I haven’t seen before or found documented elsewhere, and I am looking for some troubleshooting insights. Specifically, the acceptance probability after warmup falls from moderate to close to zero, soon after the start of sampling.

I’m running a variant of a dynamic LDA model (which I know is difficult for NUTS due to the multimodal posterior). I’ve built the model up bit by bit, and have found Stan does a good job of estimating nested submodels. However, in my final specification (which includes one additional dynamic component), I find that the post-warmup acceptance probability quickly falls to zero.

I’m running this using cmdstan-2.18. Checking the output, there are several things of note:

  • The acceptance probability approaches zero within 50 iterations of post-warmup sampling, but there are no divergences or anything else. Moreover, Stan doesn’t return any messages after sampling. (I’m not sure if I should expect such messages using cmdstan? I typically use rstan)

  • The first few draws after warm-up have an acceptance probability below the target (~0.6 instead of the target 0.95)

  • The sampler hits the maximum treedepth (10).

  • Even though the acceptance probability collapses to zero, if you look at the point estimates where it gets stuck, they are similar to the point estimates from the submodels (for the parameters that are the same). On the submodels, there was no such acceptance rate problem.

This is a fairly complex model that takes a while to run, which makes it a bit difficult to troubleshoot via experimentation, which would be my normal approach. I’m thus wondering if these sampler pathologies are representative of any specific issue. Has anyone encountered something similar? Is this problem typical of…

  1. the warm-up hasn’t worked and needs longer for adaptation (since the model takes a while to run and I’m still experimenting with the specification, I’ve only been running short chains)?

  2. setting too low of a maximum treedepth?

  3. something else?

Any insights are appreciated.

Thanks very much,
Ryan

If you never hit the maximum treedepth, then it is not too low. If not (1), then I would guess (3): the ultimate model you are estimating is not amenable to MCMC.

The model does hit the maximum treedepth, and I’m running a version where that’s increased. But it takes a while to run, so I’m trying to understand the potential drivers of the problem in the meantime, as I’m not sure that could be the central problem. My understanding is that maximum treedepth is just a matter of efficiency: could hitting the max treedepth alone be responsible for collapsing acceptance rate after warmup?

I know the model is not ideally suited for NUTS, but the nested submodels have worked fine, and I haven’t introduced a discontinuity in the parameter space. Is there another aspect of the model that would make it not amenable to NUTS, that may result in the collapsing acceptance stat?

Hitting the maximum treedepth is not as big a problem as divergences. If you only hit the maximum treedepth a few times, then it is likely not worth the time to re-run a long model. But if you hit the maximum treedepth often, that is a sign that it cannot get to some part of the parameter space from some other part of the parameter space in one transition, which is not very efficient. My guess is that it is not that relevant to the problem of the acceptance probability approaching zero.
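
To judge how often the treedepth limit is actually hit, one option is to tally the sampler columns in the CmdStan output CSV. The following is a minimal sketch, not an official tool: it assumes the usual sampler column names (`accept_stat__`, `stepsize__`, `treedepth__`, `divergent__`), and the function name `summarize_sampler_columns` and the toy `sample` text are made up for illustration.

```python
import csv
import io


def summarize_sampler_columns(csv_text, max_treedepth=10, skip_rows=0):
    """Tally sampler diagnostics from the text of a CmdStan output CSV.

    skip_rows drops leading draws (e.g. saved warmup draws), if any.
    """
    # CmdStan writes configuration, adaptation, and timing info as '#' comments.
    data_lines = [ln for ln in csv_text.splitlines()
                  if ln.strip() and not ln.startswith("#")]
    rows = list(csv.DictReader(io.StringIO("\n".join(data_lines))))
    rows = rows[skip_rows:]
    n = len(rows)
    accept = [float(r["accept_stat__"]) for r in rows]
    at_max = sum(int(float(r["treedepth__"])) >= max_treedepth for r in rows)
    return {
        "n_draws": n,
        "mean_accept_stat": sum(accept) / n,
        "frac_max_treedepth": at_max / n,
        "n_divergent": sum(int(float(r["divergent__"])) for r in rows),
        "last_stepsize": float(rows[-1]["stepsize__"]),
    }


# Toy data invented for illustration only; not output from the model in this thread.
sample = """# model = toy
lp__,accept_stat__,stepsize__,treedepth__,n_leapfrog__,divergent__,energy__,theta
-7.1,0.61,0.002,10,1023,0,9.3,0.40
-7.1,0.02,0.002,10,1023,0,9.9,0.40
-7.1,0.00,0.002,10,1023,0,10.4,0.40
-7.0,0.30,0.002,9,511,0,9.1,0.41
"""

print(summarize_sampler_columns(sample))
```

With a real run, pass the text of the output file instead of `sample`. A `frac_max_treedepth` close to 1 together with `n_divergent` of 0 would match the pattern described in the original post.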

The average acceptance probability falling to zero indicates that the numerical integrator within HMC is strongly deviating away from the energy level set that it’s meant to explore. This is typically caused by strong curvature, where the gradients rapidly oscillate across large magnitudes.
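
As a rough illustration of how curvature drives the integrator off the energy level set, here is a toy sketch in Python. It is not Stan's implementation: it assumes a one-dimensional normal target with scale `sigma`, a unit mass matrix, and a plain leapfrog integrator, and it uses the end-point Metropolis probability as a stand-in for the acceptance statistic. All function names are hypothetical.

```python
import math


def leapfrog_energy_error(sigma, step_size, n_steps, q0, p0):
    """Energy change H(end) - H(start) after n_steps leapfrog steps.

    Target: normal(0, sigma), so U(q) = q^2 / (2 sigma^2); kinetic energy p^2 / 2.
    Smaller sigma means larger curvature 1 / sigma^2.
    """
    def grad(q):
        return q / sigma ** 2

    def energy(q, p):
        return 0.5 * q * q / sigma ** 2 + 0.5 * p * p

    q, p = q0, p0
    h_start = energy(q, p)
    p -= 0.5 * step_size * grad(q)          # initial half step for momentum
    for i in range(n_steps):
        q += step_size * p                  # full step for position
        if i < n_steps - 1:
            p -= step_size * grad(q)        # full step for momentum
    p -= 0.5 * step_size * grad(q)          # final half step for momentum
    return energy(q, p) - h_start


def accept_prob(energy_error):
    """Metropolis acceptance probability min(1, exp(-energy_error))."""
    if energy_error <= 0.0:
        return 1.0
    return math.exp(-energy_error)


# Same step size, two different curvatures.
for sigma in (1.0, 0.01):
    err = leapfrog_energy_error(sigma, step_size=0.1, n_steps=20, q0=sigma, p0=1.0)
    print(sigma, err, accept_prob(err))
```

For this target the leapfrog integrator is only stable when the step size is below `2 * sigma`, so a step size that works well in a low-curvature region can fail completely in a high-curvature one.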

When the average acceptance probability falls to zero the adaptation compensates by dropping the step size to smaller values to increase the accuracy of the numerical integrator. Smaller step sizes require many more steps to explore the same neighborhood of parameter space, which is why you are saturating the maximum treedepth and getting inefficient exploration.
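
The arithmetic behind this can be sketched as follows, assuming a transition at treedepth `d` takes at most `2**d - 1` leapfrog steps. The integration time of 10 is an arbitrary number chosen for illustration, and both function names are hypothetical.

```python
def max_integration_time(step_size, max_treedepth=10):
    """Longest trajectory (in integration time) reachable in one transition."""
    return step_size * (2 ** max_treedepth - 1)


def treedepth_needed(total_time, step_size):
    """Smallest treedepth whose step budget covers total_time."""
    depth = 0
    while step_size * (2 ** depth - 1) < total_time:
        depth += 1
    return depth


for eps in (0.1, 0.01, 0.001):
    print(eps, max_integration_time(eps), treedepth_needed(10.0, eps))
```

Each factor of two in step size costs roughly one extra level of treedepth to cover the same distance, which is why a collapsing step size quickly runs into the limit of 10.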

The fact that the average acceptance probability decays implies that either adaptation was terminated prematurely or that the curvature in your posterior is strongly varying. In the latter case the sampler would move from a well-behaved region to a poorly-behaved region and start falling apart.

This often happens when the target model is not strongly identified and part of the weakly identified submanifold features strong curvature, but it could be any number of pathologies.

This is very helpful, thank you. I’ll think more about this. I suspect I’ve been prematurely terminating the adaptation. I had also incorporated some identifying restrictions in the model (fixing some parameters to zero), which I initially thought were necessary. Now, I’m not so convinced, and I’m wondering if those same assumptions may actually be leading to the strong curvature you mention, although I’m not sure I have a great grasp of what strong curvature means in practice. In any case, this is a good starting point for me to structure my thinking.

We usually look at the second-order approximation of the curvature, namely the matrix H(\theta) = \nabla_{\!\theta} \,\, \nabla_{\!\theta} \log p(\theta \mid y) of second derivatives of the log posterior density, which is the scale Stan samples on. Trouble arises if that matrix varies a lot for different \theta, as it does in Neal’s funnel example (covered in the Stan users’ guide), where there is very high curvature in the neck and very low curvature in the body. There can also be trouble when the condition number of H(\theta) (the ratio of its largest to smallest eigenvalue) gets high. This also happens in the funnel, which is axis aligned: the vertical dimension is basically just normal, whereas the horizontal dimension gets arbitrarily badly conditioned down in the neck.
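
To make the funnel case concrete, here is a small sketch that evaluates the negated Hessian analytically. It assumes the two-dimensional funnel with y ~ normal(0, 3) and x ~ normal(0, exp(y/2)), so that log p(x, y) = -y^2/18 - y/2 - x^2 exp(-y)/2 up to a constant. The function names are hypothetical, and the condition number is taken over absolute eigenvalues since the Hessian is not definite everywhere.

```python
import math


def neg_hessian(x, y):
    """Negated matrix of second derivatives of the funnel log density at (x, y)."""
    a = math.exp(-y)                          # -d2/dx2
    b = -x * math.exp(-y)                     # -d2/dxdy
    c = 1.0 / 9.0 + 0.5 * x * x * math.exp(-y)  # -d2/dy2
    return [[a, b], [b, c]]


def condition_number(x, y):
    """Ratio of largest to smallest absolute eigenvalue of the 2x2 Hessian."""
    (a, b), (_, c) = neg_hessian(x, y)
    mean = 0.5 * (a + c)
    half_gap = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    eigs = [abs(mean + half_gap), abs(mean - half_gap)]
    return max(eigs) / min(eigs)


# Walk along the axis of the funnel from the neck (y < 0) to the body (y > 0).
for y in (-6.0, -3.0, 0.0, 3.0, 6.0):
    print(y, neg_hessian(0.0, y)[0][0], condition_number(0.0, y))
```

Along x = 0 the curvature in y stays fixed at 1/9 while the curvature in x is exp(-y), so a step size and metric tuned for one value of y are badly matched to another.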