I’m teaching through the week and much of next week, but I might be able to sneak something in next week if you don’t get to it in time.

The new acceptance statistic doesn’t consider transitions to the initial point, so the weight of 1 = exp(0) has to be removed from the running log_sum_weight aggregator.
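As a sketch of what that means for the aggregator (hypothetical variable names and toy numbers, not the actual Stan C++): with the initial point excluded, the running log-sum-exp starts from the first non-initial state, and the two aggregates differ by exactly the initial point's weight of exp(0) = 1.

```python
import math

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# Toy log weights along a trajectory, relative to the initial point.
# The initial point itself always has log weight 0, i.e. weight exp(0) = 1.
log_weights = [0.0, -0.3, -1.2, -0.7]

# Old behavior: the initial point's log weight of 0 goes into the aggregator.
log_sum_old = log_weights[0]
for lw in log_weights[1:]:
    log_sum_old = log_sum_exp(log_sum_old, lw)

# New behavior: transitions to the initial point aren't considered, so the
# aggregator starts from the first non-initial state.
log_sum_new = log_weights[1]
for lw in log_weights[2:]:
    log_sum_new = log_sum_exp(log_sum_new, lw)
```

On the weight scale the two aggregates differ by exactly 1, which is the weight being removed.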

It is done in cycles, kinda. The metric and timestep are done a bit differently. The cmdstan manual has the best graphic of how the metric adaptation is done. It’s Figure 9.3 in the 2.17.1 manual at least.

By default there are about five windows of increasing size. The HMC draws from the nth stage are used to build the metric for the (n + 1)th stage.
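A rough Python sketch of that windowing schedule, using CmdStan's default buffer sizes (75 initial, 50 terminal, base window of 25, each slow window doubling); this is an illustration of the scheme in that figure, not the actual implementation:

```python
def expanding_windows(num_warmup=1000, init_buffer=75, term_buffer=50,
                      base_window=25):
    """Sketch of Stan-style expanding metric-adaptation windows: an initial
    fast buffer, a sequence of doubling slow windows for the metric, and a
    terminal fast buffer. Returns (start, end) iteration pairs."""
    windows = []
    start = init_buffer
    size = base_window
    end_of_slow = num_warmup - term_buffer
    while start < end_of_slow:
        end = start + size
        # If the next doubled window would overrun the terminal buffer,
        # absorb the remainder into this final window instead.
        if end + 2 * size > end_of_slow:
            end = end_of_slow
        windows.append((start, end))
        start = end
        size *= 2
    return windows
```

With the defaults this produces five windows, consistent with the description above; draws from each window feed the metric used in the next one.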

Timestep is being adapted continuously (from draw to draw). It’s all sorts of funky games, which is why we gotta throw the warmup stuff away in the end.
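For reference, the per-draw adaptation is Nesterov-style dual averaging. Here is a minimal Python sketch with the usual gamma/t0/kappa parameterization; this is an illustration of the scheme, not Stan's exact code:

```python
import math

class DualAveraging:
    """Sketch of dual-averaging step size adaptation: after every draw the
    step size is nudged so the accept statistic tracks the target delta
    (i.e. adapt_delta). Parameter names follow the usual convention."""

    def __init__(self, delta=0.8, eps0=1.0, gamma=0.05, t0=10.0, kappa=0.75):
        self.delta = delta              # target accept statistic
        self.mu = math.log(10.0 * eps0) # shrinkage point for log step size
        self.log_eps = math.log(eps0)
        self.log_eps_bar = 0.0          # averaged iterate, used after warmup
        self.h_bar = 0.0                # running average of delta - accept
        self.m = 0
        self.gamma, self.t0, self.kappa = gamma, t0, kappa

    def update(self, accept_stat):
        """Adapt the step size from draw to draw given the accept statistic."""
        self.m += 1
        eta = 1.0 / (self.m + self.t0)
        self.h_bar = (1 - eta) * self.h_bar + eta * (self.delta - accept_stat)
        self.log_eps = self.mu - math.sqrt(self.m) / self.gamma * self.h_bar
        w = self.m ** -self.kappa
        self.log_eps_bar = w * self.log_eps + (1 - w) * self.log_eps_bar
        return math.exp(self.log_eps)
```

Because the step size is jittering around from draw to draw like this, the warmup draws don't come from a fixed transition kernel, which is one reason they have to be discarded.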

Yeah. In terms of divergences the tweak just rescales adapt_delta so it’s equivalent to running the old algorithm with a slightly lower adapt_delta or equivalently a higher step size. When a model is diverging both should lead to more divergent trajectories (modulo statistical variation).

I was writing up a summary of this change this morning, and had a question.

What is changing is the definition of the accept probability that adapt_delta targets. In the old adaptation, every point in the trajectory has an accept probability, and the overall accept probability is just the uniform average of these.

In the new adaptation, this average is reweighted according to the probabilities NUTS uses to sample the next draw from the trajectory.

Isn’t this sort of double counting? Like, the Metropolis accept probabilities are what we’d have if we didn’t do multinomial NUTS. Now we add multinomial NUTS on top, so we don’t need the accept probabilities.

So reweighting the accept probability average with the multinomial probabilities seems like we’re doing the same thing twice?

Notational point – the sampler in Stan is significantly removed from that introduced in the NUTS paper and we’ve moved on from calling it NUTS to avoid confusion (many people assume that it’s still the specification in that paper). Right now “dynamic multinomial” is good, although I’ll introduce a new name soon.

Correct. The idea here is to presume hypothetical proposals from the initial state to each state in the trajectory, and then average over those proposals uniformly.

No, because we actually have two processes going on at once here. First we have the hypothetical proposals, and their corresponding acceptance probabilities. Second we have the process of choosing between those hypothetical proposals. Because multinomial sampling will prefer some states over others we want to emphasize those more common states in the proxy acceptance statistic.

In other words we want to emphasize the high weight states because those will be the most common. If we weight uniformly then we let low probability states dominate the acceptance statistic.
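A toy Python comparison of the two statistics (hypothetical log weights, not real sampler output) makes the difference concrete: the multinomial-weighted average pulls the statistic toward the accept probabilities of the high-weight states, while the uniform average lets the low-weight states drag it down.

```python
import math

def accept_stats(log_weights):
    """Compare the old (uniform) and new (multinomial-weighted) proxy
    acceptance statistics for a trajectory. log_weights[i] is the log
    weight of state i relative to the initial state, so the hypothetical
    Metropolis accept probability for state i is min(1, exp(log_weights[i])).
    Illustrative sketch, not Stan's actual code."""
    accept_probs = [min(1.0, math.exp(lw)) for lw in log_weights]

    # Old statistic: uniform average over the trajectory states.
    uniform = sum(accept_probs) / len(accept_probs)

    # New statistic: reweight by the multinomial probabilities the sampler
    # uses to select the next draw, emphasizing the more common states.
    total = sum(math.exp(lw) for lw in log_weights)
    probs = [math.exp(lw) / total for lw in log_weights]
    weighted = sum(p * a for p, a in zip(probs, accept_probs))
    return uniform, weighted
```

Here the low-weight state has a low hypothetical accept probability but also a low multinomial probability, so the weighted statistic sits above the uniform one.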

With the termination criterion robustified the behavior of this tweak is a bit more as expected. The step size increases which decreases the number of leapfrog steps systematically, although the difference is somewhat small given that the multiplicative expansion limits the distribution of leapfrog steps.

In the normal IID case you can definitely see the increased step sizes “stretching” the nominal behavior, pushing the thresholds to higher dimensions.

‘improved_alg’ has both the new U-turn and timestep behaviors conflated together though, right? What’s the difference between the new U-turn alone and the new U-turn plus the new adaptation?

Sorry, stale legend. The dark red labeled “v2.20.0” is actually the current version of develop with the additional termination checks, so this is the comparison between new u-turn and new u-turn plus new adaptation.

It looks like there’s a treedepth thing happening at ~250 dimensions with the IID normals, right? Like the N_eff is getting crazy high for the old stuff, but apparently it’s just doing more sampling cause the Neff/gradient is the same?

This is an artifact of the multiplicative expansion and the “resonances” in effective sample size per gradient evaluation. At D = 250 develop jumps to the next treedepth, whereas the new adaptation won’t jump until a much higher dimension.

All the new adaptation does is stretch out the step size versus dimension curve, and hence everything else that depends on the step size (modulo dimensionality effects, so it’s not perfect). For example, consider D = 50 for the develop implementation, which results in a step size of around 0.6. The new adaptation, however, doesn’t get to stepsize ~ 0.6 until D ~ 250, so the performance of develop at D = 50 and the new adaptation at D = 250 should be similar. At the very least the number of leapfrog steps is the same in those two circumstances.

Yo @ariddell and @ahartikainen . This is a new timestep adaptation @betanalpha came up with and we’re looking at putting in the algorithms.

I’m gonna talk about it at the Stan meeting tomorrow. I assume a pull request will appear in the not so distant future if nobody has any great objections.

If a fit is diverging then there are fundamental problems (i.e. a breakdown of the treasured MCMC central limit theorem) no matter how many divergences there are, so in my opinion that’s a bit of a red herring.

The resulting step size will still be monotonic with adapt_delta, so increasing adapt_delta will still decrease the step size and potentially reduce divergences as with the current adaptation criteria.

I run a lot of models where I’ll get a few divergences in 4000 post-warmup iterations that can be removed by reducing step size. Is that still problematic? If so, we need to change our warning messages which recommend doing just that!

If reducing the step size by increasing the adapt_delta target removes the divergences then any bias in the MCMC estimators should be negligible, so the advice still stands. The problem is that if you have to decrease the step size too much then you end up significantly increasing the computational burden; it’s a brute-force kludge that we don’t want people to get in the habit of resorting to all the time.

Almost always those residual divergences indicate poor identifiability in the observational model that propagates to the posterior because of priors that are too diffuse (this happens a lot in hierarchical models – my hypothesis is that rstanarm and brms have to resort to such high default adapt_delta because their default GLM and hierarchical modeling priors are way too diffuse). It is much, much better to investigate that and figure out what prior information is necessary. This is definitely helped with better pedagogy (a big focus of my courses and case studies, which I’m writing as fast as I can) but ideally the user would take some responsibility. In other words a divergence should mean “there’s a modeling problem, I should investigate” and not “something weird happened and I want it to go away”.