Request for Volunteers to Test Adaptation Tweak

I ran some eight schools tests: 1000 fits of the centered model with the new vs. the old step size adaptation.

Divergences

As expected we got some more divergences. For the plots I threw away fits with >150 divergences.

In terms of quantiles of the divergence counts per fit:

New:
0%  25%  50%  75% 100% 
0   11   19   32  990 

Old:
0%  25%  50%  75% 100% 
0    8   15   28  540
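For anyone who wants to run something similar, here's a minimal sketch of this kind of comparison with CmdStanPy. The Stan file name, data values, and one-chain-per-fit settings below are illustrative assumptions, not the exact harness behind the numbers above:

```python
# Hypothetical harness: repeatedly fit the centered eight schools model and
# summarize the per-fit divergence counts.
import numpy as np
from cmdstanpy import CmdStanModel

data = {
    "J": 8,
    "y": [28, 8, -3, 7, -1, 1, 18, 12],
    "sigma": [15, 10, 16, 11, 9, 11, 10, 18],
}

model = CmdStanModel(stan_file="eight_schools_centered.stan")  # placeholder file name

divergence_counts = []
for seed in range(1000):
    fit = model.sample(data=data, chains=1, iter_sampling=1000,
                       seed=seed, show_progress=False)
    # divergent__ is 1 for every post-warmup iteration that diverged
    divergence_counts.append(int(fit.draws_pd()["divergent__"].sum()))

# Quantiles of divergence counts across the 1000 fits
print(np.quantile(divergence_counts, [0.0, 0.25, 0.5, 0.75, 1.0]))
```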

I guess this is roughly what we expected?

Yeah. In terms of divergences the tweak just rescales adapt_delta so it’s equivalent to running the old algorithm with a slightly lower adapt_delta or equivalently a higher step size. When a model is diverging both should lead to more divergent trajectories (modulo statistical variation).

I was writing up a summary of this change this morning, and had a question.

What is changing is the definition of the accept probability that adapt_delta targets. In the old adaptation, every point in the trajectory has an accept probability, and the overall accept probability is just the uniform average of these.

In the new adaptation, this average is reweighted according to the probabilities NUTS uses to sample the next draw from the trajectory.

Isn’t this sort of double counting? Like, the Metropolis accept probabilities are what we’d have if we didn’t do multinomial NUTS. Now we add multinomial NUTS on top so we don’t need the accept probabilities.

So reweighting the accept probability average with the multinomial probabilities seems like we’re doing the same thing twice?

Notational point – the sampler in Stan is significantly removed from that introduced in the NUTS paper and we’ve moved on from calling it NUTS to avoid confusion (many people assume that it’s still the specification in that paper). Right now “dynamic multinomial” is good, although I’ll introduce a new name soon.

Correct. The idea here is to presume hypothetical proposals from the initial state to each state in the trajectory, and then average over those proposals uniformly.

No, because we actually have two processes going on at once here. First we have the hypothetical proposals, and their corresponding acceptance probabilities. Second we have the process of choosing between those hypothetical proposals. Because multinomial sampling will prefer some states over others we want to emphasize those more common states in the proxy acceptance statistic.

In other words we want to emphasize the high weight states because those will be the most common. If we weight uniformly then we let low probability states dominate the acceptance statistic.
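To make the two averages concrete, here's a toy numerical sketch; the per-state acceptance probabilities and multinomial weights are made up purely for illustration:

```python
# Toy contrast of the two proxy acceptance statistics for one trajectory.
import numpy as np

# accept[i]: Metropolis acceptance probability of a hypothetical proposal
#            from the initial state to state i of the trajectory
# weight[i]: (unnormalized) multinomial weight with which the sampler would
#            actually select state i as the next draw
accept = np.array([0.95, 0.90, 0.40, 0.05])
weight = np.array([1.00, 0.80, 0.10, 0.01])

# Old adaptation: uniform average over all states in the trajectory
old_stat = accept.mean()

# New adaptation: average reweighted by the selection probabilities, so the
# states that are actually likely to be chosen dominate the statistic
new_stat = np.sum(weight * accept) / np.sum(weight)

print(f"uniform average:  {old_stat:.3f}")   # dragged down by low-weight states
print(f"weighted average: {new_stat:.3f}")
```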

Can you do a phone call sometime on this? My internet is sketchy where I’m at, but I’d like to talk through this. Still not getting it.

With the termination criterion robustified, the behavior of this tweak is a bit more as expected. The step size increases, which systematically decreases the number of leapfrog steps, although the difference is somewhat small given that the multiplicative expansion limits the distribution of leapfrog steps.

In the normal IID case you can definitely see the increased step sizes “stretching” the nominal behavior, pushing the thresholds to higher dimensions.

stepsize_report.pdf (454.4 KB)

@bbbales2 – any thoughts?

‘improved_alg’ has both the new U-turn and timestep behaviors conflated together though, right? What’s the difference between the new U-turn alone and the new U-turn + new adaptation?

Sorry, stale legend. The dark red labeled “v2.20.0” is actually the current version of develop with the additional termination checks, so this is the comparison between new u-turn and new u-turn plus new adaptation.

It looks like there’s a treedepth thing happening at ~250 dimensions with the IID normals, right? Like the N_eff is getting crazy high for the old stuff, but apparently it’s just doing more sampling because the N_eff/gradient is the same?

This is an artifact of the multiplicative expansion and the “resonances” in effective sample size per gradient evaluation. At D=250 develop jumps to the next treedepth, whereas the new adaptation won’t jump until a much higher dimension.

All the new adaptation does is stretch out the step size versus dimension curve, and hence everything else that depends on the step size (modulo dimensionality effects, so it’s not perfect). For example, consider D=50 for the develop implementation, which results in a step size of around 0.6. The new adaptation, however, doesn’t get to a step size of ~0.6 until D ~ 250, so the performance of develop at D=50 and the new adaptation at D=250 should be similar. At the very least the number of leapfrog steps is the same in those two circumstances.

Yo @ariddell and @ahartikainen. This is a new timestep adaptation @betanalpha came up with and we’re looking at putting it into the algorithms.

I’m gonna talk about it at the Stan meeting tomorrow. I assume a pull request will appear in the not so distant future if nobody has any great objections.

The tl;dr is that it makes things faster (see the PDF posted above) but comes with more divergences (see the divergence quantiles earlier in the thread).

If yah have comments, let them be heard!

Thanks! If it really comes with more divergences, will there be a way to get those back down by controlling adaptation accept target?

If a fit is diverging then there are fundamental problems (i.e. a breakdown of the treasured MCMC central limit theorem) no matter how many divergences there are, so in my opinion that’s a bit of a red herring.

The resulting step size will still be monotonic with adapt_delta, so increasing adapt_delta will still decrease the step size and potentially reduce divergences as with the current adaptation criteria.
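For example, a minimal sketch of bumping the target from CmdStanPy; the model and data file names here are placeholders, and rstan exposes the same setting via control = list(adapt_delta = ...):

```python
# Sketch: re-running a fit with a higher adapt_delta target to shrink the
# adapted step size and (hopefully) remove residual divergences.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="my_model.stan")        # placeholder model
fit = model.sample(data="my_data.json", adapt_delta=0.99)

# Count post-warmup divergences to see whether the higher target helped
print(int(fit.draws_pd()["divergent__"].sum()))
```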

I run a lot of models where I’ll get a few divergences in 4000 post-warmup iterations that can be removed by reducing step size. Is that still problematic? If so, we need to change our warning messages which recommend doing just that!

If reducing the step size by increasing the adapt_delta target removes the divergences then any bias in the MCMC estimators should be negligible, so the advice still stands. The problem is that if you have to decrease the step size too much then you end up significantly increasing the computational burden; it’s a brute-force kludge that we don’t want people to get in the habit of resorting to all the time.

Almost always those residual divergences indicate poor identifiability in the observational model that propagates to the posterior because of priors that are too diffuse (this happens a lot in hierarchical models – my hypothesis is that rstanarm and brms have to resort to such high default adapt_delta because their default GLM and hierarchical modeling priors are way too diffuse). It is much, much better to investigate that and figure out what prior information is necessary. This is definitely helped with better pedagogy (a big focus of my courses and case studies, which I’m writing as fast as I can) but ideally the user would take some responsibility. In other words a divergence should mean “there’s a modeling problem, I should investigate” and not “something weird happened and I want it to go away”.

This is probably true.

Oh yeah, the follow-up from the meeting was that nobody was bothered by it as long as the interface stayed the same (and of course everyone appreciates a performance bump).

I think Bob’s question is the last thing that needs to be checked and then this is good. So the question is: does upping adapt_delta still get rid of divergences in a similar way as it did previously?

Like some people did get mileage out of adapt_delta = 0.99 and 0.95 or whatever. I guess that means find a model where this helped previously and show that it still works. Presumably it will but may as well check.

@betanalpha I’ve heard this sort of thing from you many times and I never understand it. When you say the model needs better / more informative priors, do you mean ‘for the sake of the sampling geometry’, with no regard for the resulting inferences, or are you actually saying that somehow the sampling geometry needs to be fixed in order to give the inferences we wanted to make with the original model? Next time someone comes to me with a broken Bayesian model can I suggest maximum likelihood optimization as a fix, since it avoids all these worries about problematic tail geometry? ;)

The monotonic relationship between adapt_delta and the leapfrog integrator step size is unchanged by this update, and so increasing adapt_delta will by construction reduce the step size and potentially the number of divergences (i.e. if it reduced divergences before, it will reduce them now). Remember that conditioned on the same step size nothing changes here – the adapted step size will just be a bit larger.

If you want to verify this empirically then please go ahead, but given the upcoming 2.21 release I will start pushing hard if this can’t be accomplished by next week.

Any changes to the algorithm will change the exact results of a model fit. The recent addition of more checks changed the results seen by the interfaces, and this will too. The changes will just be more visible in the presence of a problematic model. Because the qualitative user experience doesn’t change, however, the authority remains on the algorithm side.

Divergences are almost always caused by degeneracies in the posterior density function, and typically nonlinear ones. For example the infamous funnel geometry is induced by the conditional distribution $\mathcal{N}(\theta \mid 0, \tau)$ where limited observations of $\theta$ are unable to identify both $\theta$ and $\tau$ marginally.
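As a toy picture of that funnel (Neal’s-funnel-style sampling with an arbitrary scale on $\log \tau$, not the eight schools model itself):

```python
# Toy picture of the funnel: as log(tau) shrinks, the conditional scale of
# theta collapses, producing the narrow neck that a fixed step size cannot
# resolve without diverging.
import numpy as np

rng = np.random.default_rng(0)

log_tau = rng.normal(0.0, 3.0, size=100_000)   # illustrative scale on log(tau)
theta = rng.normal(0.0, np.exp(log_tau))       # theta | tau ~ normal(0, tau)

# The conditional spread of theta varies over orders of magnitude with tau
for lo, hi in [(-9, -3), (-3, 3), (3, 9)]:
    mask = (log_tau > lo) & (log_tau < hi)
    print(f"log(tau) in ({lo:+d}, {hi:+d}): sd(theta) ~ {theta[mask].std():.3g}")
```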

Because we are interested in Bayesian modeling we have to explore all of these degeneracies to get a faithful representation of our posterior distribution. Divergences tell us that HMC has been unable to achieve that exploration, and hence that our samples are giving an unfaithful representation of the posterior.
You can try to improve the sampler to achieve the desired exploration, or you can improve the model by incorporating more domain expertise in the prior model or even collecting more data.

Ultimately these degeneracies indicate that the data and current prior model are insufficient to inform the parameters – in other words they tell us when we have poor experimental design. This is why they are so awesome. When we’re not learning what we want to learn it is much more useful to go back and reinforce the experimental design with more prior information or improve the experimental design with additional observations or even auxiliary observations.
