Request for Volunteers to Test Adaptation Tweak

With the termination criterion robustified, the behavior of this tweak is closer to what I expected. The step size increases, which systematically decreases the number of leapfrog steps, although the difference is somewhat small given that the multiplicative expansion limits the distribution of leapfrog steps.

In the IID normal case you can definitely see the increased step sizes “stretching” the nominal behavior, pushing the thresholds to higher dimensions.

stepsize_report.pdf (454.4 KB)

@bbbales2 – any thoughts?

‘improved_alg’ has both the new U-turn and timestep behaviors conflated together though, right? What’s the difference between the new U-turn alone and the new U-turn + new adaptation?

Sorry, stale legend. The dark red labeled “v2.20.0” is actually the current version of develop with the additional termination checks, so this is the comparison between new u-turn and new u-turn plus new adaptation.

It looks like there’s a treedepth thing happening at ~250 dimensions with the IID normals, right? Like the N_eff is getting crazy high for the old stuff, but apparently it’s just doing more sampling because the N_eff/gradient is the same?

This is an artifact of the multiplicative expansion and the “resonances” in effective sample size per gradient evaluation. At D=250 develop jumps to the next treedepth, whereas the new adaptation won’t jump until a much higher dimension.

All the new adaptation does is stretch out the step size versus dimension curve, and hence everything else that depends on the step size (modulo dimensionality effects, so it’s not perfect). For example, consider D=50 for the develop implementation, which results in a step size of around 0.6. The new adaptation, however, doesn’t get to a step size of ~0.6 until D ~ 250, so the performance of develop at D=50 and the new adaptation at D=250 should be similar. At the very least the number of leapfrog steps is the same in those two circumstances.
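To see where those treedepth jumps come from, here’s a rough Python sketch (not the actual Stan implementation; the scalings of step size and integration time with dimension are made-up stand-ins) of how the multiplicative trajectory expansion quantizes the number of leapfrog steps to powers of two:

```python
import math

def leapfrog_steps(integration_time, step_size):
    """Steps taken when the trajectory is grown by doubling: the smallest
    power of two whose total length covers the needed integration time."""
    treedepth = max(0, math.ceil(math.log2(integration_time / step_size)))
    return 2 ** treedepth

for D in (50, 100, 250, 500):
    step_size = 1.5 * D ** -0.25        # made-up adapted step size scaling
    integration_time = 1.0 * D ** 0.25  # made-up time needed to reach the U-turn
    print(f"D = {D}: step size ~ {step_size:.2f}, "
          f"leapfrog steps = {leapfrog_steps(integration_time, step_size)}")
```

Because the step count can only double, a slightly larger adapted step size keeps a given model on the lower plateau over a wider range of dimensions, which is the “stretching” described above.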

Yo @ariddell and @ahartikainen. This is a new timestep adaptation @betanalpha came up with that we’re looking at putting into the algorithms.

I’m gonna talk about it at the Stan meeting tomorrow. I assume a pull request will appear in the not so distant future if nobody has any great objections.

The tl;dr is that it makes things faster (see the step size report pdf upthread) but comes with more divergences (see the earlier posts in this thread).

If yah have comments, let them be heard!

Thanks! If it really comes with more divergences, will there be a way to get those back down by controlling adaptation accept target?

If a fit is diverging then there are fundamental problems (i.e. a breakdown of the treasured MCMC central limit theorem) no matter how many divergences there are, so in my opinion that’s a bit of a red herring.

The resulting step size will still be monotonic with adapt_delta, so increasing adapt_delta will still decrease the step size and potentially reduce divergences as with the current adaptation criteria.

I run a lot of models where I’ll get a few divergences in 4000 post-warmup iterations that can be removed by reducing step size. Is that still problematic? If so, we need to change our warning messages which recommend doing just that!

If reducing the step size by increasing the adapt_delta target removes the divergences then any bias in the MCMC estimators should be negligible, so the advice still stands. The problem is that if you have to decrease the step size too much then you end up significantly increasing the computational burden; it’s a brute-force kludge that we don’t want people to get in the habit of resorting to all the time.

Almost always those residual divergences indicate poor identifiability in the observational model that propagates to the posterior because of priors that are too diffuse (this happens a lot with hierarchical models – my hypothesis is that rstanarm and brms have to resort to such high default adapt_delta because their default GLM and hierarchical modeling priors are way too diffuse). It is much, much better to investigate that and figure out what prior information is necessary. This is definitely helped by better pedagogy (a big focus of my courses and case studies, which I’m writing as fast as I can) but ideally the user would take some responsibility. In other words a divergence should mean “there’s a modeling problem, I should investigate” and not “something weird happened and I want it to go away”.

This is probably true.

Oh yeah, the follow-up from the meeting was that nobody was bothered by it as long as the interface stayed the same (and of course everyone appreciates a performance bump).

I think Bob’s question is the last thing that needs to be checked and then this is good. So the question is: does upping adapt_delta still get rid of divergences in a similar way to how it did previously?

Like some people did get mileage out of adapt_delta = 0.99 and 0.95 or whatever. I guess that means find a model where this helped previously and show that it still works. Presumably it will but may as well check.
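For what it’s worth, a quick check along those lines could look something like this sketch (using CmdStanPy; funnel.stan is a hypothetical model that previously threw a few divergences at the default adapt_delta = 0.8, and this assumes the sampler diagnostic columns such as divergent__ are included in the draws, as they are by default):

```python
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="funnel.stan")  # hypothetical problematic model

for adapt_delta in (0.8, 0.95, 0.99):
    fit = model.sample(chains=4, adapt_delta=adapt_delta, seed=1234)
    # divergent__ is one of the sampler diagnostic columns in the output
    n_div = int(fit.draws_pd()["divergent__"].sum())
    print(f"adapt_delta = {adapt_delta}: {n_div} divergent transitions")
```

If the divergence count still drops as adapt_delta goes up, the old advice carries over.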

@betanalpha I’ve heard this sort of thing from you many times and I never understand it. When you say the model needs better / more informative priors, do you mean ‘for the sake of the sampling geometry’, with no regard for the resulting inferences, or are you actually saying that somehow the sampling geometry needs to be fixed in order to give the inferences we wanted to make with the original model? Next time someone comes to me with a broken Bayesian model can I suggest maximum likelihood optimization as a fix, since it avoids all these worries about problematic tail geometry? ;)

The monotonic relationship between adapt_delta and the leapfrog integrator step size is unchanged by this update, and so increasing adapt_delta will by construction reduce the step size and potentially the number of divergences (i.e. if it reduced divergences before, it will reduce them now). Remember that conditioned on the same step size nothing changes here – the adapted step size will just be a bit larger.
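For intuition, here’s a stripped-down sketch of dual-averaging step size adaptation in the style of Hoffman & Gelman (2014), which is what Stan builds on (this is not the exact Stan code, and the toy acceptance statistic is made up); the only point is that because the acceptance statistic decreases as the step size grows, a larger adapt_delta target necessarily settles on a smaller adapted step size:

```python
import math

def adapt_step_size(accept_stat, adapt_delta, n_iter=1000,
                    gamma=0.05, t0=10.0, kappa=0.75, eps0=1.0):
    """Dual-averaging sketch: drives the running acceptance statistic toward
    adapt_delta, so the adapted eps approximately solves accept_stat(eps) = adapt_delta."""
    mu = math.log(10.0 * eps0)
    log_eps = math.log(eps0)
    log_eps_bar = 0.0
    h_bar = 0.0
    for t in range(1, n_iter + 1):
        alpha = accept_stat(math.exp(log_eps))   # "observe" one transition
        eta = 1.0 / (t + t0)
        h_bar = (1.0 - eta) * h_bar + eta * (adapt_delta - alpha)
        log_eps = mu - math.sqrt(t) / gamma * h_bar
        w = t ** -kappa
        log_eps_bar = w * log_eps + (1.0 - w) * log_eps_bar
    return math.exp(log_eps_bar)

def toy_accept(eps):
    # Toy acceptance statistic: decreases monotonically as the step size grows.
    return math.exp(-eps)

for delta in (0.8, 0.95, 0.99):
    print(delta, round(adapt_step_size(toy_accept, delta), 3))
```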

If you want to verify this empirically then please go ahead, but given the upcoming 2.21 release I will start pushing hard if this can’t be accomplished by next week.

Any changes to the algorithm will change the exact results of a model fit. The recent addition of more checks changed the results seen by the interfaces, and this will too. The changes will just be more visible in the presence of a problematic model. Because the qualitative user experience doesn’t change, however, the authority remains on the algorithm side.

Divergences are almost always caused by degeneracies in the posterior density function, and typically nonlinear ones. For example the infamous funnel geometry is induced by the conditional distribution \mathcal{N}(\theta | 0, \tau) where limited observations of \theta are unable to identify both \theta and \tau marginally.
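To make that concrete, here’s a toy numerical illustration (not taken from the report above; the normal(0, 3) prior on log tau is a made-up stand-in) of why no single step size works across a funnel:

```python
import numpy as np

def funnel_logpdf(theta, log_tau):
    """Toy funnel in the spirit of the example above:
    log_tau ~ normal(0, 3), theta | tau ~ normal(0, tau)."""
    tau = np.exp(log_tau)
    return (-0.5 * (log_tau / 3.0) ** 2
            - 0.5 * (theta / tau) ** 2 - np.log(tau))

# The curvature in theta scales like 1 / tau^2, so the step size that is
# stable deep in the neck (small tau) is orders of magnitude smaller than
# the one that is efficient in the mouth (large tau).
for log_tau in (-4.0, 0.0, 4.0):
    curvature = np.exp(-2.0 * log_tau)   # -d^2 logp / d theta^2 = 1 / tau^2
    print(f"log_tau = {log_tau:+.0f}: stable step size ~ {1 / np.sqrt(curvature):.1e}")
```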

Because we are interested in Bayesian modeling we have to explore all of these degeneracies to get a faithful representation of our posterior distribution. Divergences tell us that HMC has been unable to achieve that exploration, and hence that our samples are giving an unfaithful representation of the posterior.
You can try to improve the sampler to achieve the desired exploration, or you can improve the model by incorporating more domain expertise in the prior model or even collecting more data.

Ultimately these degeneracies indicate that the data and current prior model are insufficient to inform the parameters – in other words they tell us when we have poor experimental design. This is why they are so awesome. When we’re not learning what we want to learn it is much more useful to go back and reinforce the experimental design with more prior information or improve the experimental design with additional observations or even auxiliary observations.

Right, the solutions of either ‘reparameterize to improve geometry’, ‘get extra data so we don’t have to explore crazy geometry’ or ‘improve sampler to cope with geometry’ all seem very reasonable. I’m also comfortable with the idea of iterative model building and treating priors as something akin to soft model structure, I’m just uneasy with the idea of updating them based on the difficulties encountered using a specific algorithm – it seems quite conceivable that algorithmic improvements occur that resolve the original difficulties…

So you run a much more expensive sampler and better resolve those pathological corners…to reveal that your parameters weren’t well identified and you can’t get the inferences you wanted anyways? Sampling doesn’t happen in a vacuum. Divergences almost always indicate a weak identifiability problem which strongly hints at downstream inferential problems. One can try to brute force the computation and deal with the identifiability issues later down the road, but why wait?

Of course one doesn’t want to play with the priors just to get rid of the divergences. One has to follow the divergences to confirm an identifiability problem consistent with one’s domain expertise of the experiment before trying to increase the domain expertise embodied in the prior model. I have yet, however, to see a case where divergences don’t indicate a real identifiability problem, so my prior is strongly concentrated on that being the most likely possibility in a given fit.

I don’t think we should reduce testing burden because a release deadline is coming up. Like they say on the subway, there’s another train behind this one.

In this case there isn’t another train on the horizon, but there is a reason to get things in order for the next release (an actual naming of the sampler).

Perhaps more relevant to process, the concern being raised is verified to not be an issue by inspection: by construction the new adaptation target is always lower bounded by the old adaptation target, and hence will always yield a higher adapted step size (modulo noise during warmup). The rest of the sampler behavior depends only on that step size, so the consequences after that point are straightforward.
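Spelled out, with \alpha_{\text{new}}(\epsilon) and \alpha_{\text{old}}(\epsilon) denoting the new and old adaptation statistics as functions of the step size \epsilon, the argument is just monotonicity: \alpha_{\text{new}}(\epsilon) \geq \alpha_{\text{old}}(\epsilon) for all \epsilon, and both statistics decrease as \epsilon increases, so the step sizes solving \alpha_{\text{new}}(\epsilon_{\text{new}}) = \delta and \alpha_{\text{old}}(\epsilon_{\text{old}}) = \delta for the same adapt_delta target \delta satisfy \epsilon_{\text{new}} \geq \epsilon_{\text{old}}.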

ok, I like this more nuanced stance.
