Request for Volunteers to Test Adaptation Tweak

I run a lot of models where I’ll get a few divergences in 4000 post-warmup iterations that can be removed by reducing step size. Is that still problematic? If so, we need to change our warning messages which recommend doing just that!

2 Likes

If reducing the step size by increasing the adapt_delta target removes the divergences then any bias in the MCMC estimators should be negligible, so the advice still stands. The problem is that if you have to decrease the step size too much then you end up significantly increasing the computational burden; it’s a brute-force kludge that we don’t want people to get in the habit of resorting to all the time.
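For reference, this is the knob in question as exposed in the interfaces; a minimal sketch with rstan, where model.stan and stan_data are placeholders for whatever model is producing the divergences:

```r
library(rstan)

# Placeholder model and data; substitute whatever is producing divergences.
fit <- stan(
  file = "model.stan",
  data = stan_data,
  chains = 4,
  iter = 2000,
  # Raising adapt_delta above its 0.8 default targets a higher acceptance
  # statistic, which forces a smaller adapted step size at the cost of
  # more leapfrog steps per iteration.
  control = list(adapt_delta = 0.99)
)
```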

Almost always those residual divergences indicate poor identifiability in the observational model that propagates to the posterior because of priors that are too diffuse (this happens a lot in hierarchical models – my hypothesis is that rstanarm and brms have to resort to such high default adapt_delta because their default GLM and hierarchical modeling priors are way too diffuse). It is much, much better to investigate that and figure out what prior information is necessary. This is definitely helped with better pedagogy (a big focus of my courses and case studies, which I’m writing as fast as I can) but ideally the user would take some responsibility. In other words a divergence should mean “there’s a modeling problem, I should investigate” and not “something weird happened and I want it to go away”.
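As a sketch of the alternative, here is the sort of thing I mean in brms (the formula, data, and specific prior are purely illustrative, not a universal recommendation):

```r
library(brms)

# Hypothetical hierarchical regression; y, x, group, and df are placeholders.
fit <- brm(
  y ~ x + (1 | group),
  data = df,
  # Replace a very diffuse default on the group-level scale with a
  # weakly informative prior motivated by domain expertise; this often
  # removes the funnel-like degeneracy that produced the divergences
  # instead of hiding it behind a tiny step size.
  prior = set_prior("normal(0, 1)", class = "sd"),
  control = list(adapt_delta = 0.8)
)
```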

3 Likes

This is probably true.

1 Like

Oh yeah, the follow-up from the meeting was that nobody was bothered by it as long as the interface stayed the same (and of course everyone appreciates a performance bump).

I think Bob’s question is the last thing that needs to be checked and then this is good. So the question is: does upping adapt_delta still get rid of divergences in a similar way to how it did previously?

Like some people did get mileage out of adapt_delta = 0.99 and 0.95 or whatever. I guess that means finding a model where this helped previously and showing that it still works. Presumably it will, but we may as well check.

@betanalpha I’ve heard this sort of thing from you many times and I never understand it. When you say the model needs better / more informative priors, do you mean ‘for the sake of the sampling geometry’, with no regard for the resulting inferences, or are you actually saying that somehow the sampling geometry needs to be fixed in order to give the inferences we wanted to make with the original model? Next time someone comes to me with a broken Bayesian model can I suggest maximum likelihood optimization as a fix, since it avoids all these worries about problematic tail geometry? ;)

The monotonic relationship between adapt_delta and the leapfrog integrator step size is unchanged by this update, and so increasing adapt_delta will by construction reduce the step size and potentially the number of divergences (i.e. if it reduced divergences before, it will reduce them now). Remember that conditioned on the same step size nothing changes here – the adapted step size will just be a bit larger.

If you want to verify this empirically then please go ahead, but given the upcoming 2.21 release I will start pushing hard if this can’t be accomplished by next week.

Any changes to the algorithm will change the exact results of a model fit. The recent addition of more checks changed the results seen by the interfaces, and this will too. The changes will just be more visible in the presence of a problematic model. Because the qualitative user experience doesn’t change, however, the authority remains on the algorithm side.

Divergences are almost always caused by degeneracies in the posterior density function, and typically nonlinear ones. For example the infamous funnel geometry is induced by the conditional distribution \mathcal{N}(\theta | 0, \tau) where limited observations of \theta are unable to identify both \theta and \tau marginally.

Because we are interested in Bayesian modeling we have to explore all of these degeneracies to get a faithful representation of our posterior distribution. Divergences tell us that HMC has been unable to achieve that exploration, and hence that our samples are giving an unfaithful representation of the posterior.
You can try to improve the sampler to achieve the desired exploration, or you can improve the model by incorporating more domain expertise in the prior model or even collecting more data.

Ultimately these degeneracies indicate that the data and current prior model are insufficient to inform the parameters – in other words they tell us when we have poor experimental design. This is why they are so awesome. When we’re not learning what we want to learn it is much more useful to go back and reinforce the experimental design with more prior information or improve the experimental design with additional observations or even auxiliary observations.
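For concreteness, here is a minimal sketch of the funnel degeneracy described above, written as a Stan program embedded in an R string (purely illustrative, with no data at all so the degeneracy is as stark as possible):

```r
library(rstan)

# Centered parameterization of a funnel: with no observations the
# conditional density N(theta | 0, tau) cannot inform theta and tau
# separately, and the neck of the funnel near tau = 0 triggers divergences.
funnel_code <- "
parameters {
  real theta;
  real<lower=0> tau;
}
model {
  tau ~ normal(0, 3);      // half-normal prior on the scale
  theta ~ normal(0, tau);  // the funnel-inducing conditional
}
"

fit <- stan(model_code = funnel_code, chains = 4, iter = 2000)
check_hmc_diagnostics(fit)
```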

2 Likes

Right, the solutions of either ‘reparameterize to improve geometry’, ‘get extra data so we don’t have to explore crazy geometry’ or ‘improve sampler to cope with geometry’ all seem very reasonable. I’m also comfortable with the idea of iterative model building and treating priors as something akin to soft model structure, I’m just uneasy with the idea of updating them based on the difficulties encountered using a specific algorithm – it seems quite conceivable that algorithmic improvements occur that resolve the original difficulties…

So you run a much more expensive sampler and better resolve those pathological corners…to reveal that your parameters weren’t well identified and you can’t get the inferences you wanted anyways? Sampling doesn’t happen in a vacuum. Divergences almost always indicate a weak identifiability problem which strongly hints at downstream inferential problems. One can try to brute force the computation and deal with the identifiability issues later down the road, but why wait?

Of course one doesn’t want to play with the priors just to get rid of the divergences. One has to follow the divergences to confirm an identifiability problem consistent with one’s domain expertise of the experiment before trying to increase the domain expertise embodied in the prior model. I have yet, however, to see a case where divergences don’t indicate a real identifiability problem, so my prior is strongly concentrated on that being the most likely possibility in a given fit.
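As a concrete illustration of what “following the divergences” looks like in practice, a minimal sketch with bayesplot (fit is assumed to be an existing stanfit object, and mu and tau are placeholders for whichever parameters you suspect of being weakly identified):

```r
library(rstan)
library(bayesplot)

# Extract the NUTS diagnostics, including the divergence indicator,
# from an already-fitted stanfit object.
np <- nuts_params(fit)

# Scatter plot of the suspect parameters with divergent transitions
# highlighted; divergences piling up in one region (e.g. small tau)
# point at the degeneracy to investigate.
mcmc_scatter(
  as.array(fit),
  pars = c("mu", "tau"),
  transformations = list(tau = "log"),
  np = np
)
```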

2 Likes

I don’t think we should reduce testing burden because a release deadline is coming up. Like they say on the subway, there’s another train behind this one.

4 Likes

In this case there isn’t another train on the horizon, but there is a reason to get things in order for the next release (an actual naming of the sampler).

Perhaps more relevant to process, the concern being raised is verified to not be an issue by inspection: by construction the new adaptation target is always lower bounded by the old adaptation target, and hence will always yield a higher adapted step size (modulo noise during warmup). The rest of the sampler behavior depends only on that step size, so the consequences after that point are straightforward.

OK, I like this more nuanced stance.

1 Like

Can you remind me of your explanation of how a funnel, in the case of a non-centered parameterization and a strong likelihood, indicates an identifiability problem?

1 Like

Kind of secondary, but do you have scripts you use to look at this stuff and if so can you slam them over here? It would be nice to have stuff to goof around with the algorithms and see how things change. Those sorts of things would also be nice for deeper inspections when running the performance tests.

1 Like

In the diffuse likelihood case you can’t identify the absolute individual parameters, theta, and the population, (mu, tau), at the same time, but you can identify the relative individual deviations, (theta - mu) / tau, from the population.

In the strong likelihood case you can identify the absolute individual parameters from the population, but the individual deviations become poorly identified relative to the population.

These considerations are per individual — if the likelihood concentrations are unbalanced then you will likely have to center/non-center each individual differently.

There’s also the poor identifiability of mu and tau themselves in the case of diffuse likelihoods or small numbers of groups.
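To make the centered/non-centered distinction concrete, here is a sketch of the two parameterizations for a simple normal hierarchical model, embedded as Stan code in R strings (an assumed eight-schools-style setup, purely illustrative):

```r
# Centered: theta is the absolute individual parameter.
# Favoured when each individual's likelihood is strong.
centered_code <- "
data {
  int<lower=1> N;
  vector[N] y;
  vector<lower=0>[N] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[N] theta;
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}
"

# Non-centered: theta_raw is the relative deviation (theta - mu) / tau.
# Favoured when the individual likelihoods are diffuse.
noncentered_code <- "
data {
  int<lower=1> N;
  vector[N] y;
  vector<lower=0>[N] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[N] theta_raw;
}
transformed parameters {
  vector[N] theta = mu + tau * theta_raw;
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 5);
  theta_raw ~ normal(0, 1);
  y ~ normal(theta, sigma);
}
"
```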

I’m traveling this week but if you remind me next week then I’ll throw my stuff up on a github repo. It’s not particularly robust as it dumps a folder of scripts into a CmdStan directory and has a few hardcoded paths. @bbbales2 also has some scripts that he uses.

1 Like

Whoops, sorry I let this drop, but let’s still run the tests.

These tests verify important behavior. What I want to see is a model that has divergences that go away with a larger adapt_delta in both the old and new versions. Maybe a thing to watch out for is whether the types of changes we’ve been recommending still make sense (does the new adaptation require adapt_delta = 0.999 whereas the old one required only adapt_delta = 0.9? Or is it about the same?).

I think this model base.data.R (682.7 KB) base.stan (8.4 KB) will spit out maybe 50 (± 25) divergences. I’m not totally sure though and it takes about an hour to run. Presumably there are other simpler models that would work here. Maybe an 8-schools model where the measurement errors on each school are smaller.
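A minimal sketch of the check being asked for, using rstan (chain and iteration settings are arbitrary, and this assumes base.data.R is an R dump file; the same script would be run against builds with the old and new adaptation to compare):

```r
library(rstan)

# Count post-warmup divergences for a given adapt_delta target using
# the model and data attached above.
count_divergences <- function(delta) {
  fit <- stan(
    file = "base.stan",
    data = read_rdump("base.data.R"),
    chains = 4,
    iter = 2000,
    control = list(adapt_delta = delta)
  )
  sp <- get_sampler_params(fit, inc_warmup = FALSE)
  sum(sapply(sp, function(x) sum(x[, "divergent__"])))
}

# Compare divergence counts across adapt_delta targets; repeat under
# both the old and new adaptation code.
sapply(c(0.8, 0.9, 0.95, 0.99), count_divergences)
```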

Let me emphasize one more time that this code has no influence on the relationship between the integrator step size and the behavior of the integrator, including whether or not divergences arise and, if so, how many, so that is not within the purview of this PR. The only relevant question is whether or not increasing the adaptation target, adapt_delta, does indeed decrease the step size monotonically (modulo noise in the adaptation).

If the structure of the new adaptation target (just a reweighting of the current adaptation target) isn’t sufficiently convincing then one can verify it empirically. Here is a small test using a 50-dimensional iid normal target density demonstrating the expected monotonic relationship:

| Target | Adapted stepsize |
| --- | --- |
| 0.8 | 0.680359 |
| 0.825 | 0.525104 |
| 0.85 | 0.594495 |
| 0.875 | 0.459662 |
| 0.9 | 0.452326 |
| 0.925 | 0.446792 |
| 0.95 | 0.323225 |
| 0.975 | 0.266369 |
| 0.99 | 0.1841 |
| 0.999 | 0.106148 |
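For anyone who wants to reproduce this kind of table, a sketch along these lines with rstan should suffice (the exact script used above isn’t posted, so this is an assumed analogue using the same 50-dimensional iid normal target):

```r
library(rstan)

# 50-dimensional iid standard normal target density.
iid_code <- "
parameters {
  vector[50] x;
}
model {
  x ~ normal(0, 1);
}
"
model <- stan_model(model_code = iid_code)

# Fit at a given adapt_delta target and read off the adapted step size;
# after warmup the step size is held fixed at its adapted value.
adapted_stepsize <- function(delta) {
  fit <- sampling(
    model, chains = 1, iter = 2000, refresh = 0,
    control = list(adapt_delta = delta)
  )
  get_sampler_params(fit, inc_warmup = FALSE)[[1]][1, "stepsize__"]
}

targets <- c(0.8, 0.825, 0.85, 0.875, 0.9, 0.925, 0.95, 0.975, 0.99, 0.999)
data.frame(target = targets, stepsize = sapply(targets, adapted_stepsize))
```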

I think if you would add a third column showing the adapted stepsize using the old code, that would answer Ben’s question.

1 Like

The behavior of the adapted stepsize using the old code is well-established – it too decreases monotonically with increased adapt delta. Testing burden that focuses on existing behavior is an indication that something is going wrong with the review process.

The relationship between the two has also been well established in the extensive testing already performed – the adapted stepsize using the new code is always higher than using the old code.

| adapt_delta | Develop adapted stepsize |
| --- | --- |
| 0.8 | 0.550541 |
| 0.825 | 0.457603 |
| 0.85 | 0.466621 |
| 0.875 | 0.530778 |
| 0.9 | 0.458668 |
| 0.925 | 0.381369 |
| 0.95 | 0.371774 |
| 0.975 | 0.271149 |
| 0.99 | 0.195458 |
| 0.999 | 0.090217 |
| 0.9999 | 0.080215 |