Automagically increase `adapt_delta` until all divergences are eliminated, what could go wrong?

So, I thought my above question was stupid, but then I found a comment from Andrew suggesting otherwise.

So, in pseudo-pseudo code (which is actually real code), I do this:

    import numpy as np

    # Loop while any divergent transitions remain.
    while fit.n_divergences > 0:
        print(f'We have {fit.n_divergences} divergences. '
              f'Increasing adapt_delta from {fit.adapt_delta}.')
        # sqrt pushes adapt_delta towards 1: 0.8 -> 0.894 -> 0.946 -> ...
        fit = fit.resample(adapt_delta=np.sqrt(fit.adapt_delta))

where `resample` takes the original fit, uses the last draws as initialization, recomputes the metric from the previous samples, and then restarts with exactly the same arguments except a higher `adapt_delta`.

My question is, what could go wrong? (Except the loop never terminating).

Edit: Forgot to mention something: here, `resample` only runs the final adaptation window to find a step size and then retries sampling.
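
Roughly, with CmdStanPy such a `resample` might look like the sketch below. To be clear, this is an illustration, not my implementation verbatim: the `adapt_*` arguments and the 50-iteration terminal window are just one way to run only the step-size adaptation, and I reuse the previously adapted metric rather than recomputing it from the draws:

    from cmdstanpy import CmdStanModel, CmdStanMCMC

    def resample(model: CmdStanModel, fit: CmdStanMCMC,
                 data: dict, adapt_delta: float) -> CmdStanMCMC:
        # Initialize at the last draw of the previous fit.
        inits = {name: draws[-1]
                 for name, draws in fit.stan_variables().items()}
        return model.sample(
            data=data,
            inits=inits,
            # Reuse the previously adapted inverse metric; recomputing it
            # from the draws would require the unconstrained scale.
            metric={'inv_metric': fit.metric[0].tolist()},
            adapt_delta=adapt_delta,
            iter_warmup=50,         # warmup consists only of...
            adapt_init_phase=0,     # ...no initial fast interval,
            adapt_metric_window=0,  # ...no metric adaptation,
            adapt_step_size=50,     # ...just the final step-size window.
        )

    def n_divergences(fit: CmdStanMCMC) -> int:
        # Count divergent transitions across all post-warmup draws.
        return int(fit.method_variables()['divergent__'].sum())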

1 Like

I think the concern is that usually if adapt_delta is set very high and still isn’t working then there is probably a large issue with the model… Generally I try to figure out where the degeneracies are happening and reparameterize the model if possible. If you just keep increasing adapt_delta, you might end up ignoring important parts of your model which aren’t working as expected. Just my 2 cents though.

7 Likes

Just chiming in to support @be_green’s response.

Current opinion from folks I respect here seems to be that increasing adapt_delta is a last resort for fixing divergences. Divergences usually signal that the model has a fundamental structural pathology, and addressing it will yield not only the elimination of divergences but also faster and more accurate inference.

Note that after Andrew’s comment you linked, there was discussion about changing the wording of the diagnostics output to something more aligned with my statement above.

4 Likes

I think this is a great question, and it’s also related to the zeitgeist here that models requiring extremely deep treedepths ought to be reparameterized, even if the computational resources are available to fit the model as-is.

My understanding is that as models venture into difficult territory that requires high adapt_delta and long integration times, we should begin to get concerned that the complicated posterior geometry hides important features that we miss in our exploration. Thus we worry about passing diagnostics without having properly explored the posterior.

Note, however, that it seems to be generally accepted that we can (and often must) safely use high treedepth and adapt_delta for some important classes of models, including those with horseshoe priors.

4 Likes

Ah, I wasn’t aware that high treedepths also signal necessity of reparameterization, thanks!

Thinking about automated workflows: the new stanc3 --info output lists the functions the model uses; an idea would be to use this to discern whether we’re dealing with that class of model before automatically modifying those sampler parameters. Is the horseshoe a standard _lpdf function these days, or are people still defining it by hand?
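
Something like the following could serve as that check (a sketch: it assumes the stanc3 binary is on the PATH as `stanc` and that the `--info` JSON exposes `functions` and `distributions` lists; key names may differ between versions):

    import json
    import subprocess

    def model_uses(stan_file, needle):
        # Ask stanc3 for model info and search the reported function
        # and distribution names for a pattern such as 'horseshoe'.
        result = subprocess.run(['stanc', '--info', stan_file],
                                capture_output=True, text=True, check=True)
        info = json.loads(result.stdout)
        used = list(info.get('functions', [])) + list(info.get('distributions', []))
        return any(needle in name for name in used)

    # Hypothetical gate before auto-tuning sampler parameters:
    if model_uses('model.stan', 'horseshoe'):
        print('Horseshoe-like model: high adapt_delta/treedepth may be expected.')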

1 Like

Thank you all for your contributions :)

Yes, that’s what I think as well. I guess at some threshold of adapt_delta one will have to give up hope D:

Right, the warning is no longer

    There were 23 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help.

but apparently

    Examine the pairs() plot to diagnose sampling problems

I didn’t actually notice. I think this makes sense.

This is also good to know, thanks!

(I think) It may also just signal insufficient adaptation, but this may in turn signal the need for a reparameterization D:

1 Like

Do you know if this would be checkable during adaptation itself? I’m working on a non-standard workflow with checks-during-sampling, which I could easily extend to checks-during-warmup-and-sampling and have warmup continue until this check passes.

Hmm, in my experience the best indicator that your previous adaptation was insufficient is that further adaptation reduces the average n_leapfrog__. So I’m not sure how helpful this is D:

Does this suggest an adaptation termination condition whereby the history of n_leapfrog is checked for going flat?

Hmm, I’m not really the person with the expertise to ask, but to a first approximation I would say yes. It probably depends on a thousand other factors, though.

1 Like

The problem is that the metric adaptation is windowed, so if you intend to check the distribution of treedepths for stationarity, you end up doing an entire extra window. And since the window size keeps doubling, you end up with a huge amount of unnecessary warmup computation (though, to be fair, if it goes flat fast enough you might instead save a huge amount of computation).

If you are going to try to check this adaptively, it seems like it would be more direct to check the metric itself for going flat, rather than indirectly assessing this by looking at the distribution of treedepths.

You also need to be vigilant about the possibility of false flatness due to bad updates to the metric, which should tend to be more severe earlier in warmup.
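
For concreteness, a minimal sketch of such a check on a diagonal inverse metric; the relative tolerance and the minimum-warmup guard (against exactly that early false flatness) are placeholder choices:

    import numpy as np

    def metric_flat(prev_inv_metric, new_inv_metric, iteration,
                    min_warmup=1000, rel_tol=0.1):
        # Guard against false flatness from bad metric updates early
        # in warmup by requiring a minimum number of iterations first.
        if iteration < min_warmup:
            return False
        # Flat = no per-parameter variance estimate changed by more
        # than rel_tol between consecutive adaptation windows.
        rel_change = np.abs(new_inv_metric - prev_inv_metric) / np.abs(prev_inv_metric)
        return float(np.max(rel_change)) < rel_tol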

2 Likes

Yes, my thought would be to have a minimum of some sort (possibly the current standard of 1e3 iterations for the entire warmup) before checking.

1 Like

Hi all. Just to clarify: much (if not most) of the time that there are these HMC problems, the solution is to just change the model, perhaps by adding stronger priors or by fixing a bug or a misunderstanding somewhere. Reparameterization can be fine too, but I don’t think “reparameterization” is a great general recommendation, because it implies that the model remains unchanged and is just reparameterized. I’d say the first step is to figure out what’s happening, and the second step is typically to fix a problem with the model or to add some soft constraints (prior information) to get both computational and inferential stability.

We definitely don’t want to recommend that people automatically increase these tuning parameters, as this can have the effect of making bad models take longer to run. We want to fail fast where possible.

5 Likes

Wait, I just accepted this at first, but why would the metric (spuriously) stay flat?

Hm, this is what I feared. But in all seriousness, I do wonder how to decide when it is appropriate to increase adapt_delta… BRB, I’m just quickly gonna find out.

For some context:

I had a model which fit perfectly with the default adapt_delta=0.8. Then I added the modelling of some effect in the data, and suddenly I had some divergences during sampling, but with Rhat < 1.01 and E-BFMI > 0.99, which was new terrain for me.

Then I introduced the above loop, and the model fit fine with an adapt_delta of around 0.9.

Then, however, I added an additional term, and suddenly the divergences no longer disappeared.

The good thing is that I do usually fail as fast as possible, so the feedback that something was wrong was quite immediate. But I still wonder: was the above loop even appropriate in the first place?

Thank you all!

In the long run it won’t, but the updates are stochastic, and more so in the early phases when the windows are short.

1 Like

The loop never terminating is an immediate obstruction for me.

There are zero guarantees that using a less aggressive adaptation will reduce the number of divergences. In many cases using a smaller step size can actually increase the number of divergences, as the finer resolution of the exploration can better see just how nasty some part of the target distribution is.

On a more practical note, once you push past adaptation targets of 0.99 or so there usually isn’t enough information in the early phases of exploration for the adaptation to be able to reach very different values anyways. In other words, the adaptation is dominated by the variation in the Markov chain realizations unless really long windows are used. To push past that one would have to configure the step size directly, but that still wouldn’t avoid the lack of guarantees mentioned above.

Even in the best cases, forcing a much less aggressive adaptation will typically result in much longer, and more expensive, trajectories. Quickly enough the max treedepth will be saturated and you’ll have to intervene to change that anyways, limiting the automation possibilities a bit.
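
For reference, configuring the step size directly is possible in, e.g., CmdStanPy; a sketch, where `model`, `data`, and `inv_metric` are assumed to come from an earlier fit and the step size value is purely illustrative:

    # Disable warmup adaptation and supply step size and metric by hand.
    fit = model.sample(
        data=data,
        adapt_engaged=False,                         # no adaptation at all
        step_size=0.005,                             # illustrative value
        metric={'inv_metric': inv_metric.tolist()},  # previously adapted metric
    )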

3 Likes

Which is a good thing though, isn’t it? Not for exiting the loop, but for the information it provides.

It’s very much a good thing if you can access that information. Tuning adapt_delta by hand and investigating the results (not just the change in the number of divergent transitions but also how the geometry of those divergent transitions changes) is a powerful strategy.

But if you’re attempting to automatically increase adapt_delta until the divergences are eliminated, and you end up stuck in a non-terminating loop because the divergences don’t actually decrease with increasing adapt_delta, then you’ll never have access to that information. Even if you’re dumping the fit info after each iteration to a place where it can be recovered while the loop is still running, you’ll end up wasting a ton of compute on the later iterations, where the step size isn’t changing much and you’re just recovering equivalently pathological fits.

On the other hand working through the iterations by hand – increasing adapt_delta and then thoroughly investigating before trying to increase adapt_delta again – is straightforward to implement and avoids any computational waste.

1 Like