Adapt_delta

I ran a Stan program and got this warning message:

There were 23 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup

I went to the webpage and read the description, which was great. And then I re-ran my code setting adapt_delta:

fit <- stan("chickens.stan", data=data, control=list(adapt_delta=0.9))

And it worked fine.

Here’s my question. If our first recommendation is to increase adapt_delta, why not do this automatically? As a user, I’d find that convenient.
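For what it’s worth, it is easy to script this yourself. Here is a minimal sketch of such an automatic retry, assuming rstan; refit_until_clean is a made-up name, while get_num_divergent() is rstan’s divergence diagnostic:

library(rstan)

# Hypothetical wrapper (not part of rstan): refit with a higher
# adapt_delta until no divergences remain, or give up after max_tries.
refit_until_clean <- function(file, data, adapt_delta = 0.8, max_tries = 3) {
  for (i in seq_len(max_tries)) {
    fit <- stan(file, data = data, control = list(adapt_delta = adapt_delta))
    if (get_num_divergent(fit) == 0) return(fit)
    # Push adapt_delta toward 1, e.g. 0.8 -> 0.9 -> 0.95
    adapt_delta <- 1 - (1 - adapt_delta) / 2
  }
  warning("divergences remain at adapt_delta = ", adapt_delta)
  fit
}

fit <- refit_until_clean("chickens.stan", data = data)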


Increasing adapt_delta means the sampler takes smaller steps, so it will also take longer to run.

I’m still a beginner with Stan too, but as far as I understand, adapt_delta is the target average acceptance probability that Stan tunes for during warmup. The probability of accepting a draw is related to the step size, i.e. how far the sampler “jumps” on each draw. To increase the probability of acceptance, the sampler needs to decrease the step size and take smaller, more careful steps.
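If you want to see this in your own fit, rstan records the adapted step size in the sampler parameters. A minimal sketch, assuming the same chickens.stan model; stepsize__ is the column name rstan uses:

library(rstan)

# Compare the adapted step size at two adapt_delta targets.
# A higher target acceptance rate should yield a smaller stepsize__.
fit_low  <- stan("chickens.stan", data = data, control = list(adapt_delta = 0.8))
fit_high <- stan("chickens.stan", data = data, control = list(adapt_delta = 0.95))

step_size <- function(fit) {
  sp <- get_sampler_params(fit, inc_warmup = FALSE)
  mean(sapply(sp, function(chain) mean(chain[, "stepsize__"])))
}

step_size(fit_low)   # larger steps
step_size(fit_high)  # smaller steps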

If you imagine the posterior (or the typical set) as a tall hill in the middle of a flat plain, then what a Monte Carlo sampler does is try to map out the shape of the hill by taking random steps around it and measuring the height at each step. It always takes a step if the elevation at the next location is higher, and otherwise takes it with probability equal to the ratio (next location elevation / current location elevation). If the steps the sampler takes are very big, it will often miss or “overshoot” the area of higher elevation, and so its acceptance rate will be lower. If the sampler takes very small steps, its acceptance rate will be high, but it will be very slow and take a long time to explore the hill.
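That trade-off is easy to demonstrate with a toy random-walk Metropolis sampler in plain R, using a standard normal density as the “hill” (this is the simple accept/reject rule described above, not Stan’s HMC):

set.seed(1)

# Random-walk Metropolis on a standard normal "hill".
# Uphill moves are always accepted; downhill moves are accepted with
# probability density(proposal) / density(current).
metropolis <- function(step, n = 1e4) {
  x <- numeric(n); accepted <- 0
  for (i in 2:n) {
    proposal <- x[i - 1] + rnorm(1, sd = step)
    if (runif(1) < dnorm(proposal) / dnorm(x[i - 1])) {
      x[i] <- proposal; accepted <- accepted + 1
    } else {
      x[i] <- x[i - 1]
    }
  }
  accepted / n  # acceptance rate
}

metropolis(step = 5)    # big steps: low acceptance rate
metropolis(step = 0.1)  # tiny steps: high acceptance, slow exploration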


That matches my understanding: it works like a learning rate in deep learning.

To continue this thread,

  1. If one’s ESS is low, instead of increasing iter, one could decrease adapt_delta as a means of reducing the autocorrelation between successive draws (a larger step size means accepted proposals land farther apart). This comes at the cost of rejecting more proposals, so the chain moves less often.
  2. In addition, there is a direct computational advantage in reducing adapt_delta: with a larger step size, generating each draw requires fewer leapfrog steps, so the calculation time per iteration goes down.
  3. Further, if one were naively interested in optimizing adapt_delta for a given problem, theoretically one would want to maximize the rate at which ESS increases with wall time, given a particular set of computational resources (see the sketch below).
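
As a rough sketch of 3), one could time fits at several adapt_delta values and compare ESS per second. This assumes rstan and a parameter named theta in the model; both names are placeholders:

library(rstan)

# Rough sketch: compare ESS per second across adapt_delta settings.
# "theta" is a stand-in for whichever parameter you care about.
ess_per_second <- function(adapt_delta) {
  fit <- stan("chickens.stan", data = data,
              control = list(adapt_delta = adapt_delta))
  ess  <- summary(fit)$summary["theta", "n_eff"]
  time <- sum(get_elapsed_time(fit))  # warmup + sampling, all chains
  ess / time
}

sapply(c(0.7, 0.8, 0.9, 0.99), ess_per_second)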

Is my understanding in 1–3 correct?