Issue with dual averaging

I was debugging some models that seemed to require a fairly high treedepth and came across two slightly annoying features of the dual averaging algorithm. I’ve included the relevant formulas for reference.

During the adaptation phase, after each adaptation window the starting point for the step size is set to 10 times the average step size (\bar x) from the previous window. I see how this makes sense: after the mass matrix is updated, it is reasonable to expect that a slightly larger step size would be appropriate. However, it causes some weird behavior. This step size is often way too large and results in an acceptance probability \alpha near zero. This causes a large increase in H, and a corresponding drop in x. In my models, it takes three or four steps for the step size to come back down to a reasonable value where \alpha \approx \delta and H begins to decrease. However, even once H is decreasing, the step size continues to fall due to the increased influence of the H term relative to \mu, meaning it takes a while for the step size to stabilize.

The other issue I am seeing is a “sawtooth”-like behavior in the step size when using large values of \delta. This is because the range of the updates to \bar H_t is asymmetric, i.e. \left[\frac{\delta-1}{t+t_0},\frac{\delta}{t+t_0}\right]. After an iteration with a particularly low value of \alpha, the step size can drop a lot, but it takes many iterations with \alpha near 1 to bring it back up again.
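To put numbers on the asymmetry: with \delta = 0.99 the per-iteration term \delta - \alpha lies in

\left[\delta - 1, \delta\right] = \left[-0.01, 0.99\right],

so a single iteration with \alpha \approx 0 can move \bar H_t up by as much as 99 iterations with \alpha = 1 can move it down.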

How was the multiplier of 10 between warmup windows chosen? Are there any models that it is particularly helpful for? Is there any way to make this a tunable parameter in Stan, like the other adaptation parameters?

Is there a reason for the asymmetry in the updates to H? I was wondering whether the update could instead be done on \mbox{logit}(\delta) - \mbox{logit}(\alpha) or on \min(\delta-\alpha, 1-\delta).

I wanted to post here and see whether people think this is a good idea, or whether it has been tried already, before I start making changes to the adaptation algorithm.


The relevant parts from Hoffman & Gelman (2014):

x_{t+1} \leftarrow \mu - \frac{\sqrt{t}}{\gamma}\bar H_t

\bar H_t \leftarrow \left(1-\frac{1}{t+t_0}\right)\bar H_{t-1}+\frac{1}{t+t_0}(\delta-\alpha_t)
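To make these dynamics concrete, here is a minimal, self-contained C++ simulation of the recursion. This is a toy sketch, not Stan’s source: the acceptance curve \alpha(\epsilon) and the optimal step size eps_opt below are made up.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Toy simulation of the dual-averaging recursion above (Hoffman & Gelman 2014).
// Not Stan's source: alpha(eps) is a made-up stand-in for a real sampler's
// acceptance statistic, with a hypothetical optimal step size eps_opt.
int main() {
  const double delta = 0.8;    // target acceptance probability (adapt_delta)
  const double gamma = 0.05;   // shrinkage toward mu (Stan default)
  const double t0 = 10.0;      // iteration offset (Stan default)
  const double kappa = 0.75;   // decay rate for the averaged iterate (Stan default)
  const double eps_opt = 0.1;  // hypothetical optimal step size

  double eps = 10.0 * eps_opt;             // mimic the 10x restart after a window
  const double mu = std::log(10.0 * eps);  // dual averaging shrinks log(eps) toward mu
  double h_bar = 0.0;                      // running average of (delta - alpha)
  double log_eps_bar = std::log(eps);      // averaged iterate reported at the end

  for (int t = 1; t <= 50; ++t) {
    // Crude acceptance model: alpha decays once eps overshoots eps_opt.
    const double alpha = std::min(1.0, eps_opt / eps);
    const double eta = 1.0 / (t + t0);
    h_bar = (1.0 - eta) * h_bar + eta * (delta - alpha);
    const double log_eps = mu - std::sqrt(static_cast<double>(t)) / gamma * h_bar;
    const double w = std::pow(static_cast<double>(t), -kappa);
    log_eps_bar = w * log_eps + (1.0 - w) * log_eps_bar;
    eps = std::exp(log_eps);
    std::printf("t=%2d  alpha=%.3f  h_bar=%+.4f  eps=%.5f\n", t, alpha, h_bar, eps);
  }
  std::printf("final averaged eps = %.5f\n", std::exp(log_eps_bar));
  return 0;
}
```

With these made-up numbers, the first few iterations have \alpha near zero, \bar H_t jumps, and the step size overshoots downward past the optimum before slowly recovering, which is the same pattern described above.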

I did a quick test replacing \delta-\alpha with \min(\delta-\alpha, 1-\delta) and the warmup converges MUCH more quickly than before.
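Concretely, the change amounts to clipping the statistic that enters the running average; a sketch in the notation of the toy simulation above (not the actual patch):

```cpp
// Original statistic entering the running average (range [delta - 1, delta]):
//   const double stat = delta - alpha;
// Clipped variant from the quick test (symmetric range [delta - 1, 1 - delta]),
// so a single alpha ~ 0 iteration can no longer dominate the average:
const double stat = std::min(delta - alpha, 1.0 - delta);
h_bar = (1.0 - eta) * h_bar + eta * stat;
```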

Prior to this change, if \alpha was close to zero in one of the steps during warmup, the step size would drop dramatically, and it would take many iterations, each hitting the maximum treedepth, for the step size to recover. This would happen dozens of times during warmup. With this change, the step size is walked back much more gracefully when \alpha is too low. As a result, I can cut the number of warmup iterations in half without any apparent adverse effects.


Probably the key is trying a wide range of models, some pathological; otherwise it’s shooting in the dark. I imagine that’s what was done to establish those numbers in the first place…

I would guess that the problem I outlined would not come up in non-pathological models, since \alpha will track fairly closely with \epsilon (the step size), so they will never see a very low \alpha during the adaptation phase. However, in pathological models there can be a lot of variance in \alpha, even for a fixed \epsilon.

In the forums, I’ve seen people discussing setting \delta as high as 0.99 for some models, and I imagine that in cases such as these, the warmup would be extremely inefficient.


I see what you mean, but my point is that this kind of change is to a core component, so it gets no traction without extensive testing. It is a cool demonstration of potential gains, and one of the weaknesses of the current set-up is that there is no extensive record of how the current choices were made.

So why not test it on Stan’s Jenkins test-models and publish the results?
How many models are there, 50? 100? I don’t see the point.
It would be good if Aaron @aaronjg could publish the code changes.

Agreed. Even if this worked out to be better in all cases, I imagine it wouldn’t make it in as a default, given how core this component is and since Stan has had this behavior from the very beginning.

Even though it seems to enable \alpha to converge to \delta more quickly, there may be cases where people are relying on the current behavior. For example, if someone sets \delta to 0.9 and doesn’t run the warmup until it has fully converged, they may end up with an \alpha of 0.95 because the step size kept dropping during warmup. Improving the convergence would yield larger step sizes and an \alpha closer to the \delta they set. However, the larger step sizes may not be appropriate for the model, and they could end up with divergent transitions. Obviously, this could be fixed by setting \delta to the appropriate value.

I’m going to do more testing, but if these results hold it would mean that I can shave 2 days off of a model that takes 7 days to run. I’ve spent a lot of effort trying to get this code to run faster, and this is by far the biggest improvement.

Before, I had to increase init_buffer from 75 to about 200 because I kept getting chains that would have step sizes that were way too small and would be far outside the typical set after the default number of iterations. Even with the large init_buffer, sometimes chains would still not have converged, but would usually be okay after a few adaptation windows that bumped up the stepsize by 10x. With this change, I’ve been able to go back to an init_buffer of 75, and shorten the rest of the warmup with apparently no adverse impact.

@andre.pfeuffer
I pushed the code to my git repo here:

I’m not sure how to run the Jenkins tests, though…


Amazing. It runs the model with 1000 iterations in less time than the original implementation with 100.

Which model is this? I didn’t expect speedups this great, but I suppose it is possible if the original implementation hit max treedepth of 10 for most of the iterations throughout, and the optimal stepsize results in a treedepth of 3-4.

I tested a horseshoe model and a hierarchical model, using rstan 2.17 against cmdstan 2.18 (patched). Some of the speed gain may also come from switching from rstan to cmdstan. The number of divergent transitions increased significantly (from a few to 15%), even with init_buffer set to 200, delta = 0.98, max treedepth = 15, and stepsize = 0.01.
Overall there is a speed gain, but the whole fit is less robust.

Thanks for testing this!

With a max treedepth of 15, those gains make more sense. The increase in divergences also makes sense, since the step size was likely much smaller before. What was the average acceptance probability after the warmup in the unpatched version? I’m guessing it is higher than 0.98. Could you set adapt_delta to that value with the patched version and try again? That should take care of the divergent transitions, but may take away some of the speed gains.

With the regularized horseshoe model, setting adapt_delta to 0.99 and (250/250) iterations, I get 100% divergent transitions; with (500/500) this goes down to 20%. I still kept the init_buffer at 200.
This makes me think I should try (2000/2000) iterations.
These horseshoe models are tough to fit. I even get divergences when not setting adapt_delta close to 0.99 with the standard implementation.

BTW: looking at your logit suggestion, what do you think about \log(\delta/\alpha)? This is symmetric and close to the original values. As a variant, one could use this with a symmetric exponential growth function which does not change the sign of the value.

There are three things going on in adaptation: converging from the starting values into the typical set, adjusting the diagonal elements of the mass matrix for each parameter, and choosing the right stepsize. Each of these is dependent on the others happening, which is why warmup can be so tricky. In the case of your model, I’m not sure if more warmup would help, or why this change has improved things so much. Would it be possible to share plots of the log probability, the acceptance probability, number of leapfrog steps and stepsize throughout the warmup?

I don’t think \log(\delta/\alpha) is symmetric. It has a minimum value of \log(\delta) and a maximum of \infty. The other concern I had with the logit was handling under- and overflow. In the CSV outputs, I sometimes see \alpha values of 0 and 1, which wouldn’t be handled nicely. It’s possible these haven’t actually over/underflowed in the C++ code and are just truncated in the output, but it’s something to think about. I imagine it’s also possible to get the log acceptance probability into the step size optimizer, which would fix the over/underflow issue, but it was a bit more work than I wanted to put in.
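If we did want the logit form, the over/underflow could be handled by clamping \alpha away from 0 and 1 before transforming. A rough sketch of what I mean (the guard value is arbitrary, and this is untested):

```cpp
#include <algorithm>
#include <cmath>

// Clamp p into (kGuard, 1 - kGuard) so the alpha = 0 and alpha = 1 values
// that show up in the CSV output stay finite through the logit.
double stable_logit(double p) {
  constexpr double kGuard = 1e-10;  // arbitrary guard value; tune as needed
  p = std::max(kGuard, std::min(1.0 - kGuard, p));
  return std::log(p) - std::log1p(-p);  // log(p / (1 - p)), stable for small p
}

// Candidate update statistic: logit(delta) - logit(alpha).
double logit_stat(double delta, double alpha) {
  return stable_logit(delta) - stable_logit(alpha);
}
```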

I’m not sure I follow the issue with the sign. \alpha and \delta are always in [0,1].

I thought the Extended Kalman Filter would be a good source for how to do this.

\log(a/b) = -\log(b/a). This is no different from the logit, where we have \log(p/(1-p)).

I’ll post the graphics later.

The issue is that \alpha \in (0,1), so \log(\delta/\alpha) \in (\log(\delta), \infty); after a very low \alpha, the step size would drop down to basically zero and take forever to recover.

Something like \min(\log(\delta/\alpha), -\log(\delta)) might mitigate this, though. I don’t have any sense of how this would change things vs. my current scheme.
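Working out the range (my arithmetic): since \alpha \in (0,1], the clipped statistic satisfies

\min\left(\log\frac{\delta}{\alpha}, -\log\delta\right) \in \left[\log\delta, -\log\delta\right],

with the lower end attained at \alpha = 1 and the cap -\log\delta reached once \alpha \le \delta^2, so clipping does make the range symmetric.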

Thanks!

Ok I understand now.

\mbox{logit}^{-1}(\mbox{logit}(\delta) - \mbox{logit}(\alpha)) - \frac{1}{2}

should make it symmetric then.
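(Quick check: at \alpha = \delta the update is \mbox{logit}^{-1}(0) - \frac{1}{2} = 0, and as \alpha \to 1 or \alpha \to 0 it tends to -\frac{1}{2} or +\frac{1}{2} respectively, so the updates are bounded symmetrically in (-\frac{1}{2}, \frac{1}{2}).)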
Is there no optimal gain rule for adjusting?

There’s no reason that the step size updates have to be in [0,1], so I’m not sure the logistic term is strictly necessary, but it’s probably a good idea to keep the step size from bouncing around too much.

I don’t think there has been a lot of work on this since the early versions of Stan, actually. My understanding is that this was done pretty early on and was a huge improvement over manually trying different stepsizes, and there hasn’t been much tweaking of it since.

Here are a couple of papers that discuss the step size adjustment algorithm:

http://jmlr.org/papers/volume15/hoffman14a/hoffman14a.pdf


Maybe my postings got off topic, so back on track: can the minimum create a discontinuity in the optimization? I do not feel fully comfortable about it.

(Meanwhile, I solved the problem with the divergent transitions. Using the regularized horseshoe, setting nu_local and nu_global to 1, e.g. inv_gamma(0.5, 0.5), results in extreme variance; setting both to 2.0, e.g. inv_gamma(1, 1), solved that problem.)

First of all, I appreciate this thread. Recently, I came across the step size warmup too and, at first, was a little surprised about the behavior described above.

And then I also tested something similar to \min(\delta - \alpha, 1-\delta). However, this specific term does not yield the correct results, as you can see quite easily with a univariate standard Gaussian posterior (calling stan::services::sample::hmc_nuts_unit_e_adapt).

If you compare to the “correct” implementation as it is now in Stan, you will see that the step size will be much higher and that the average acceptance probability will be too low. If \delta is chosen to be 0.8, then I typically get an average acceptance probability below 0.7 during the warmup. And for NUTS at least, this should automatically result in a much faster warmup, because the tree building will exit much earlier.

I also think, considering the underlying HMC, it shouldn’t be a problem that once a step size is too large and the acceptance probability is 0, the next step size choices are much smaller.

If \delta is chosen to be close to 1, however, then we face a drop of up to 0.99 but a recovery of only 0.01 per iteration.
I think, considering how the Kalman filter gain works, the variance is needed to adjust the growth of the adaptation rate. May I ask how we can call stan::services::sample::hmc_nuts_unit_e_adapt?