Issue with dual averaging

There is value in setting it to some multiple, especially early in the adaptation during the first few mass matrix updates, since the updated mass matrix often admits a higher optimal step size, and starting high allows that step size to be reached more quickly.

It’s quite possible that having this as a tunable parameter would solve much of the problem. Part of the reason I end up with such low step sizes, which take forever to recover from, is that the step size is set too high at the start, and the momentum in the dual-averaging update then drives it far too low.
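To make the mechanism concrete, here is a minimal R sketch of the dual-averaging update from Hoffman and Gelman (2014), with Stan’s default \gamma = 0.05, t_0 = 10, \kappa = 0.75 and \mu = \log(10\,\epsilon_0); the acceptance statistics fed in are invented purely to illustrate how a burst of early rejections drags the averaged statistic down and how slowly the step size climbs back.

```r
# Sketch of the dual-averaging step size update (Hoffman & Gelman 2014,
# Algorithm 5), with Stan's default gamma, t0, kappa. The acceptance
# statistics fed in below are made up, purely for illustration.
dual_average <- function(alpha, eps0, delta = 0.8,
                         gamma = 0.05, t0 = 10, kappa = 0.75) {
  mu <- log(10 * eps0)  # attractor: 10x the initial step size
  H_bar <- 0
  log_eps_bar <- 0
  log_eps <- numeric(length(alpha))
  for (t in seq_along(alpha)) {
    eta <- 1 / (t + t0)
    H_bar <- (1 - eta) * H_bar + eta * (delta - alpha[t])
    log_eps[t] <- mu - sqrt(t) / gamma * H_bar
    w <- t^(-kappa)
    log_eps_bar <- w * log_eps[t] + (1 - w) * log_eps_bar
  }
  list(eps     = exp(log_eps),      # step size used at each iteration
       eps_bar = exp(log_eps_bar))  # averaged step size reported at the end
}

# A burst of early rejections (alpha near 0) followed by decent acceptances:
alpha <- c(rep(0.05, 15), rep(0.85, 185))
out <- dual_average(alpha, eps0 = 1)
plot(out$eps, type = "l", log = "y", xlab = "iteration", ylab = "step size")
```

The averaged statistic H_bar carries the memory of those early rejections, which is the “momentum” I mean above.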

However, the other part of the problem happens in the initialization phase.

There are certainly pathologies in the far tails of the model. There is an order of magnitude difference in the log-probability between the start of the warmup and where the model settles by the time sampling starts. There are plenty of divergences during warmup, but they are gone in sampling. My understanding, from comments on the forum and from the fact that Stan does not report divergences during warmup, is that divergences during warmup do not present a significant problem for model fitting. In looking into modifications to the warmup algorithm, I’m trying to see if there is a way to prevent these pathologies from causing the algorithm to take a really long time to reach the typical set.

To illustrate, here are some example plots of the sampling parameters.

This is a well-behaved model
It took over a week to run, so I haven’t tested it on the new warmup configuration. Note how long it takes for the stepsize to recover in chain 3 after it initially drops down, and the “sawtooth” pattern that keeps bringing it lower after the initial drop.

Here is an example of an especially problematic model
Notice how the stepsize for chain 3 keeps decreasing until the first 10x multiplier. It still requires many more leapfrog steps throughout warmup and sampling. Dots are divergent transitions. So the model still has some issues, but having the warmup take so long really gets in the way of running the model to look at the diagnostics.

Here is the same model with the symmetric step size adjustment
In this case, chain 4 drops down, but recovers much more quickly. Chain 3 still seems to be requiring more leapfrog steps than the others, but overall it runs much faster than when using the default step size adjustment algorithm.


The change I proposed is just changing the objective function of the optimization from minimizing E[||\delta-\alpha||] to E[||\min(\delta-\alpha, 1-\delta)||].

Whether this is better in general, I don’t yet know. However, it produced results that were very compelling on my model, and I thought it was worth sharing. I’ve made a lot of tweaks to the model to eliminate pathologies, and tried various combinations of changing the three algorithm parameters as well as scaling the warmup windows, and this has by far been the single greatest improvement to the model fitting.

Interestingly, Andrieu and Thoms 2008, referenced in the Hoffman paper, also suggest using H_t = \mathbb I_{\delta-\alpha_t > 0} - \mathbb I_{\delta-\alpha_t \leq 0}, which I guess would be equivalent to minimizing E[|\delta-\alpha|]. This would require setting \gamma to a higher value to keep the updates on the same scale. I’ll look into this a bit more. In checking how well the stepsize was working, I had been looking at the average acceptance stat during sampling. It was usually much higher than \delta, but since the step size optimizer is minimizing ||\delta-\alpha||, I guess this is to be expected whenever \delta > 0.5.
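To make the comparison concrete, here are the statistics under discussion written as drop-in replacements for the (\delta - \alpha_t) term in the dual-averaging update sketched above. This is my reading of the proposals, not code taken from Stan.

```r
# Candidate statistics H_t for the dual-averaging update, replacing the
# default (delta - alpha). Sketches of the ideas discussed here, not Stan code.

# Default: push the step size down in proportion to how far alpha falls short.
h_default <- function(alpha, delta) delta - alpha

# Proposed symmetric variant: cap the downward push at the size of the
# largest possible upward push (1 - delta).
h_symmetric <- function(alpha, delta) min(delta - alpha, 1 - delta)

# Andrieu & Thoms-style sign statistic; gamma would need rescaling to keep
# the updates on the same scale.
h_sign <- function(alpha, delta) (delta - alpha > 0) - (delta - alpha <= 0)
```

With \delta = 0.8, a divergent iteration (\alpha \approx 0) contributes 0.8 to the average under the default statistic but only 0.2 under the symmetric one, which is presumably why the crashes recover so much faster.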

Let me be clear: I am not advocating changing Stan based on this test on one model. I merely noticed some remarkable improvement in my model and thought it was worth sharing and looking into more. Even if it does turn out that another objective function is better across a wide variety of simple and pathological models, I would still advocate including it as another option rather than switching the default. The existing code has been in Stan so long, and so many models depend on it, that changing the defaults in such a core part of the code would certainly break someone’s models somewhere when they upgraded.

4 Likes

I completely agree, and I found the stepsize initialization quite puzzling in the beginning. If a standard warmup is chosen, the initial stepsize chosen by the user has basically no influence on the algorithm. Additionally, what I saw is that the initialization radius for the parameters is quite small and, depending on the model, it can be almost certain that the parameters are initialized outside the typical set. However, the stepsize adaptation already starts from the first iteration of the first adaptation window, which makes these sawtooth patterns very likely to occur.

I also saw that, but I haven’t looked into it in more detail yet. I frequently observed that the average acceptance probability during sampling was > 0.9 whereas \delta = 0.8.

The problem with something like that is that we don’t have 50 distinct test models that we know work, and comparing sampler output is tricky.

This would be great. It’s obviously easy to change, or even to move closer to user control.

RStan shouldn’t be noticeably slower than CmdStan. If it is in your setup, we’d like to know.

Others have suggested starting with smaller treedepths during initial warmup. We haven’t played around a lot with adaptation since @betanalpha rebuilt it.

I would personally very much like to have a more continuous adaptation model than one based on epochs. Maybe something with a rolling average for covariance, continuous or at least frequent stepsize adaptation, and gradual increase of tree depth.

We just got a PR to add configuration for \mu, so that should be out in 2.19.

I find myself typically fitting the same model or same flavor of model multiple times as I tweak bits of it. In that case, it can be a big win to figure out the adaptation parameters.

To add on to this old conversation, one issue I noticed with dual averaging is that the final average (approximate) acceptance rate is almost always above the target adapt_delta.

When I look at the adaptation parameters (via get_sampler_params) it looks to me like the final window size is too small. Per the above discussion, the algorithm starts high, then drops initially to counter that, eventually coming back up to the target. But the final window (adapt_term_buffer) is only 50 by default, and it looks like that is not long enough to stabilize. For instance, here’s a simple 3-parameter multivariate model with 10 chains, warmup=1900 and iter=2000, using default settings in RStan (so adapt_delta=0.8). The final step sizes range from 0.018 to 0.024, and the acceptance ratio ranges from 0.88 to 0.94 across the 10 chains, all clearly much higher than the target of 0.8.
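For reference, the kind of check I mean looks roughly like this (a sketch, with fit standing in for the stanfit object above):

```r
# Sketch of the check described above: compare each chain's post-warmup
# step size and mean acceptance statistic against the adapt_delta target.
# `fit` stands in for the stanfit object from the model above.
library(rstan)

sp <- get_sampler_params(fit, inc_warmup = FALSE)
step_size <- sapply(sp, function(x) mean(x[, "stepsize__"]))    # constant after warmup
accept    <- sapply(sp, function(x) mean(x[, "accept_stat__"]))
round(cbind(step_size, accept), 3)  # accept should be near adapt_delta (0.8 here)
```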

It seems to me that the step sizes are all too small because of that short final adaptation window, seen in this plot:

If I increase adapt_term_buffer to 1000, obviously extreme, then the final step sizes range from 0.027 to 0.035, and the acceptance ratio ranges from 0.82 to 0.86. So bigger step sizes, which shifts the acceptance ratio closer to the target by about 0.06.

I’m not proposing a terminal buffer size of 1000. I just noticed this behavior, which is unexpected from a user’s perspective (specify a target of 0.8 and get >0.9), and it appears closely related to this discussion topic and to how the step size changes during the early part of an adaptation window. It may be worth considering this aspect of the adaptation when considering alternatives, or at a minimum increasing the default for adapt_term_buffer.

5 Likes

Thanks @monnahc, this is a useful and clear report.

It seems that the final window could be smaller if the algorithm did not multiply the step size by 10 right before it. Multiplying by 10 is an ad hoc safety measure in case the mass matrix update is unexpectedly big. Ping @bbbales2, as this is useful information for campfire, too.

1 Like

I’m no expert, but it seems to me that the biggest issue here is how Stan recalculates a “reasonable” step size at the beginning of each window. I can’t find this in the code, but in the original Hoffman paper this targets 0.5 after a single step. It looks to me like that is what Stan does now, based on the plots I showed with large initial step sizes. Despite being “reasonable”, the initial few iterations end in divergences, presumably because at alpha = 0.5 for a single step, the true alpha for a whole trajectory is much lower. Then it drastically lowers epsilon because of the divergences, and then slowly recovers from that. If I replot the defaults (adapt_delta=0.8) and color by divergence this is apparent.

This is of course exacerbated if the user sets an adapt_delta=0.999. It looks like this:

Admittedly this is extreme, but the adapted epsilon goes from “reasonable” to something like 5 orders of magnitude smaller in the course of 10-15 iterations, before starting its slow recovery toward adapt_delta. To be fair, this model has high correlations (as high as 0.997), so it is extreme. But even if I use metric=dense_e the same behavior persists:

I get the original intent of finding a reasonable step size, and why 0.5 may be a good idea at the very beginning because you’re not in the typical set. But full trajectories will typically have a lower acceptance rate than a single step, and we see this pattern of quick drop and slow recovery by the nature of the adaptation algorithm. And that quick drop leads to computationally expensive trajectories and slow warmup, exactly what it’s trying to avoid. I would, naively, argue that they are unreasonably large step sizes.

I would consider exploring two alternative options: (1) set the target in find_reasonable_epsilon to be adapt_delta for all calls, or (2) after the initial call (first iteration), initialize from the last step size. I like (2) because, while that might not be “reasonable” the first time Stan adapts after updating the mass matrix, it probably is the rest of the time. Presumably for my toy example the epsilon adaptation would be more stable and overall Stan would be faster. Playing with the Stan source is a bit out of my comfort zone, but both of these seem fairly easy to try.
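For reference, here is the heuristic from the Hoffman and Gelman paper (Algorithm 4), sketched in R on a toy standard-normal target with the single-step acceptance threshold exposed as an argument, so option (1) amounts to passing adapt_delta instead of 0.5. This follows the paper, not Stan’s C++, which I couldn’t locate.

```r
# Sketch of the "find reasonable epsilon" heuristic from Hoffman & Gelman
# (2014, Algorithm 4) on a toy target. This follows the paper, not Stan's
# C++. target = 0.5 reproduces the paper; option (1) above would amount to
# calling it with target = adapt_delta instead.
find_reasonable_epsilon <- function(theta, log_p, grad_log_p, target = 0.5) {
  eps <- 1
  r <- rnorm(length(theta))  # momentum draw, unit mass matrix
  leapfrog <- function(theta, r, eps) {
    r <- r + 0.5 * eps * grad_log_p(theta)
    theta <- theta + eps * r
    r <- r + 0.5 * eps * grad_log_p(theta)
    list(theta = theta, r = r)
  }
  # Log acceptance ratio of a single leapfrog step of size eps.
  log_ratio <- function(eps) {
    prop <- leapfrog(theta, r, eps)
    (log_p(prop$theta) - 0.5 * sum(prop$r^2)) - (log_p(theta) - 0.5 * sum(r^2))
  }
  # Double or halve eps until the single-step acceptance crosses `target`.
  a <- if (log_ratio(eps) > log(target)) 1 else -1
  while (a * log_ratio(eps) > a * log(target)) {
    eps <- 2^a * eps
  }
  eps
}

# Toy example: 10-dimensional standard normal.
log_p      <- function(theta) -0.5 * sum(theta^2)
grad_log_p <- function(theta) -theta
find_reasonable_epsilon(rnorm(10), log_p, grad_log_p)
```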

Curious to see if anyone has experimented with this or has any thoughts.

5 Likes

Right now Stan just starts each interval with a timestep 10x the last timestep used in the previous interval. It’s hardcoded and not something you can change easily (https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/nuts/adapt_diag_e_nuts.hpp#L38). These plots you made make a really compelling case that we shouldn’t be doing this.
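As I read it, the restart amounts to roughly the following; this is a paraphrase in R of the behavior described in this thread, not the actual C++ behind that link.

```r
# Paraphrase of the window-restart behavior described in this thread (not
# the actual C++): after each metric update the dual-averaging state is
# reset and its attractor mu is pointed at `multiplier` times the last
# adapted step size. The hardcoded 10 is what would become tunable.
restart_stepsize_adaptation <- function(eps_last, multiplier = 10) {
  list(
    mu          = log(multiplier * eps_last),
    H_bar       = 0,
    log_eps_bar = 0,
    counter     = 0
  )
}
```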

I always have sorta just ignored it cause 15 samples didn’t seem like that much, but especially with that last 50 timesteps of adaptation that logic seems really shortsighted.

I know that the metric can change quite a bit from warmup window to warmup window, but I agree it seems more sensible to use the timestep from the last window as the guess in the new one.

I assume the 10x comes from issues where the timestep was crashing to zero and then not recovering without this 10x boost.

1 Like

Thanks for taking the time to make these plots. @monnahc do you think you’d have time to try changing that 10x multiplier around and testing it?

Which Stan are you using? If you’re doing this with cmdstan I can help you get set up with a version that lets you change the 10x to whatever you want.

1 Like

Also @monnahc, if you want some other models to test, we have a bunch in the cmdstan performance testing suite.

The dual averaging target is constant through the entire adaptation; the behavior @monnahc is noticing here is due to how the dual averaging is reinitialized at the end of every window to a large multiple of the previous adapted step size. The motivation is that after each window the metric is updated, which should lead to a better-conditioned system that admits a larger step size.

The reinitialization to 10 times the previous step size is meant to aggressively probe large values, which typically works well for well-behaved models. Even if the guess is too aggressive, the dual averaging settles into an accurate value relatively quickly. As Cole notes in his empirical studies, however, the dual averaging can be much slower for less well-behaved models.

The easiest solution is to increase warmup and the terminal buffer by the same amount when encountering more challenging models to give the dual averaging more time to equilibrate. Alternatively the dual averaging parameters themselves can be tweaked to try to compensate for the structure of the objective function \bar{a} - a_{\text{opt}}(\epsilon).
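For concreteness, in RStan all of these knobs are exposed through the control argument; the sketch below just lists them with Stan’s defaults (model and data_list are placeholders).

```r
# The adaptation knobs mentioned above, as exposed in RStan's control list.
# Values shown are Stan's defaults; `model` and `data_list` are placeholders.
# For a challenging model, increase warmup and adapt_term_buffer together.
fit <- sampling(
  model, data = data_list,
  iter = 2000, warmup = 1000,
  control = list(
    adapt_delta       = 0.8,   # target acceptance statistic
    adapt_gamma       = 0.05,  # dual-averaging regularization scale
    adapt_kappa       = 0.75,  # dual-averaging relaxation exponent
    adapt_t0          = 10,    # dual-averaging iteration offset
    adapt_init_buffer = 75,    # initial fast interval (step size only)
    adapt_window      = 25,    # first slow (metric) window, then doubling
    adapt_term_buffer = 50     # final fast interval (step size only)
  )
)
```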

Changing the defaults requires more careful consideration of the span of well-posed and less well-posed models in common use, which of those we want to prioritize with the defaults, and what the costs are of step sizes that can be underestimated.

2 Likes

@bbbales2 I’m not sure what you’re getting at with this:

I assume the 10x comes from issues where the timestep was crashing to zero and then not recovering without this 10x boost.

If it’s too small, isn’t the concern less that alpha won’t recover to adapt_delta (it does within a few iterations) and more that the trajectories will be too long in the meantime? I played around with it a bit outside of Stan, and it’s not a silver bullet to just lower \mu; I think it would take some serious tinkering. I’ve never compiled Stan and typically use rstan, but if these kinds of experiments, added to what was initially discussed in this thread, are of interest, I could try to do it this summer (paternity leave looming at the moment).

@betanalpha I follow you with respect to tweaking the other parameters. I think the easiest thing on the user end is to increase the terminal buffer to give it more time to equilibrate. I also get that changing core Stan functionality is not something done lightly. I mainly wanted to flag this issue for when/if a redesign of the adaptation is in play.

I would also push back a bit on the idea that this only happens for less well-behaved models. If you take an IID standard normal and run it through Stan, you get the same issue. Here’s what the mean alpha over the sampling iterations looks like for 10 chains, across 5 dimensions and 2 adapt_delta values.

This issue is pronounced for adapt_delta=0.8, where all chains have step sizes biased low. Interestingly, that pattern doesn’t hold for 0.95.

That is only to say that I think this issue is more general than most folks expect, and that may motivate a proper investigation of the optimal adaptation scheme across a wide range of models.
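For anyone who wants to reproduce the IID-normal check, a sketch along these lines is all it takes (5 iid normal parameters, 10 chains, adapt_delta of 0.8 and 0.95):

```r
# Sketch of the IID-normal check described above: 5 iid normal parameters,
# 10 chains, two adapt_delta values; the mean accept_stat__ per chain is
# then compared against the target.
library(rstan)

code <- "parameters { vector[5] x; } model { x ~ normal(0, 1); }"
mod <- stan_model(model_code = code)

mean_accept <- function(adapt_delta) {
  fit <- sampling(mod, chains = 10, iter = 2000, refresh = 0,
                  control = list(adapt_delta = adapt_delta))
  sapply(get_sampler_params(fit, inc_warmup = FALSE),
         function(x) mean(x[, "accept_stat__"]))
}

sapply(c(0.8, 0.95), mean_accept)  # one column per adapt_delta, one row per chain
```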

4 Likes

Well, I did kinda make it up (edit: and so maybe it wasn’t the most coherent thought). I was just trying to figure out a reason for the 10x, and not a 1x.

Just seems like we’d do a 1x, and the adaptation-happens-quickly-at-the-beginning argument would mean that we’d equilibrate up quickly anyway.

That kinda happens anyway right now in those plots you made, cuz it looks like the timestep starts high and then crashes waaaaay down and has to climb again. So if it just started low and climbed, it seems like it’d be the same thing, just starting closer to the answer.

Interesting plots.

Yeah if you want to do experiments I’m happy to make you a cmdstan with that factor of 10 as a parameter. It’s slightly more difficult but we could wrap it in a custom cmdstanr so you could use it from R as well. Just poke me if/when you get the time/desire to do more experiments.

2 Likes

I haven’t but you bring up an interesting point and have great plots to back it up. I completely agree we need to consider this in improving our adaptation algorithm.

Is there no way for it to go up otherwise?

Is there a way to make it something like an environment variable? Or is there another easy way to do this?

2 Likes

I know how to add cmdstan parameters, so that’s the easiest way for me :D. It’d just take 15 mins or something.

1 Like

Are you wanting to experiment? If so I’ll put up a cmdstan branch.

1 Like

One thing to keep in mind is that the dual averaging actually targets the logarithm of the step size, not the step size itself. By Jensen’s Inequality this implies that the procedure will, in general, yield (typically positively) biased estimates of the optimal step size. The behavior you see could just be the bias inherent to trying to adapt to the log step size.

To verify that the step size adaptation is being prematurely terminated one would have to show that the average acceptance statistic converges closer to 0.8 as the terminal window size is increased, ideally systematically over a wide range of varying conditions.

Another consideration is the practical consequence of the current setup. Does a longer terminal window, and hence a longer warmup, achieve better performance in the main sampling phase? The cost as a function of the target average acceptance statistic is relatively flat, so there often isn’t much cost to not achieving 0.8 exactly. Also keep in mind that the current adaptation criterion is suboptimal, but unfortunately the pull request implementing that fix was obstructed.

We might very well do better with a longer terminal window, but it’s not yet clear if that’s the case!

1 Like

Reviving this thread to share some potentially undesirable adaptation behavior that I’ve run into recently. I dug up this thread in order to link to it and realized that it’s all relevant.

Anyway, what I’ll show here is just from a single model, and I fully accept that it might be the case that (near-)optimal general-purpose adaptation behavior happens to fail for this particular model (i.e. one model can’t tell anybody if the adaptation algorithm is a good general-purpose algorithm or not). Still, I think this example reveals some behavior that’s worth mulling over.

This is a real-world model with about 250,000 parameters, fit to about 2 million data points. It takes days to fit (even with reduce_sum on an arbitrarily large number of cores), partly because it requires treedepths of 8 or so.

Here’s a plot of stepsize__ versus accept_stat__ for warmup iterations 500-1500 of a single chain run for 1500 warmup iterations (thus this is a snapshot of the reasonably well-adapted part of the windowed phase plus the term buffer):

Purple points resulted in treedepths of 9, orange points 8, and black points 7; red points diverged. The horizontal line is at acceptance probability 0.8, and the vertical line is at the adapted stepsize of 0.016, which was the result of this particular adaptation run. The line through the data is a smoothed fit (without much thought put into the smoothing parameters; I just want to show the location of the approximate mean). Here’s what I notice:

  • The adapted step-size is substantially smaller than the step-size that yields mean acceptance probabilities of 0.8.
  • In this model, the treedepth is very tightly controlled by the step-size, with extremely narrow regions of stochasticity, and likewise for divergences.

A couple of remarks:

  1. Maybe this is an edge case, but it suggests a class of models for which the claim upthread, that the cost is relatively flat as a function of the target acceptance statistic so there isn’t much cost to missing 0.8 exactly, isn’t true:

If we could get the step-size up to about 0.027 we could cut the number of gradient evaluations in half. Now in this particular example that cross-over point happens to be very near where the mean acceptance probability crosses 0.8, but imagine if the transition point between treedepths of 7 and 8 were just a touch further to the left…
Would it be way too kludgy to entertain an adaptation algorithm that checks whether the adapted step-size is near a transition point between treedepths and uses some heuristic to decide whether to cheat it over a bit? This would produce a 2x speedup for free in models that happen to land right near one of these transitions.

  2. Here’s a plot of the tree-depths by iteration, again with divergences in red:


    The term buffer does something a bit strange. I understand the need to aggressively probe large step-sizes as the metric is updated, but why does it aggressively probe such small step-sizes? Adaptation has known for nearly 1000 iterations that it doesn’t need to use a step-size so small as to yield a treedepth of 9, yet in the 50-iteration term buffer it goes there 18 times! Even aside from concerns about whether the final adapted stepsize is a good choice, those extra 18*256 gradient evaluations (compared to staying in the realm of treedepth = 8) are a nontrivial amount of computation, on the order of several core-hours (albeit small compared to the entirety of the job).

  3. It’s pretty clear that the term buffer has not allowed the model to settle down to its stationary distribution of step-size exploration, much as was reported above.

  4. Given the first plot that I showed, would it be super-dangerous for me to run the warm-up, then eyeball this plot to select a step-size of .027, and then just use that for the sampling phase, enjoying 2-fold speedup in samples per second and nearly 2-fold speedup in effective samples per second?

Edited to add:
Upthread, there was some discussion of using step-size multipliers smaller than 10 at the beginning of each window. Perhaps an alternative solution for the term buffer would be to bound the exploration from below. For example, in my case the term buffer spends nearly half of its time exploring step sizes that are smaller than any stepsize that produced an acceptance probability of less than 0.8 anywhere in the previous 1000 iterations! Surely it isn’t useful in general for the term buffer to clean out the cobwebs quite so deep in the basement.

2 Likes

The plot sure makes it seem like this should work. I am curious as well what you get.

I’m surprised by how sharp the transitions are between treedepths here.

It would be nice if this wasn’t the case. I forget why we always do doublings on the tree; is it that the number of comparisons goes through the roof if, instead of doubling the number of steps, you just add them in blocks?

@nhuurre

1 Like