It’s a different (but related) issue, but for those digging into this area, it may be worth considering some kind of gradual rise in max_treedepth up to the specified value. For some models with, I assume, a poorly scaled mass matrix, limiting the treedepth made effective warmup much faster.
Very interesting. I’ll look into this a little more. I wonder if it is because some of the steps from early on, where \epsilon was set to 10x the previous stage, were included in the calculation of the final \epsilon. Do you have an example model, or is it just a single Gaussian parameter with no data? I’d like to run it in RStan, since that’s where all my diagnostic code lives.
It can be quite problematic in some more complicated models, especially early on in the warmup. For example, in the hierarchical model I’m working on, the correct step size after warmup is complete is \approx 10^{-3}; however, early on in the warmup I can get step sizes of \approx 10^{-8}, particularly at the beginning of adaptation windows when \epsilon is reset to 10\cdot\epsilon.
Yes! I’ve noticed this is also a possible workaround. Having the tiny step sizes is not such a problem when the max treedepth is smaller. I also tried doing a small Stan run with a lower max treedepth and \delta of 0.5, then drawing a single sample and using that to initialize the longer Stan runs. It worked pretty well, but it seemed like fixing the step size optimization might be a cleaner fix. In retrospect, doing the two Stan runs may be better since it doesn’t require modifying the core Stan code.
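For anyone who wants to try that two-run workaround, here is a rough sketch in RStan; the file name "model.stan" and the parameter name x are placeholders, and the pilot settings are only illustrative:

library(rstan)

# Cheap pilot run: small max_treedepth and a low acceptance target,
# just to get a draw somewhere near the typical set.
pilot <- stan("model.stan", chains = 1, iter = 600, warmup = 400,
              control = list(max_treedepth = 6, adapt_delta = 0.5))

# Use one post-warmup draw of the (placeholder) parameter x to
# initialize all chains of the full-length run.
draws <- extract(pilot, permuted = TRUE)
init_one <- list(x = as.array(draws$x[1, ]))
fit <- stan("model.stan", chains = 4, iter = 2000,
            init = rep(list(init_one), 4))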
The latter, yes:
parameters {
  vector[1] x;
}
model {
  real mu = 0.0;
  real s = 1.0;
  target += normal_lpdf(x[1] | mu, s);
}
I see the problem. As I understand it from the original NUTS publication, the initial step size is chosen to be relatively large to avoid small step sizes that tend to need more computation time. However, that probably does not take into account subsequent iterations, where a very small step size, caused by a too-large initial step size, takes very long to complete the transition. Here is the original quote:
We recommend setting \mu = \log(10\epsilon_1), since this gives the dual averaging algorithm a preference for testing values of \epsilon that are larger than the initial value \epsilon_1. Large values of \epsilon cost less to evaluate than small values of \epsilon, and so erring on the side of trying large values can save computation.
Theoretically, there is no problem choosing \mu = \log(\epsilon_1) instead.
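For reference, the dual averaging update from the NUTS paper, with H_i = \delta - \alpha_i, shrinkage point \mu, and tuning parameters \gamma, t_0, and \kappa, is

\log \epsilon_{t+1} = \mu - \frac{\sqrt{t}}{\gamma\,(t + t_0)} \sum_{i=1}^{t} \left(\delta - \alpha_i\right),
\qquad
\log \bar{\epsilon}_{t+1} = t^{-\kappa} \log \epsilon_{t+1} + \left(1 - t^{-\kappa}\right) \log \bar{\epsilon}_{t},

so the iterates are shrunk towards \mu, and choosing \mu = \log(10\,\epsilon_1) rather than \log(\epsilon_1) is exactly what biases the early exploration towards larger step sizes.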
General wisdom when it comes to algorithms like this: the defaults are designed to work for the bulk of applications, and there will always be edge cases where they aren’t as effective. The corollary is that the defaults not being extremely effective in a few hard problems does not imply that they’re not working as intended.
Dual averaging, like many algorithms, is fragile in an algorithmic sense. If you start heuristically changing the updates then you lose all of the guarantees that make the correct implementation relatively robust. The knobs that you want to be turning here are not the dual averaging updates but rather the dual averaging configuration parameters, all of which are exposed through CmdStan (and I think the other interfaces as well).
Matt Hoffman spent a lot of time tweaking these to robust defaults, and I played around with them to confirm that perturbations can have pretty deleterious effects. But if you want to reduce the inertia of the adaptation to avoid early failures, for example, then you can do this by tweaking the learning rate. That said, it will typically take multiple runs to figure these out, and in that time you could have just let the default adaptation do its thing.
The fundamental issue here is that the behavior you’re seeing is indicative of a model where the integrator error goes through a phase transition around a certain step size, changing rapidly from large to small. Because dual averaging relies on the local curvature of the objective function to inform each update, the local length scales of the objective function have to be learned first. Learning those length scales in models with these rapid changes is difficult and leads to the slightly erratic behavior seen in the initial updates before they can be pinned down reasonably well. The fact that it only takes a few iterations before dual averaging recovers really speaks to how robust these defaults are! Note that this intuition applies to any adaptation method that relies on local curvature that it has to learn.
One could also argue that these kinds of cliffs in the dual averaging objective function indicate that your model implementation has some weak pathologies hiding in your posterior, especially out in the tails where warmup is initialized, and that the right response is to clean up the model first. There’s always a fine line between tweaking the algorithm and tweaking the model; the former is useful in some cases, but it’s just something to keep in mind.
Anyway, have you played around with the \gamma, \kappa, and t_{0} dual averaging parameters to see if they are effective in improving adaptation for your problem? There is some discussion of their effects in the original NUTS paper.
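For concreteness, here is a sketch of what tweaking those parameters looks like from RStan (CmdStan exposes the same settings through its adapt arguments); the values shown are just the documented defaults, and "model.stan" is a placeholder:

library(rstan)

# Dual averaging configuration: adapt_gamma controls the shrinkage towards mu,
# adapt_t0 damps the earliest updates, adapt_kappa controls how quickly the
# averaged iterate forgets early values, adapt_delta is the target acceptance.
fit <- stan("model.stan",
            control = list(adapt_delta = 0.8,
                           adapt_gamma = 0.05,
                           adapt_kappa = 0.75,
                           adapt_t0    = 10))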
Thank you for these two reminders!
However, regarding the point that all of the dual averaging configuration parameters are exposed: you can change all of them except how \mu is set. See https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/nuts/adapt_dense_e_nuts.hpp#L41 . Exposing this part would also be nice, but it would probably entail changing not only all of the dependent code in the Stan repository itself, but also the interfaces.
Good call, that would make a good issue for GitHub if you’re up for it. That said, it sounds like the problems being discussed had more to do with the learning rates gamma and kappa, and even t0 to some extent, which control how long the influence of mu persists in the adaptation.
To be clear, there is no doubt room for improvement in the adaptation, but robust improvements have to be supported by theoretical understanding and/or a boatload of empirical evidence to ensure that we at least maintain the status quo for most users.
In principle, yes. But to understand the scope a bit better, how should that change propagate to the interfaces?
I have added the issue https://github.com/stan-dev/stan/issues/2670 .
There is value in setting it to some multiple of the previous step size, especially early on in the adaptation during the first few adjustments of the mass matrix, since the updated mass matrix can often support a higher optimal step size, and this allows that step size to be reached more quickly.
It’s quite possible that having this as a tuneable parameter would solve much of the problem. Part of the reason I end up with such low step sizes that take forever to recover from is that the step size is set too high at the beginning of the window, and the adaptation’s momentum then drives it far too low.
However the other part of the problem happens in the initialization phase.
There are certainly pathologies in the far tails of the model. There is an order of magnitude difference in the log-probability between the start of warmup and where the model settles by the time sampling starts. There are plenty of divergences during warmup, but they are gone in sampling. My understanding from comments on the forum, and from the fact that Stan does not report divergences during warmup, is that divergences during warmup do not present a significant problem for model fitting. In looking into modifications to the warmup algorithm, I’m trying to see if there is a way to prevent these pathologies from causing the algorithm to take a really long time to reach the typical set.
To illustrate, here are some example plots of the sampling parameters.
This is a well-behaving model:
It took over a week to run, so I haven’t tested it on the new warmup configuration. Note how long it takes for the stepsize to recover in chain 3 after it initially drops down, and the “sawtooth” pattern that keeps bringing it lower after the initial drop.
Here is an example of an especially problematic model:
Notice how the step size for chain 3 keeps decreasing until the first 10x multiplier. It still requires many more leapfrog steps throughout the warmup and sampling. Dots are divergent transitions. So the model still has some issues, but having the warmup take so long really gets in the way of running the model to look at the diagnostics.
Here is the same model with the symmetric step size adjustment:
In this case, chain 4 drops down, but recovers much more quickly. Chain 3 still seems to be requiring more leapfrog steps than the others, but overall it runs much faster than when using the default step size adjustment algorithm.
The change I proposed is just changing the objective function of the optimization from minimizing E[\|\delta-\alpha\|] to E[\|\min(\delta-\alpha,\, 1-\delta)\|].
Whether this is better in general, I don’t yet know. However, it produced very compelling results on my model, and I thought it was worth sharing. I’ve done a lot of tweaks to the model to eliminate pathologies, and tried various combinations of the three algorithm parameters as well as scaling the warmup windows, and this has by far been the single greatest improvement to the model fitting.
Interestingly, Andrieu and Thoms (2008), referenced in the Hoffman paper, also suggest using H_t = \mathbb{I}_{\delta-\alpha_t > 0} - \mathbb{I}_{\delta-\alpha_t \leq 0}, which I guess would be equivalent to minimizing E[|\delta-\alpha|]. This would require setting \gamma to a higher value to keep the updates on the same scale. I’ll look into this a bit more. When checking how well the step size was working, I had been looking at the average acceptance statistic during sampling; it was usually much higher than \delta, but since the step size optimizer is minimizing \|\delta-\alpha\|, I guess this is to be expected whenever \delta > 0.5.
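To make the comparison concrete, here is a toy sketch (not Stan’s implementation) of the dual averaging recursion with a pluggable H statistic; swapping h_fn is all that distinguishes the default \delta - \alpha_t statistic from the sign-based one above, and a symmetric variant like the one I proposed would just be another h_fn:

# alphas: sequence of acceptance statistics observed during adaptation
adapt_step_size <- function(alphas, h_fn, delta = 0.8, mu = log(10),
                            gamma = 0.05, t0 = 10, kappa = 0.75) {
  h_bar <- 0
  log_eps_bar <- 0
  for (t in seq_along(alphas)) {
    # running weighted average of the H statistic
    h_bar <- (1 - 1 / (t + t0)) * h_bar + h_fn(delta, alphas[t]) / (t + t0)
    log_eps <- mu - sqrt(t) / gamma * h_bar
    # averaged iterate that becomes the adapted step size
    w <- t^(-kappa)
    log_eps_bar <- w * log_eps + (1 - w) * log_eps_bar
  }
  exp(log_eps_bar)
}
h_default <- function(delta, alpha) delta - alpha                     # the usual statistic
h_sign    <- function(delta, alpha) ifelse(delta - alpha > 0, 1, -1)  # sign statistic, i.e. minimizing E[|delta - alpha|]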
Let me be clear: I am not advocating changing Stan based on this test on one model. I merely noticed some remarkable improvement in my model and thought it was worth sharing and looking into more. Even if it does turn out that another objective function is better for a wide variety of simple and pathological models, I would still advocate including it as another option rather than switching the default. The existing code has been in Stan so long, and so many models depend on it, that changing the defaults in such a core part of the code would certainly break someone’s models somewhere when they upgraded.
I completely agree, and I found the step size initialization quite puzzling in the beginning. If a standard warmup is chosen, the initial step size chosen by the user has basically no influence on the algorithm. Additionally, what I saw is that the initialization radius for the parameters is quite small and, depending on the model, it can be almost certain that the parameters are initialized outside the typical set. However, the step size adaptation already starts from the first iteration in the first adaptation window, which makes these sawtooth patterns very likely to occur.
I also saw that, but I haven’t looked into it in more detail yet. I frequently observed that the average acceptance probability during sampling was > 0.9 whereas \delta = 0.8.
The problem with some of this is that we don’t have 50 distinct test models that we know work, and comparing sampler output is tricky.
This would be great. It’s obviously easy to change, or even to expose to user control.
RStan shouldn’t be noticeably slower than CmdStan. If it is in your setup, we’d like to know.
Others have suggested starting with smaller treedepths during initial warmup. We haven’t played around a lot with adaptation since @betanalpha rebuilt it.
I would personally very much like to have a more continuous adaptation model than one based on epochs. Maybe something with a rolling average for covariance, continuous or at least frequent stepsize adaptation, and gradual increase of tree depth.
We just got a PR to add configuration for \mu, so that should be out in 2.19.
I find myself typically fitting the same model or same flavor of model multiple times as I tweak bits of it. In that case, it can be a big win to figure out the adaptation parameters.
To add on to this old conversation, one issue I noticed with dual averaging is that the final average (approximate) acceptance rate is almost always above the target adapt_delta.
When I look at the adaptation parameters (via get_sampler_params), it looks to me like the final window size is too small. Per the above discussion, the algorithm starts high, drops initially to counter that, and eventually comes back up toward the target. But the final window (adapt_term_buffer) is only 50 iterations by default, and it looks like that is not long enough to stabilize. For instance, here’s a simple 3-parameter multivariate model with 10 chains, warmup=1900 and iter=2000, using default settings in RStan (so adapt_delta=0.8). The final step sizes range from 0.018 to 0.024, and the acceptance ratio ranges from 0.88 to 0.94 across the 10 chains; all are clearly much higher than the target of 0.8.
It seems to me that the step sizes are all too small because of that short final adaptation window, seen in this plot:
If I increase adapt_term_buffer to 1000, obviously extreme, then the final step sizes range from 0.027 to 0.035, and the acceptance ratio ranges from 0.82 to 0.86. So bigger step sizes, which shifts the acceptance ratio closer to the target by about 0.06.
I’m not proposing a terminal buffer size of 1000. I just noticed this behavior, which is unexpected from a user’s perspective (specify a target of 0.8 and get >0.9), and it appears closely related to this discussion topic and how the step size changes during the early part of an adaptation window. It may be worth considering this aspect of the adaptation when considering alternatives, or, at a minimum, increasing the default adapt_term_buffer.
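A minimal sketch of the check described above in RStan, assuming fit is the stanfit object from the default run and "model.stan" is a placeholder:

# Per-chain mean acceptance statistic and adapted step size after warmup.
sp <- get_sampler_params(fit, inc_warmup = FALSE)
sapply(sp, function(m) mean(m[, "accept_stat__"]))   # compare to adapt_delta
sapply(sp, function(m) m[1, "stepsize__"])           # adapted step size per chain

# Same model with the (deliberately extreme) longer terminal window.
fit2 <- stan("model.stan", chains = 10, warmup = 1900, iter = 2000,
             control = list(adapt_term_buffer = 1000))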
Thanks @monnahc, this is a useful and clear report.
It seems that the final window could be smaller if the algorithm did not multiply the step size by 10 just before it. Multiplying by 10 is an ad hoc safety measure in case the mass matrix update is unexpectedly big. Pinging @bbbales2, as this is useful information for campfire, too.
I’m no expert, but it seems to me that the biggest issue here is how Stan recalculates a “reasonable” step size at the beginning of each window. I can’t find this in the code, but in the original Hoffman paper this targets an acceptance probability of 0.5 after a single step, and based on the plots I showed with large initial step sizes, that looks like what Stan does now. Despite being “reasonable”, the initial few iterations end in divergences, presumably because with alpha = 0.5 for a single step, the true alpha for a whole trajectory is much lower. The adaptation then drastically lowers epsilon because of the divergences and slowly recovers from that. If I replot the defaults (adapt_delta=0.8) and color by divergence, this is apparent.
This is of course exacerbated if the user sets adapt_delta=0.999. It looks like this:
Admittedly this is extreme, but the adapted epsilon goes from “reasonable” to something like 5 orders of magnitude smaller in the course of 10-15 iterations, before starting its slow recovery toward adapt_delta. To be fair, this model has high correlations (as high as 0.997), so it is extreme. But even if I use metric=dense_e the same behavior persists:
I get the original intent of finding a reasonable step size, and why targeting 0.5 may be a good idea at the very beginning because you’re not in the typical set. But full trajectories will typically have a lower acceptance rate than a single step, and we see this pattern of quick drop and slow recovery by the nature of the adaptation algorithm. That quick drop leads to computationally expensive trajectories and slow warmup, exactly what it’s trying to avoid. I would, naively, argue that these are unreasonably large step sizes.
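For reference, a minimal sketch of that single-step heuristic from the Hoffman and Gelman paper (not Stan’s actual C++); log_prob and grad are hypothetical functions returning the log density and its gradient:

find_reasonable_epsilon <- function(theta, log_prob, grad) {
  eps <- 1
  r <- rnorm(length(theta))                          # fresh momentum draw
  joint <- function(theta, r) log_prob(theta) - 0.5 * sum(r^2)
  leapfrog <- function(theta, r, eps) {
    r <- r + 0.5 * eps * grad(theta)
    theta <- theta + eps * r
    r <- r + 0.5 * eps * grad(theta)
    list(theta = theta, r = r)
  }
  prop <- leapfrog(theta, r, eps)
  log_ratio <- joint(prop$theta, prop$r) - joint(theta, r)
  a <- if (log_ratio > log(0.5)) 1 else -1           # double or halve?
  # keep doubling/halving eps until the one-step acceptance crosses 0.5
  while (a * log_ratio > -a * log(2)) {
    eps <- 2^a * eps
    prop <- leapfrog(theta, r, eps)
    log_ratio <- joint(prop$theta, prop$r) - joint(theta, r)
  }
  eps
}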
I would consider exploring two alternative options: (1) set the target in find_reasonable_epsilon to adapt_delta for all calls, or (2) after the initial call (first iteration), initialize from the last step size. I like (2) because while that step size might not be “reasonable” the first time Stan adapts after updating the mass matrix, it probably is the rest of the time. Presumably, for my toy example, epsilon adaptation would be more stable and overall Stan would be faster. Playing with the Stan source is a bit out of my comfort zone, but both of these seem fairly easy to try.
Curious to see if anyone has experimented with this or has any thoughts.
Right now Stan just starts each interval with a timestep 10x the last timestep used in the previous interval. It’s hardcoded and not something you can change easily (https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/nuts/adapt_diag_e_nuts.hpp#L38). The plots you made make a really compelling case that we shouldn’t be doing this.
I’ve always sort of just ignored it because 15 samples didn’t seem like that much, but especially with that last 50-iteration adaptation window, that logic seems really shortsighted.
I know that the metric can change quite a bit from warmup window to warmup window, but I agree it seems more sensible to use the timestep from the last window as the guess in the new one.
I assume the 10x comes from issues where the timestep was crashing to zero and then not recovering without this 10x boost.
Thanks for taking the time to make these plots. @monnahc do you think you’d have time to try changing that 10x multiplier around and testing it?
Which Stan are you using? If you’re doing this with CmdStan, I can help you get set up with a version that lets you change the 10x to whatever you want.
Also @monnahc, if you want some other models to test, we have a bunch in the CmdStan performance testing suite.
The dual averaging target is constant through the entire adaptation; the behavior @monnahc is noticing here is due to how the dual averaging is reinitialized at the end of every window to a large multiple of the previous adapted step size. The motivation is that after each window the metric is updated, which should lead to a better-conditioned system that admits a larger step size.
The reinitialization to 10 times the previous step size is meant to aggressively probe large values, which typically works well for well-behaved models. Even if the guess is too aggressive, the dual averaging settles into an accurate value relatively quickly. As Cole notes in his empirical studies, however, the dual averaging can be much slower for less well-behaved models.
The easiest solution when encountering more challenging models is to increase warmup and the terminal buffer by the same amount, to give the dual averaging more time to equilibrate. Alternatively, the dual averaging parameters themselves can be tweaked to try to compensate for the structure of the objective function \bar{a} - a_{\text{opt}}(\epsilon).
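In RStan terms, that first suggestion looks something like the following sketch (the extra 450 iterations and "model.stan" are purely illustrative):

# Extend warmup and the terminal buffer by the same amount, leaving the
# earlier adaptation windows untouched.
fit <- stan("model.stan", warmup = 1000 + 450, iter = 2000 + 450,
            control = list(adapt_term_buffer = 50 + 450))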
Changing the defaults requires a more careful consideration of the span of well-posed and less well-posed models in common use, which of those we want to prioritize with the defaults, and what the costs are of step sizes that can be underestimated.