Issue with dual averaging


#22

It’s a different (related) issue, but for those digging into this area, maybe it’s worth considering some kind of gradual rise in max_treedepth, to that specified. For some models with, I assume, a poorly scaled mass matrix, treedepth limits made effective warmup much faster.


#23

Very interesting. I’ll look into this a little more. I wonder if it is because some of the steps from early on where \epsilon was set to 10x the previous stage were included in the calculation of the final epsilon. Do you have an example model, or is it just a single Gaussian parameter with no data? I’d like to run it in Rstan, since thats where all my diagonstic code lives.

It can be quite problematic in some more complicated models, especially early on in the warmup. For example, in the hierarchical model that I’m working on the correct step size after warmup is complete is \approx 10^{-3}, however early on in the warmup I can get stepsizes of \approx 10^{-8}, particularly at the beginning of adapt windows when \epsilon is set to 10\cdot\epsilon.

Yes! I’ve noticed this is also a possible workaround. Having the tiny step sizes is not such a problem when the max treedepth is smaller. I also tried doing a small stan run with a lower max treedepth and \delta of 0.5, then drawing a single sample and using that to initialize the longer Stan runs. It worked pretty well, but it seemed like fixing the step size optimization might be a cleaner fix. In retrospect, doing the two Stan runs may be better since it doesn’t require modifying the core Stan code.


#24

The latter, yes:

parameters {
    vector[1] x;
}
model {
    real mu = 0.0;
    real s  = 1.0;
    
    target += normal_lpdf( x[1] | mu, s );
}

I see the problem. As I understood it in the original NUTS publication, the initial stepsize is chosen to be relatively large to avoid small stepsizes that tend to need more computation time. However, that probably does not take into account the subsequent iterations where a very small stepsize, caused by a too large initial stepsize, takes very long to complete the transition. Here is the original quote:

We recommend setting \mu = \log(10\epsilon_1), since this gives the dual averaging algorithm a preference for testing values of \epsilon_1 that are larger than the initial value \epsilon_1. Large values of \epsilon cost less to evaluate than small values of \epsilon, and so erring on the side of trying large values can save computation.

Theoretically, there is no problem choosing \mu = \log(\epsilon_1) instead.


#25

General wisdom when it comes to algorithms like this – the algorithm defaults are designed to work for the bulk of applications and there will always be edge cases where the defaults aren’t as effective. The corollary to this is that the defaults not being extremely effective in a few hard problems does not imply that they’re not working as intended.

Dual averaging, like many algorithms, is fragile in an algorithmic sense. If you start heuristically changing the updates then you lose all of the guarantees that make the correct implementation relatively robust. The knobs that you want to be turning here are not the dual averaging updates but rather the dual averaging configuration parameters, all of which are exposed through CmdStan (and I think the other interfaces as well).

Matt Hoffman spend a lot of time tweaking these to robust defaults and I played around with them to confirm that perturbations can have pretty deleterious effects. But if you want to reduce the inertia of the adaptation to avoid early failures, for example, then you can do this by tweaking the learning rate. That said, it will typically take multiple runs to figure these out and in that time you could have just let the default adaptation due its thing.

The fundamental issue here is that that behavior that you’re seeing is indicative of a model where the integrator error goes through a phase transition around a certain step size, changing rapidly from large to small. Because dual averaging relies on the local curvature of the objective function to inform each update, the local length scales of the objective function have to first be learned. Learning those length scales in models with these rapid changes is difficult and leads to the slightly erratic behavior seen in the initial updates before they can be pinned down reasonably well. The fact that it only takes a few iterations before dual averaging recovers really speaks to how robust these defaults are! Note that this intuition applies to any adaptation method that relies on local curvature that it has to learn.

One could also argue that these kinds of cliffs in the dual averaging objective function indicate that your model implementation has some weak pathologies hiding in your posterior, especially out in the tails where warmup is initialized, and that the right response is to clean up the model first. There’s always a fine line between tweaking the algorithm versus tweaking the model and the former is useful in some cases, but just something to keep in mind.

Anyways, have you played around with the \gamma, \kappa, and t_{0} dual averaging parameters to see if they are effective in improving adaptation for your problem? There is some discussion of their effects in the original NUTS paper.


#26

Thank you for these two reminders!

However, regarding

You can change all parameters except how to set \mu. See https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/nuts/adapt_dense_e_nuts.hpp#L41 . Exposing this part would also be nice, but would probably entail to change not only all dependent code in the Stan repository itself, but also in the interfaces.


#27

Good call – that would make a good issue for GitHub if you’re up for it. That said, is sounds like the problems being discussed had more to do with the learning rates gamma and kappa, and even t0 to some extent, which control how long the influence of mu persists in the adaptation.

To clear, there is no doubt room for improvement in the adaptation but robust improvements have to be supported by theoretical understanding and/or a boatload of empirical evidence to ensure that we at least maintain the status quo for most users.


#28

In principle, yes. But to understand the scope a bit better, how should that change propagate to the interfaces?


#29

I have added the issue https://github.com/stan-dev/stan/issues/2670 .


#30

There is value in setting it to some multiple, especially early on in the adaptation in the first few adjustments of the mass matrix, since often the updated mass matrix can result in a higher optimal step size, and this allows that step size to be reached more quickly.

It’s quite possible that having this as a tuneable parameter would solve much of the problem. Part of the reason I end up with such low stepsizes that take forever to recover from is that the step size is set too high at the beginning of the iteration, and then the momentum sends the step size to be far too low.

However the other part of the problem happens in the initialization phase.

There are certainly pathologies in the far tails of the model. There is an order of magnitude difference in the log-probability between the start of the warmup and where the model settles by the time sampling starts. There are plenty of divergences during warmup, but they are gone in sampling. My understanding from comments on the forum and that Stan does not report divergences during warmup is that divergences during warmup do not present a significant problem in model fitting. In looking into modifications to the warmup algorithm, I’m trying to see if there is a way to prevent these pathologies from causing the algorithm to take a really long time to reach the typical set.

To illustrate, here are some example plots of the sampling parameters.

This is a well-behaving model
It took over a week to run, so I haven’t tested it on the new warmup configuration. Note how long it takes for the stepsize to recover in chain 3 after it initially drops down, and the “sawtooth” pattern that keeps bringing it lower after the initial drop.

Here is an example of an especially problematic model
Notice how the stepsize for chain 3 keeps decreasing until the first 10x multiplier. It still requires many more leapfrog steps throughout the warmup and sampling. Dots are divergent transitions. So the model still as some issues, but having the warmup take so long really gets in the way of running the model to look at the diagnostics.

Here is the same model with the symmetric step size adjustment
In this case, chain 4 drops down, but recovers much more quickly. Chain 3 still seems to be requiring more leapfrog steps than the others, but overall it runs much faster than when using the default step size adjustment algorithm.


The change I proposed is just changing the objective function of the optimization from minimizing E[||\delta-\alpha||] to E[||\min( \delta-\alpha, 1-\delta)||]].

Whether this is in general better or not, I don’t yet know. However, it produced results that were very compelling on my model, and thought it was worth sharing. I’ve done a lot of tweaks to the model to eliminate pathologies, and tried various combinations of changing the three algorithm parameters, as well as the scaling the warmup windows, and this has by far been the single greatest improvement to the model fitting.

Interestingly, Andrieu and Thoms 2004, referenced in the Hoffman paper, also suggest using H_t = \mathbb I_{\delta-\alpha_t > 0} - \mathbb I_{\delta-\alpha_t \leq 0}, which I guess would be equivalent to minimizing E[|\delta-\alpha|]. This would require setting \gamma to a higher value to keep the updates on the same scales. I’ll look into this a bit more, in looking how well the stepsize was working I had been looking at the average acceptance stat during sampling. It was usually much higher than \delta, but the step size optimizer is minimizing ||\delta-\alpha||, so I guess this to be expected whenever \delta > 0.5.

Let me be clear, I am not advocating changing Stan based on this test in one model. I merely noticed some remarkable improvement in my model and thought it was worth sharing and looking into more. Even if it does turn out that another objective function is better for a wide variety of simple and pathological models, I still would advocate including it as another option rather than switching over the default. The existing code has been in Stan so long and there are so many models that depend on it that changing the defaults in such a core part of the code would certainly break someones models somewhere when they upgraded.


#31

I completely agree, and I found the stepsize initialization quite puzzling in the beginning. If a standard warmup is chosen, then the initial stepsize chosen by the user has basically no influence on the algorithm. Additionally what I saw is that the initialization radius for the parameters is quite small and, depending on the model, it can be almost certain that the parameters are initialized outside the typical set. However, the stepsize adaptation starts already from the first iteration in the first adaptation window. Which make these sawtooth patterns very likely to occur.

I also saw that, but I didn’t look into it in more detail yet. I observed frequently that the average acceptance probability during sampling was > 0.9 whereas \delta = 0.8.


#32

The problem with some hat we don’t have 50 distinct test models that we know work and comparing sampler output is tricky.


#33

This would be great. It’s obviously easy to change or even more close to user control.


#34

RStan shouldn’t be noticeably slower than CmdStan. If it is in your setup, we’d like to know.

Others have suggested starting with smaller treedepths during initial warmup. We haven’t played around a lot with adaptation since @betanalpha rebuilt it.

I would personally very much like to have a more continuous adaptation model than one based on epochs. Maybe something with a rolling average for covariance, continuous or at least frequent stepsize adaptation, and gradual increase of tree depth.

We just got a PR to add configuration for \mu, so that should be out in 2.19.

I find myself typically fitting the same model or same flavor of model multiple times as I tweak bits of it. In that case, it can be a big win to figure out the adaptation parameters.