Issue with dual averaging

@bbbales2 I’m not sure what you’re getting at with this:

I assume the 10x comes from issues where the timestep was crashing to zero and then not recovering without this 10x boost.

If it’s too small, isn’t the concern that alpha will recover to adapt_delta in a few iterations, but that the trajectories will be too long? I played around with it a bit outside of Stan and it’s not a silver bullet to just lower \mu. I think it would take some serious tinkering. I’ve never compiled Stan and typically use rstan, but if these kinds of experiments, added to what was initially discussed in this thread, are of interest I could try to do it this summer (paternity leave looming at the moment).
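For anyone who wants to tinker outside of Stan, here’s a minimal sketch of the dual-averaging update from Hoffman & Gelman (2014) that Stan’s step size adaptation is based on. The `mu_scale` argument is the 10x factor under discussion (the iterates are shrunk toward mu = log(10 * eps0)); the toy `accept_prob` function at the bottom is purely illustrative, not anything from a real model.

```python
import math

def dual_averaging(accept_prob, eps0, n_iter=1000, delta=0.8,
                   gamma=0.05, t0=10.0, kappa=0.75, mu_scale=10.0):
    # mu_scale is the 10x factor discussed above: iterates are
    # shrunk toward mu = log(mu_scale * eps0).
    mu = math.log(mu_scale * eps0)
    h_bar = 0.0            # running average of (delta - alpha)
    log_eps = math.log(eps0)
    log_eps_bar = 0.0      # averaged iterate, reported at the end
    for t in range(1, n_iter + 1):
        alpha = accept_prob(math.exp(log_eps))
        eta = 1.0 / (t + t0)
        h_bar = (1.0 - eta) * h_bar + eta * (delta - alpha)
        log_eps = mu - math.sqrt(t) / gamma * h_bar
        w = t ** (-kappa)
        log_eps_bar = w * log_eps + (1.0 - w) * log_eps_bar
    return math.exp(log_eps_bar)

# Toy acceptance model: alpha = exp(-eps), so alpha = 0.8 at eps ~ 0.223.
adapted = dual_averaging(lambda eps: math.exp(-eps), eps0=1.0)
```

Lowering `mu_scale` changes where the early iterates get pulled, which is exactly the kind of experiment discussed above.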

@betanalpha I follow with respect to tweaking the other parameters. I think the easiest thing on the user end is to increase the terminal buffer to give it more time to equilibrate. I also get that changing core Stan functionality is not something done lightly. I basically wanted to flag this issue for when/if a redesign of the adaptation is in play.

I would also push back a bit on the idea that this only happens for less well-behaved models. If you take an IID standard normal and run it through Stan you get the same issue. Here’s the mean alpha over sampling iterations for 10 chains across 5 dimensions and 2 adapt_delta values.

This issue is pronounced for adapt_delta=0.8, where all chains have step sizes biased low. Interestingly that pattern doesn’t hold for 0.95.

That is only to say that I think this issue is more general than most folks expect, and that may motivate a proper investigation of the optimal adaptation scheme across a wide range of models.

4 Likes

Well, I did kinda make it up (edit: and so maybe it wasn’t the most coherent thought). I was just trying to figure out a reason for the 10x, and not a 1x.

Just seems like we’d do a 1x, and the adaptation-happens-quickly-at-the-beginning argument would mean that we’d equilibrate up quickly anyway.

That kinda happens anyway right now in those plots you made, cuz it looks like the timestep starts high and then crashes waaaaay down and has to climb again. So if it just started low and climbed, it seems like the same thing, except it’d start closer to the answer.

Interesting plots.

Yeah if you want to do experiments I’m happy to make you a cmdstan with that factor of 10 as a parameter. It’s slightly more difficult but we could wrap it in a custom cmdstanr so you could use it from R as well. Just poke me if/when you get the time/desire to do more experiments.

2 Likes

I haven’t but you bring up an interesting point and have great plots to back it up. I completely agree we need to consider this in improving our adaptation algorithm.

Is there no way for it to go up otherwise?

Is there a way to make it something like an environment variable? Or is there another easy way to do this?

2 Likes

I know how to add cmdstan parameters, so that’s the easiest way for me :D. It’d just take 15 mins or something.

1 Like

Are you wanting to experiment? If so I’ll put up a cmdstan branch.

1 Like

One thing to keep in mind is that the dual averaging actually targets the logarithm of the step size, not the step size itself. By Jensen’s Inequality this implies that the procedure will, in general, yield (typically positively) biased estimates of the optimal step size. The behavior you see could just be the bias inherent to trying to adapt to the log step size.
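A quick numerical illustration of the Jensen’s Inequality point (toy numbers, nothing Stan-specific): averaging on the log scale and exponentiating systematically differs from averaging step sizes on the natural scale whenever the log step size is noisy.

```python
import math, random

random.seed(1)
# Noisy log step sizes centered at log(0.2), mimicking the spread
# of dual-averaging iterates during warmup.
log_eps = [math.log(0.2) + random.gauss(0.0, 0.5) for _ in range(100_000)]

# What adapting on the log scale reports vs. the average step size:
exp_of_mean = math.exp(sum(log_eps) / len(log_eps))
mean_of_exp = sum(math.exp(x) for x in log_eps) / len(log_eps)
# Since exp is convex, Jensen gives exp(E[log eps]) <= E[eps] always.
```

The gap between the two grows with the variance of the log step size, so the noisier the adaptation, the larger the inherent bias.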

To verify that the step size adaptation is being prematurely terminated one would have to show that the average acceptance statistic converges closer to 0.8 as the terminal window size is increased, ideally systematically over a wide range of varying conditions.

Another consideration is the practical consequence of the current setup. Does a longer terminal window, and hence a longer warmup, achieve better performance in the main sampling phase? The cost as a function of the target average acceptance statistic is relatively flat, so there often isn’t much cost to not achieving 0.8 exactly. Also keep in mind that the current adaptation criterion is suboptimal, but unfortunately the pull request implementing that fix was obstructed.

We might very well do better with a longer terminal window, but it’s not yet clear if that’s the case!

1 Like

Reviving this thread to share some potentially undesirable adaptation behavior that I’ve run into recently. I dug up this thread in order to link to it and realized that it’s all relevant.

Anyway, what I’ll show here is just from a single model, and I fully accept that it might be the case that (near-)optimal general-purpose adaptation behavior happens to fail for this particular model (i.e. one model can’t tell anybody if the adaptation algorithm is a good general-purpose algorithm or not). Still, I think this example reveals some behavior that’s worth mulling over.

This is a real-world model with about 250,000 parameters, fit to about 2 million data points. It takes days to fit (even with reduce_sum on an arbitrarily large number of cores), partly because it requires treedepths of 8 or so.

Here’s a plot of stepsize__ versus accept_stat__ for warmup iterations 500-1500 of a single chain run for 1500 warmup iterations (thus this is a snapshot of the reasonably well-adapted part of the windowed phase plus the term buffer):

Purple points resulted in treedepths of 9; orange points, 8; black points, 7; and red points diverged. The horizontal line is at acceptance probability 0.8, and the vertical line is at the adapted stepsize of 0.016, which was the result of this particular adaptation run. The line through the data is a smooth (without much thought put into the smoothing parameters; I just want to show the location of the approximate mean). Here’s what I notice:

  • The adapted step-size is substantially smaller than the step-size that yields mean acceptance probabilities of 0.8.
  • In this model, the treedepth is very tightly controlled by the step-size, with extremely narrow regions of stochasticity, and likewise for divergences.

A couple of remarks:

  1. Maybe this is an edge case, but it suggests a class of models in which it isn’t true that:

If we could get the step-size up to about 0.027 we could cut the number of gradient evaluations in half. Now in this particular example that cross-over point happens to be very near where the mean acceptance probability crosses 0.8, but imagine if the transition point between treedepths of 7 and 8 were just a touch further to the left…
Would it be way too kludgy to entertain an adaptation algorithm that checks whether the adapted step-size is near a transition point between treedepths and uses some heuristic to decide whether to cheat it over a bit? This would produce a 2x speedup for free in models that happen to land right near one of these transitions.

  2. Here’s a plot of the tree-depths by iteration, again with divergences in red:


    The term buffer does something a bit strange. I understand the need to aggressively probe large step-sizes as the metric is updated, but why does it aggressively probe such small step-sizes? Adaptation has known for nearly 1000 iterations that it doesn’t need to use a step-size so small as to yield a treedepth of 9, yet in the 50-iteration term buffer it goes there 18 times! Even aside from concerns about whether the final adapted stepsize is a good choice, those extra 18*256 gradient evaluations (compared to staying in the realm of treedepth = 8) are a nontrivial amount of computation, on the order of several core-hours (albeit small compared to the entirety of the job).

  3. It’s pretty clear that the term buffer has not allowed the step-size adaptation to settle down to its stationary distribution of step-size exploration, much as was reported above.

  4. Given the first plot that I showed, would it be super-dangerous for me to run the warm-up, then eyeball this plot to select a step-size of .027, and then just use that for the sampling phase, enjoying 2-fold speedup in samples per second and nearly 2-fold speedup in effective samples per second?
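Just to make the “cheat it over a bit” idea concrete, a hypothetical heuristic (emphatically not anything in Stan) might look like the sketch below, where `boundary_eps` would have to be estimated from the warmup history and both names are made up for illustration:

```python
def nudge_stepsize(eps, boundary_eps, tol=0.15):
    # Hypothetical heuristic, not part of Stan's adaptation:
    # if the adapted step size sits just below a step size known
    # (from warmup history) to drop the treedepth by one, push it
    # slightly past the transition to roughly halve trajectory cost.
    if eps < boundary_eps <= eps * (1.0 + tol):
        return boundary_eps * 1.02
    return eps

# Adapted eps of 0.024 just below a treedepth 8-to-7 boundary at 0.025:
nudged = nudge_stepsize(0.024, 0.025)  # pushed just past the boundary
kept = nudge_stepsize(0.016, 0.025)    # too far below; left unchanged
```

The hard part, of course, is estimating `boundary_eps` reliably and deciding how much acceptance-rate loss the nudge is allowed to cost.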

Edited to add:
Upthread, there was some discussion of using step-size multipliers of smaller than 10 at the beginning of each window. Perhaps an alternative solution for the term buffer would be to bound the exploration from below. For example, in my case the term buffer spends nearly half of its time exploring step sizes that are smaller than any stepsize that produced an acceptance probability of less than 0.8 anywhere in the previous 1000 iterations! Surely it isn’t useful in general for the term buffer to clean out the cobwebs quite so deep in the basement.

2 Likes

The plot sure makes it seem like this should work. I am curious as well what you get.

I’m surprised by how sharp the transitions are between treedepths here.

It would be nice if this wasn’t the case. I forget why we always do doublings on the tree – is it that the number of comparisons goes through the roof if, instead of doubling the number of steps, you just add them in blocks?

@nhuurre

1 Like

I think the doublings have something to do with (checks notes) ensuring the transitions are reversible (waves hands) or something?

1 Like

Would it? Does an increase in tree depth by one always double the amount of work?

Approximately, yes. It doubles the number of gradient evaluations (provided there’s no divergence). There’s some extra work that happens per iteration (e.g. writing the output to disk) but generally the number of gradient evaluations dominates.
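To put a number on “approximately”: a full (non-terminating) trajectory of a given treedepth needs one gradient evaluation per new state, and each doubling adds as many new states as already exist, so the count is 2^d - 1 (this matches Stan’s reported n_leapfrog__ for a saturated tree).

```python
def n_leapfrog(treedepth):
    # A full NUTS trajectory of treedepth d contains 2**d states;
    # each doubling adds as many new states as already exist, and
    # each new state costs one leapfrog (gradient) evaluation,
    # for 2**d - 1 evaluations in total.
    return 2 ** treedepth - 1

# Going from treedepth 7 to 8 roughly doubles the cost:
ratio = n_leapfrog(8) / n_leapfrog(7)  # 255 / 127, just over 2
```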

Yes, pretty much. The same final trajectory must be reachable from any starting point, and that requires checking every subtrajectory for early termination. The recursive doubling strategy makes that efficient because then there are no overlapping subtrajectories.

1 Like

Isn’t the numerical integrator allowed to terminate early if some criterion is reached? To be honest, the recursion is tripping me up a bit (stan/base_nuts.hpp at develop · stan-dev/stan · GitHub), and I haven’t fully understood it yet.

Yes, it can terminate early if there’s a divergence (as @jsocolar said) or some subtrajectory meets the u-turn criterion. Note that the latter means the final treedepth must vary a lot from sample to sample.

Hm, I don’t understand why this would be so, so I’ll think about it a bit. Thanks!

The trajectory terminates early if continuing would violate detailed balance; in other words, the trajectory starting from point x stops if it reaches a point y such that a trajectory starting from y cannot reach x. That happens only when y is part of a subtrajectory that terminates at a lower treedepth. Both x and y are plausible candidate draws from the target distribution. Although a direct transition between x and y is impossible, the sampler will eventually explore near both and find different treedepths.

4 Likes

Appendix A here describes where the treedepth and doubling stuff comes from: https://arxiv.org/pdf/1701.02434.pdf

2 Likes

You’re misunderstanding “cost” here. Cost isn’t the total number of gradient evaluations but rather the number of gradient evaluations needed for a hypothetical Metropolis proposal to be accepted, which in “nice” circumstances is proportional to the number of iterations needed to increase the effective sample size by a certain amount.

Moreover the adaptation target is not accept_stat at every point, but rather the average acceptance statistic across multiple transitions (also, because Stan doesn’t do Metropolis proposals/corrections, the acceptance statistic is not actually an acceptance probability, hence the intentionally vague name “statistic”). The pointwise values are, in general, very noisy characterizations of the average acceptance probability. Because of the asymmetry of the points I’m guessing that the average within each of those treedepth groups is larger than it might seem, and I’m not sure the smoothed interpolation really captures that properly.

Cutting off the tree depth to reduce the number of gradient evaluations will, under those “nice” circumstances, reduce the total effective sample sizes of your Markov chains which will reduce the precision of your Markov chain Monte Carlo estimators. If you don’t need as much precision as you’re getting then this could help, as could reducing the total number of samples. That said you don’t need to modify the adaptation code to do this – you just set max_treedepth to 8 instead of the default of 10.

Capping max_treedepth, however, can compromise the statistical validity of your fit if your posterior density function has heavy tails. Heavy tails require long integration times to ensure that the sampler can always return to the bulk in a finite amount of time. Capping the maximum tree depth at too small a value can obstruct this return, resulting in a nontrivial probability that Markov chains become stuck out in the tails. More formally, in order to maintain a Markov chain Monte Carlo central limit theorem you may need to keep max_treedepth sufficiently large.

One way to investigate this possibility is to plot the treedepth vs parameters to see if there’s any correlation. In the treedepth vs iteration plot you can see distinct periods where the tree depth is larger; that could correspond to the Markov chain sojourning out into the tails, although here I’m guessing that those periods are different adaptation phases.

The adaptation was designed to be robust, in particular the step size adaptation was made to be memoryless in case any of the inverse metric updates were bad. This gives the adaptation a chance to compensate for the poor adapted metric and maybe recover in later phases.

Early in the terminal window the dual averaging doesn’t have much inertia – bad trajectories can dominate the running adaptation target, resulting in large, sudden changes to the step size. The inertia of the dual averaging can be set with the configuration parameters in CmdStan if you want to play around with changing this, although the performance of dual averaging can be very sensitive to such changes.

As discussed before, the terminal buffer window can always be increased if you need more precise step size adaptation. In this case the adaptation is more challenging because of the divergences, which result in a discontinuous adaptation objective function that slows down the dual averaging convergence. There aren’t any divergences in your terminal buffer window, but there are occasional short trajectories which are consistent with a sudden, asymmetric change in the adaptation objective function.

That assumes that the effective sample size wouldn’t change. If the issue here is purely noise in the adaptation then it would be fine, although one could confirm with larger terminal buffer windows.

Two subtleties here.

The adaptation target is the average acceptance statistic. For any fixed step size some trajectories might result in larger acceptance statistics and some might result in smaller acceptance statistics, and over any finite set of iterations you won’t see all of the acceptance statistics that could be generated from that fixed step size. In other words just because you didn’t see it doesn’t mean it’s not a typical value.

But more importantly those previous 1000 iterations were in the context of a different metric and so the typical behavior is not guaranteed to carry over in any nice way. If the adaptation is working and the target distribution is nice then, yes, the step size will often increase, at least on average. Those are a lot of conditions, however, that can’t be validated until after the fact.

2 Likes

@betanalpha I’m sure you’re right, but it’s not yet clicking for me. Suppose in a hypothetical model, trajectories with step-sizes greater than 0.025 are well-behaved (no divergences) and overwhelmingly terminate at treedepths of 7. Trajectories with step-sizes less than 0.025 are also well-behaved and overwhelmingly terminate at treedepths of 8. Now suppose that the adapted step-size happens to fall at 0.024. What do I lose if I cheat the step size up to 0.026? I would have thought that my trajectories are marginally less precise, resulting in marginally lower acceptance probabilities and marginally worse ESS per iteration. But if I’m getting marginally worse ESS per iteration while doing half as many gradient evals per iteration, isn’t that a huge gain, approaching 2x in the limit that the adapted step-size is immediately adjacent to an abrupt tree-depth transition?

Again, I don’t say any of this to argue, but rather to help you localize my own confusion. I really appreciate you taking the time to walk me through this.

The abrupt treedepth transition is also an abrupt transition in integration time.

Stepsize 0.026 terminates at depth 7 but 0.024 requires longer, so the depth-7 path must be shaped close to a half-circle. At least assuming the posterior is approximately Gaussian, this gives very low correlation between successive samples and ESS close to the total sample size. Sounds pretty good (but…)
On the other hand, the depth-8 path would be twice as long, a full circle. In that case Stan’s multinomial sampling gives somewhat anticorrelated samples and that can increase ESS even higher. It’s entirely possible for the effective sample size to be twice the nominal sample size, although only for some subset of parameters.
I don’t expect the change in ESS to be marginal, although I can’t say whether it undoes the benefit of half as many gradient evals.
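To see how anticorrelation can push ESS above the nominal sample size, here’s the standard AR(1)-style approximation (illustrative only; real NUTS chains won’t have exactly geometric autocorrelations):

```python
def ess_ar1(n, rho):
    # For a chain whose lag-k autocorrelation decays as rho**k,
    # ESS ~ n / (1 + 2 * sum_{k>=1} rho**k) = n * (1 - rho) / (1 + rho).
    return n * (1.0 - rho) / (1.0 + rho)

positively_corr = ess_ar1(1000, 0.3)   # below the nominal 1000
anticorrelated = ess_ar1(1000, -0.3)   # above the nominal 1000
```

With rho = -0.3 the effective sample size is nearly double the nominal one, which is the “ESS higher than total sample size” effect described above.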

1 Like