Request for Volunteers to Test Adaptation Tweak

Great, thanks, @bbbales2! This is pretty much consistent with my expectations. Note that to avoid the ESS/time variability it’s often easier to compare ESS/n_leapfrog – because the autodiff is fixed, the only algorithmic difference will be in the number of leapfrog/gradient calls and not in the cost of each call.
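
For example, here’s a minimal sketch of that metric (the function name and numbers are illustrative, not anything from the Stan codebase):

```cpp
#include <iostream>
#include <numeric>
#include <vector>

// Effective samples per gradient evaluation: total ESS divided by the total
// number of leapfrog steps, since each leapfrog step costs exactly one
// gradient evaluation.
double ess_per_gradient(double total_ess, const std::vector<int>& n_leapfrog) {
  long total_gradients
      = std::accumulate(n_leapfrog.begin(), n_leapfrog.end(), 0L);
  return total_ess / static_cast<double>(total_gradients);
}

int main() {
  // Hypothetical n_leapfrog__ column from a handful of iterations.
  std::vector<int> n_leapfrog = {7, 7, 3, 7, 15};
  std::cout << ess_per_gradient(12.5, n_leapfrog) << "\n";  // ~0.32
}
```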

Additionally I avoid the older examples like the BUGS models and the radon model, as the priors and parameterizations are terrible. Similarly I don’t consider any model that yields divergences in performance comparisons – once it’s divergent the performance is irrelevant.

When building up empirical comparisons I instead focus on models that capture single qualitative features of interest – high dimension, high linear correlations, heavy tails, etc – as they make it easier to identify which features, if any, might cause problems.

For this tweak my comparisons were

10 Dimensional IID Normal

Nominal:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    8.9e-01  1.5e-03  1.1e-01  0.67   9.2e-01   1.0
stepsize__       6.8e-01  1.7e-14  1.2e-14  0.68   6.8e-01  0.68
treedepth__      2.8e+00  5.6e-03  3.7e-01   2.0   3.0e+00   3.0
n_leapfrog__     6.6e+00  1.8e-02  1.2e+00   3.0   7.0e+00   7.0
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         1.0e+01  7.2e-02  3.2e+00   5.4   9.7e+00    16

99700 total effective samples
2.848 effective samples per gradient evaluation

Adapt Tweak:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    8.7e-01  1.6e-03  1.3e-01  0.61   9.1e-01   1.0
stepsize__       8.2e-01  5.7e-15  4.0e-15  0.82   8.2e-01  0.82
treedepth__      3.9e+00  2.6e-02  1.7e+00   2.0   4.0e+00   6.0
n_leapfrog__     3.5e+01  6.4e-01  4.0e+01   3.0   2.3e+01   127
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         1.0e+01  6.8e-02  3.2e+00   5.5   9.7e+00    16

51700 total effective samples (48% reduction)
1.477 effective samples per gradient evaluation (48% reduction)

100 Dimensional IID Normal

Nominal:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    8.3e-01  2.0e-03  1.5e-01  0.56   8.6e-01   1.0
stepsize__       4.8e-01  1.3e-14  9.4e-15  0.48   4.8e-01  0.48
treedepth__      3.0e+00      nan  5.1e-14   3.0   3.0e+00   3.0
n_leapfrog__     7.0e+00      nan  9.7e-14   7.0   7.0e+00   7.0
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         1.0e+02  2.5e-01  1.0e+01    83   1.0e+02   117

697300 total effective samples
19.922 effective samples per gradient evaluation

Adapt Tweak:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    8.4e-01  1.8e-03  1.3e-01  0.60   8.7e-01   1.0
stepsize__       5.4e-01  5.3e-15  3.8e-15  0.54   5.4e-01  0.54
treedepth__      3.0e+00      nan  5.1e-14   3.0   3.0e+00   3.0
n_leapfrog__     7.0e+00      nan  9.7e-14   7.0   7.0e+00   7.0
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         1.0e+02  2.4e-01  9.9e+00    84   1.0e+02   117

932700 total effective samples (34% improvement)
26.648 effective samples per gradient evaluation (34% improvement)

100 Dimensional IID Student t (5 degrees of freedom)

Nominal:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    8.8e-01  1.5e-03  1.1e-01  0.67   9.1e-01  1.00
stepsize__       3.3e-01  9.4e-16  6.7e-16  0.33   3.3e-01  0.33
treedepth__      4.0e+00  2.0e-04  1.4e-02   4.0   4.0e+00   4.0
n_leapfrog__     1.5e+01  1.7e-02  1.1e+00    15   1.5e+01    15
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         1.2e+02  3.2e-01  1.2e+01    98   1.2e+02   136

812700 total effective samples
23.220 effective samples per gradient evaluation

Adapt Tweak:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    8.8e-01  1.7e-03  1.2e-01  0.65   9.2e-01  1.00
stepsize__       3.6e-01  9.6e-15  6.8e-15  0.36   3.6e-01  0.36
treedepth__      4.1e+00  6.0e-03  3.8e-01   4.0   4.0e+00   5.0
n_leapfrog__     2.0e+01  2.5e-01  1.5e+01    15   1.5e+01    47
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         1.2e+02  3.2e-01  1.2e+01    97   1.2e+02   135

815300 total effective samples (0.3% improvement)
23.294 effective samples per gradient evaluation (0.3% improvement)

50 Dimensional Multi Normal, rho=0.25

Nominal:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    8.4e-01  1.9e-03  1.4e-01  0.57   8.7e-01   1.0
stepsize__       5.1e-01  2.8e-15  2.0e-15  0.51   5.1e-01  0.51
treedepth__      3.0e+00  2.0e-04  1.4e-02   3.0   3.0e+00   3.0
n_leapfrog__     7.0e+00  3.6e-03  2.5e-01   7.0   7.0e+00   7.0
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         5.0e+01  1.7e-01  7.2e+00    39   5.0e+01    62

364900 total effective samples
10.425 effective samples per gradient evaluation

Adapt Tweak:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    8.5e-01  1.8e-03  1.3e-01  0.60   8.9e-01   1.0
stepsize__       5.5e-01  6.9e-15  4.9e-15  0.55   5.5e-01  0.55
treedepth__      3.0e+00      nan  5.1e-14   3.0   3.0e+00   3.0
n_leapfrog__     7.0e+00      nan  9.7e-14   7.0   7.0e+00   7.0
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         5.0e+01  1.5e-01  7.1e+00    39   5.0e+01    62

437100 total effective samples (20% improvement)
12.488 effective samples per gradient evaluation (20% improvement)

10 Dimensional Multi Normal, rho=0.80

Nominal:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    9.0e-01  1.2e-03  9.8e-02  0.70   9.3e-01   1.0
stepsize__       3.1e-01  8.6e-15  6.1e-15  0.31   3.1e-01  0.31
treedepth__      3.5e+00  1.0e-02  6.9e-01   3.0   4.0e+00   5.0
n_leapfrog__     1.7e+01  1.4e-01  9.7e+00   7.0   1.5e+01    31
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         9.9e+00  7.3e-02  3.1e+00   5.4   9.6e+00    15

20100 total effective samples
0.574 effective samples per gradient evaluation

Adapt Tweak:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    9.1e-01  1.1e-03  9.0e-02  0.73   9.4e-01   1.0
stepsize__       2.9e-01  7.9e-17  5.6e-17  0.29   2.9e-01  0.29
treedepth__      3.5e+00  1.1e-02  7.0e-01   3.0   3.0e+00   5.0
n_leapfrog__     1.7e+01  1.5e-01  9.9e+00   7.0   1.5e+01    31
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         9.9e+00  7.4e-02  3.1e+00   5.3   9.6e+00    15

19200 total effective samples (4% reduction)
0.548 effective samples per gradient evaluation (4% reduction)

50 Dimensional Multi Normal, rho=0.80

Nominal:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    8.6e-01  1.6e-03  1.2e-01  0.63   8.9e-01  1.00
stepsize__       2.3e-01  5.4e-15  3.8e-15  0.23   2.3e-01  0.23
treedepth__      4.7e+00  1.3e-02  8.0e-01   3.0   5.0e+00   6.0
n_leapfrog__     4.0e+01  3.7e-01  2.4e+01    15   3.1e+01    79
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         5.0e+01  1.7e-01  7.0e+00    39   5.0e+01    62

182400 total effective samples
5.211 effective samples per gradient evaluation

Adapt Tweak:

                    Mean     MCSE   StdDev    5%       50%   95%
accept_stat__    8.4e-01  1.9e-03  1.4e-01  0.56   8.7e-01  1.00
stepsize__       2.8e-01  3.1e-16  2.2e-16  0.28   2.8e-01  0.28
treedepth__      4.7e+00  1.1e-02  7.0e-01   4.0   5.0e+00   6.0
n_leapfrog__     3.7e+01  3.9e-01  2.5e+01    15   3.1e+01    79
divergent__      0.0e+00      nan  0.0e+00  0.00   0.0e+00  0.00
energy__         5.0e+01  1.6e-01  6.9e+00    39   5.0e+01    62

229000 total effective samples (26% improvement)
6.542 effective samples per gradient evaluation (26% improvement)

Hopefully the pattern is clear and consistent with what you see. In high dimensions the tweak improves performance unless the tails are really heavy, in which case the performance is the same. In low dimensions the tweak can reduce performance because the tree depth distribution expands and the sampler wastes time on overly long trajectories. This is due to the higher step size causing the generalized No-U-Turn criterion to be evaluated at too crude a resolution, with the numerical trajectories continuing much too far before finally being terminated. At the same time this performance dip occurs in low dimensional models that already run extremely quickly.


The requisite reading for what’s going on is https://arxiv.org/abs/1411.6669. There we showed that in the original Hybrid Monte Carlo scheme, where the numerical Hamiltonian trajectory is used as a Metropolis proposal, the optimal step size is the one that yields an average Metropolis acceptance probability of 0.8.

Neither NUTS nor the current implementation of Hamiltonian Monte Carlo used in Stan, however, is a Hybrid Monte Carlo scheme. Instead they sample from an entire trajectory whose length is grown dynamically around the initial point. Because of that the above analysis isn’t immediately applicable.

In order to apply the above step size optimization analysis one has to come up with a proxy Metropolis acceptance probability (incidentally, this is why it’s called “accept_stat” and not “accept_prob” in the diagnostics). The current version of Stan pretends that each point in a given numerical trajectory could have been a Metropolis proposal and averages the corresponding Metropolis acceptance probabilities uniformly.

We do not, however, sample states within a numerical trajectory uniformly in the algorithm. Instead we sample them based on their weights, e^{-H(q, p)}. By weighting the fake Metropolis acceptance probabilities uniformly we overemphasize those states that are unlikely to be chosen, and end up with a conservative proxy acceptance statistic.

This new branch considers each state as a proxy proposal but weights the corresponding acceptance probabilities by the state weight, e^{-H(q, p)}. In this way those states that are likely to be chosen have a greater influence on the proxy acceptance statistic, and any state with vanishing weight is not considered at all. This will always lead to a higher proxy acceptance stat for a given step size, which will cause the adaptation to push toward higher step sizes and, for a fixed integration time, cheaper numerical trajectories.
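
To make the difference concrete, here’s a minimal standalone sketch of the two proxy statistics. This is an illustration of the idea, not the actual branch code (which works in log space and, as discussed further down, also excludes the initial state from the weighted average):

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// dH[i] = H(q_i, p_i) - H(q_0, p_0): the energy error of state i relative
// to the initial state.  The proxy Metropolis acceptance probability of
// state i is a_i = min(1, e^{-dH_i}) and its multinomial sampling weight
// is w_i = e^{-dH_i}.

double uniform_accept_stat(const std::vector<double>& dH) {
  // Current scheme: average the a_i uniformly over all states.
  double sum = 0;
  for (double d : dH)
    sum += std::min(1.0, std::exp(-d));
  return sum / dH.size();
}

double weighted_accept_stat(const std::vector<double>& dH) {
  // Proposed tweak: weight each a_i by the probability w_i that the state
  // would actually be selected from the trajectory.
  double num = 0, den = 0;
  for (double d : dH) {
    double w = std::exp(-d);
    num += w * std::min(1.0, std::exp(-d));
    den += w;
  }
  return num / den;
}

int main() {
  // One state with a large energy error has vanishing weight: it drags the
  // uniform average down but barely moves the weighted average.
  std::vector<double> dH = {0.0, 0.1, 0.3, 8.0};
  std::cout << uniform_accept_stat(dH) << "\n";   // ~0.66
  std::cout << weighted_accept_stat(dH) << "\n";  // ~0.89
}
```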

The effect of this change should be most evident in higher dimensions, where the variation in Hamiltonian values along a numerical trajectory increases. At the same time the optimality analysis is most accurate in higher dimensions (in the paper we utilize an expansion in step sizes, but the coefficients of that expansion implicitly depend on dimension). This is the same situation as the 0.234 analysis of Roberts et al., which is a purely asymptotic result and isn’t optimal for lower dimensional targets.
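
For reference, the standard scaling heuristic behind this (the classic product-target analysis, e.g. Beskos et al., stated here from memory rather than taken from the paper above): for a target with d independent, identically distributed components and leapfrog step size ε, the energy error over a fixed integration time behaves like

E[ΔH] = O(d ε^4),   SD[ΔH] = O(d^{1/2} ε^2),

so holding the average acceptance fixed forces ε ∝ d^{-1/4}, and the step size expansions underlying the optimality analysis become more and more accurate as d grows.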

That said, the real issue here is that forcing the step size to larger values makes the generalized No-U-Turn criterion less accurate. Combined with the multiplicative trajectory expansion this can result in a heavy tail of longer numerical trajectories, especially in lower dimensions. Once again this is confirmed in the empirical results.

Ultimately the theory, which is better posed in higher dimensions, strongly motivates the reweighted acceptance statistic, and this is backed up by the empirical results. The cost is lower performance on low dimensional models, but these models run so quickly that the added cost isn’t particularly relevant.

Tomorrow I’ll update all of the tests and submit an issue/PR. Thanks again for the help with the empirical tests @jroon and @bbbales2!


Oh yeah, that would be a better way to do it.

I’m looking at this, by the way. The plan was to multiplicatively expand up to some treedepth, and then additively expand with trees of that treedepth (until some maximum number of leapfrog steps). Aki said you’d mentioned something like this?

Fair, but this model did work pretty well with the last sampler.

I had a look at the outputs of one chain that ran fast and one that didn’t. The timestep adaptation doesn’t look like it’s going crazy (not crashing to zero or anything). The outputs of the two chains mix fine (according to Rhat), and the mass matrices are very close (all the individual entries are within a factor of two of each other).

I think this is worth investigating.

The divergence plots are there to show that this didn’t appear to be increasing the number of them.

This would handle them differently, right? Trajectories marked divergent are ones where there are leapfrog steps with huge energy errors? In the original method, these points would have a Metropolis accept probability close to zero and would cause the timestep to get smaller.

And what’d happen with this method? The leapfrog steps with huge energy error would just influence the timestep less? This makes me think that if there was potential for divergences before, it’d be a little bit higher now. These tests weren’t really enough to probe this, I guess.


The problem is that waiting to additively expand until later doesn’t offer much computational advantage. You still have to save every state in the trajectory in order to accept those later additive expansions, and you still have to check the new state against every existing state. Doing a multiplicative expansion first just saves some relatively inexpensive calculations, not the really costly ones.

Yeah, really all of the old BUGS models need investigating!

If the inverse metrics and step sizes are nearly the same and there aren’t divergences then I don’t know what would be causing huge time differences.

Again the biggest issue with touching the step size concerns the dynamic trajectory expansion. First, by decreasing the step size you can hit a threshold where you need a little bit more integration time but can achieve it only by performing an entire trajectory doubling (the max cost difference here is then only about a factor of 2). By increasing the step size you can then potentially get significant speedups if you were just past the threshold before, but at the same time larger step sizes make the termination criterion less accurate and hence you might accidentally expand past the point you care about. But again the speed difference should be less than a factor of 2.
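
As a quick illustration of that factor – just arithmetic, using the fact visible in the tables above that a depth-d trajectory contains 2^d - 1 leapfrog steps:

```cpp
#include <iostream>

int main() {
  // A trajectory of tree depth d contains 2^d - 1 leapfrog steps.  If the
  // required integration time lands just past what depth d can reach, the
  // sampler has to pay for a full extra doubling, so the worst-case cost
  // ratio is (2^{d+1} - 1) / (2^d - 1), which approaches 2 from above.
  for (int d = 1; d <= 6; ++d) {
    int steps = (1 << d) - 1;
    int overshoot = (1 << (d + 1)) - 1;
    std::cout << "depth " << d << ": " << steps << " -> " << overshoot
              << " steps, ratio " << static_cast<double>(overshoot) / steps
              << "\n";
  }
}
```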

Right – the divergent end of the trajectory would be downweighted in computing the accept stat, leading to higher accept stats and the adaptation pushing to higher step sizes. But the trajectory would still be marked as divergent, which is the most important thing.

The resolution, however, would still be the same. If the neighborhood of high curvature is sufficiently small then you might be able to get away with a higher adapt_delta but in most cases you’ll need to improve your priors and parameterizations. Controversial stance – if you need to run with adapt_delta > 0.9 then your model isn’t as identified as you want it to be.

Ultimately the number of divergences isn’t as important as whether you see divergences or not.

Oh, well, I still gotta learn how all this treebuilding stuff works I s’pose. Otherwise it’ll always be handwavy to me.

Same. I thought it would be obvious once I popped open the save_warmup=1 csv, but it wasn’t.

The code changes are going to be simple enough to review, but can you have a look into this before I get to the Github review? Since this is an algorithm change we’ll have to talk to everyone before we incorporate it. Best that the case be as clear as possible.

If this is an unrelated, pre-existing issue that’s just been uncovered by setting the stepsize higher (which seems like what it is), that’s cool, we can just document it and set it aside. But I’d like to know how to trigger it.

I’ll run some tests in a couple days. I suspect we can trigger this just by using fixed adaptations and larger and larger stepsizes.

Whoops, realized I already had the data for this. Check out this plot (edit: these are all numbers for the radon model):

Time to run some more tests!

I’m about to board a flight to London and will respond proper tomorrow but if you plot all the combinatorics of {ess, ess per time, n_leapfrog, stepsize} then the underlying problem should be pretty evident.

Yup, got it reproduced in regular cmdstan and rstan by just lowering adapt_delta so the timestep goes higher, and also reproduced it outside of Stan. So this is definitely a problem that is outside the scope of this pull.


Hi Michael,

I tried out the test on a slightly more complicated example, but in this model it does not seem to make much of a difference. Below are the results and the Julia program I used.

RequestMichael.txt (8.4 KB)

lba.jl.txt (8.6 KB)

Rob

Thanks. While the overall effect is small, the step size is behaving as expected. Perhaps the biggest issue in this particular example is that the step size adaptation isn’t really reaching a good equilibrium in either case – the average acceptance stat is much bigger than the expected 0.8. Unless you have a non-default configuration of the sampler?

I tried to document everything with both text and figures in the conceptual paper but I’m happy to sit down and walk you through it when you get to town.

Definitely the default configuration:

julia> stanmodel.cmds[1]

/var/folders/fb/m164mwfn3v1btnrpmmyzpvqh0000gn/T/jl_tZ63kH/LBA sample num_samples=1000 num_warmup=1000 save_warmup=0 thin=1 adapt engaged=1 gamma=0.05 delta=0.8 kappa=0.75 t0=10.0 init_buffer=75 term_buffer=50 window=25 algorithm=hmc engine=nuts max_depth=10 metric=diag_e stepsize=1.0 stepsize_jitter=1.0 random seed=-1 init=2 id=1 data file=/var/folders/fb/m164mwfn3v1btnrpmmyzpvqh0000gn/T/jl_tZ63kH/LBA_data_1.R output file=/var/folders/fb/m164mwfn3v1btnrpmmyzpvqh0000gn/T/jl_tZ63kH/LBA_chain_1.csv refresh=100

Interesting, thanks for the hint.

Odd that the step size adaptation isn’t equilibrating well – I wonder if it’s the truncation in the normal priors inducing some nasty geometry. In any case, for a model with so few parameters, having no effect is actually a pretty good success!

Were you up for doing a review or should I look for someone else? Thanks!

The one last thing I wanted to get on this pull request was some idea of how much more divergent a divergent model might be.

I think the test that would make me happy would just be to run 8-schools 100 times with and without this and count the divergences. I can get to it in a few days if you don’t want to do it.

Looking at the code real quick, this is the line that doesn’t make sense to me yet:

log_sum_weight = math::log_diff_exp(log_sum_weight, 0);

(https://github.com/stan-dev/stan/pull/2790/files#diff-4eb84e42fa085ef0ffa363cee96699e9R166)

What’s happening here?

I’m teaching through the week and much of next week, but might be able to sneak something in next week if you don’t get to it in time.

The new acceptance statistic doesn’t consider transitions to the initial point, so the weight of 1 = exp(0) has to be removed from the running log_sum_weight aggregator.
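
In other words (a standalone sketch of that bookkeeping, mirroring what stan::math::log_diff_exp computes; the values are illustrative):

```cpp
#include <cmath>
#include <iostream>

// log(exp(a) - exp(b)) computed stably for a > b, as in the Stan math library.
double log_diff_exp(double a, double b) {
  return a + std::log1p(-std::exp(b - a));
}

int main() {
  // log_sum_weight accumulates log(sum_i exp(-dH_i)) over the trajectory,
  // and the initial state contributes exp(-dH_0) = exp(0) = 1 to that sum.
  double log_sum_weight = std::log(1.0 + 0.9 + 0.7);  // initial state included
  // Remove the initial state's weight before computing the new accept stat.
  log_sum_weight = log_diff_exp(log_sum_weight, 0.0);
  std::cout << std::exp(log_sum_weight) << "\n";  // 1.6 = 0.9 + 0.7
}
```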

Ooooh, okay, cool

This is a bit OT and I understand it might not be relevant for our adaptation, but maybe it could be explained why we don’t need to do this:

Why do we do our adaptation in “one step” and not in a cyclical manner, as is usually done with DNNs?

Like the first few cycles get into the correct region and the last cycle gets the correct values:

3x = Adaptation cycle -> Adaptation cycle -> Adaptation cycle

It is done in cycles, kinda. The metric and timestep are done a bit differently. The cmdstan manual has the best graphic of how the metric adaptation is done. It’s Figure 9.3 in the 2.17.1 manual at least.

By default there are ~5 windows of increasing size. The HMC draws from the nth stage are used to build the metric in the (n + 1)th stage.

Timestep is being adapted continuously (from draw to draw). It’s all sorts of funky games, which is why we gotta throw the warmup stuff away in the end.
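
For instance, here’s a minimal sketch of how those windows get laid out under the default settings (init_buffer=75, window=25, term_buffer=50, as in the command line above). This reproduces the schedule described in the manual rather than quoting the actual Stan code:

```cpp
#include <iostream>

int main() {
  const int num_warmup = 1000;
  const int init_buffer = 75;  // initial fast interval: stepsize only
  const int term_buffer = 50;  // terminal fast interval: stepsize only
  int window = 25;             // first slow (metric) window; doubles each time

  int start = init_buffer;
  const int end = num_warmup - term_buffer;
  while (start < end) {
    // If the window after this one wouldn't fit before the terminal buffer,
    // stretch the current window to absorb the remaining iterations.
    if (start + 3 * window > end)
      window = end - start;
    std::cout << "metric window: iterations [" << start << ", "
              << start + window << ")\n";
    start += window;
    window *= 2;
  }
  // Prints windows of size 25, 50, 100, 200, 500: draws from window n are
  // used to estimate the metric for window n + 1.
}
```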


But if you have adaptation ideas, find a model and try em out. Hit up Aki on this stuff too. He’s all up to date.