# Comparing Stan's adaptation phase to that of nuts-rs?

I’ve heard others report that the adaptation phase used in nuts-rs is providing signifiant speed increases over that used in Stan itself. Here’s the main statement I’ve found about how nuts-rs changes its adaptation:

This crate mostly follows the implementation of NUTS in Stan and PyMC, only tuning of mass matrix and step size differs: In a first window we sample using the identity as mass matrix and adapt the step size using the normal dual averaging algorithm. After discard_window draws we start computing a diagonal mass matrix using an exponentially decaying estimate for sqrt(sample_var / grad_var) . After 2 * discard_window draws we switch to the entimated mass mass_matrix and keep adapting it live until stop_tune_at .

I’ve not seen any generalizable demonstrations yet on performance or tradeoffs, but it seems useful to have a discussion here.

2 Likes

I believe @roualdes played around with his own implementation of the idea and did some benchmarking

3 Likes

If you stick to a diagonal mass matrix, that sounds exactly like Stan’s algorithm, which has three phases:

1. Use an identity mass matrix and adapt step size.
2. In increasing sized windows, sample and adapt the mass matrix.
3. Fine-tune step size.

In many problems we don’t want to go up to a dense mass matrix because it turns the leapfrog algorithm from \mathcal{O}(N) to \mathcal{O}(N^2) in N dimensions. It also requires a matrix solve at \mathcal{O}(N^3).

There are several ongoing projects around adaptation. We’re about to roll out Pathfinder in the next release of Stan, the goal of which is to find reasonable initializations (it’s also a black box variational inference algorithm) much faster than Stan—it beats Stan by one to two orders of magnitude in terms of getting to the bulk of the probability mass. See: Pathfinder: Parallel quasi-Newton variational inference

Matt Hoffman ( @matthewdhoffman ) and Pavel Sountsov have some really cool ideas for adaptation in their MEADS sampler both for step size in terms of eigenvalues and metrics in terms of complementary chains in an ensemble: https://proceedings.mlr.press/v151/hoffman22a/hoffman22a.pdf

Then there’s also the cross-chain adaptation work that Ben Bales (@bbbales2) did in his thesis, which he and Yi Zhang ( @yizhang ) were working on.

Edward Roualdes ( @roualdes ) is also working on adaptation for a version of generalized HMC (partial momentum refresh) with delayed rejection, which is complementary to the recent work of Radford Neal on non-reversible acceptance for GHMC: [2001.11950] Non-reversibly updating a uniform [0,1] value for Metropolis accept/reject decisions

5 Likes

Having looked at the paper now, it doesn’t match the informal description. Rather than initializing at unit metric, it uses a gradient-based variant estimator to start at a diagonal mass matrix. And it’s not using a sample-based estimator of covariance, like we use in Stan, but something called “Fisher divergence”, which as far as I can tell from the formula, is like Kullback-Leibler divergence, but with the difference in log densities squared. At least that’s how it looks from the first formula on page 2, but then the difference seems to be mediated by the metric (covariance) being fitted, which you see in the second unnumbered formula on page 2.

The discussion reminds me of what @bbbales2 tried to do in his thesis using Hessians instead of just the draws—I don’t think he ever got that to work, but maybe the trick’s using something simpler like Adrian Seyboldt is doing here. I’m curious about why a first-order method is used (gradients) to estimate variance rather than second-order methods, which are exact in the Gaussian case.

I think the plot on page 6 with empirical results has different models on the y axis (so I don’t know why there are lines connecting them instead of a bar or dot plot like we used in the Pathfinder paper). I’m guessing they’re from posteriordb, but the reference in the paper is just “??”. Those are all fairly low dimensional, so I’m interested in how this would work in something like 1000 dimensions where the cost gets much higher. To that end, I’d like to see something like the number of gradient evals required to get to a draw whose log density is contained in the central 99% posterior interval of log densities. That’s what we used for evaluation in Pathfinder as it sidesteps whether the algorithm’s default number of steps is right (e.g., 75 Phase I warmup steps in Stan).

The paper isn’t really a complete paper—section 3 is just sketchy notes and that’s where most of the action is. It’d be great to hear from Adrian Seyboldt about what he’s thinking about this.

3 Likes

The paper isn’t really a complete paper—section 3 is just sketchy notes and that’s where most of the action is.

I had to work on other things for a while and didn’t get to it yet unfortunately, I’m sorry I left that so incomplete. (Also, I’m really not good at writing papers, but if people are interested in this, I think I’ll find the motivation to finally get this over the finish line…)

I’m in the middle of a move this week, but I’d love to have a more detailed discussion about all this next week.

A couple quick points right now:

• Nuts-rs (and with it nutpie) uses a diagonal mass matrix, so it should not scale any worse than stan in high dimensions (I tried quite a few real world models, and results looked overall similar to the results from posteriordb in the paper). The plots there are ecdf plots, I chose those because I though they were easier to read than boxplots with several adaptation methods and a large number of models, but looks like I might have been wrong about that… :-)
• nutpie/nuts-rs changes both some details of the early mass matrix adaptation (the inital estimate based on the gradients, and some changes in how the windows work), and also changes what diagonal mass matrix we are trying to converge to. Stan tries to find diag(Cov(draws))), and nutpie tries to find sqrt(diag(Cov(draws)) / diag(Cov(grads))). I’m pretty sure at this point that the target of nutpie is generally superior (although not by that much). The early tuning changes also seem to help though, it takes much fewer gradient evaluations to find a decent mass matrix.
• The fisher divergence idea is more of a justification for the changed target of the mass matrix adaptation, and also allows us to generalize it to other families of mass matrices (diag plus low rank or full mass matrix). I experimented quite a lot with those, and it looks very promising to me, but I don’t have a decent implementation yet (there is quite a bit of overlap with @bbbales2’s work on this). I think it also should generalize to nonlinear transformations of the posterior pretty nicely. (I’d be really curious how a fisher divergence based pathfinder would work out by the way).
• I could be completely wrong on this point, but it does seem to me that the fisher divergence with correctly chosen norm might be a better target than the KL divergence in general when we try to transform parameter spaces for better sampling. Not sure I’m at a point yet where I can really explain why that might be so though…
2 Likes

Yep, that seems about right, though I didn’t do too many experiments.

Here’s how I’ve been thinking about the proposed estimator sqrt(diag(Cov(draws)) / diag(Cov(grads))). Please, correct me if any of this seems off.

1 / sqrt(diag(Cov(grads))) seems like regularizer of sorts. In the Gaussian case, it’s arguably a perfect regularizer.

For non-Gaussian targets and when Cov(grads) is non-zero and finite, is the proposed estimator a scalar multiple of sqrt(diag(Cov(draws)))? If so, the proposed estimator won’t change sampler mixing much as the stepsize adaptation will make up the difference. On the other hand, if the regularization stabilizes mass matrix adaptation, then this may stabilize stepsize adaptation. More stable adaptation could imply fewer warmup iterations.

Will it always stabilize stepsize adaptation? Is Cov(grads) guaranteed to exist? Cov(draws) certainly isn’t guaranteed to exist.

What if the proposed estimator is not a scalar multiple of sqrt(diag(Cov(draws)))? Then stepsize adaptation can’t make up the difference. The implications are less clear here for adaptation stability and sampler mixing.

@aseyboldt, when you can, will you

1. provide more specifics for the default values of your alternative warmup phase? and
2. provide the Stan programs corresponding to the models in Figure 2? If the Stan programs are from posteriordb, would you provide model names?
2 Likes

I did work as demonstrated in [1905.11916] Selecting the Metric in Hamiltonian Monte Carlo, but it would have required additional library dependency and a lot of work to get it in Stan, and he ended up doing other things.

There is overlap, but your approach is simpler to implement and thus would be also easier to include in Stan. There is also a connection to a control variate variance reduction approach that uses gradients. The reduced variability in the marginal variance estimates helps especially in high dimensions as the expected worst case (over parameters) mismatch between the estimated and actual scale increases with the number of dimensions.

3 Likes

@roualdes I’ve never really thought of 1/sqrt(diag(Cov(grads))) as a regularizer, not sure if that is helpful or not. It usually is not just a scalar multiple of sqrt(diag(Cov(draws))) though (unless we really have a diagonal Gaussian posterior), so the step size can’t just make up the difference. A big part of the lower number of tuning logp evals in nuptie is because stan often has a phase pretty early on where it builds very long trajectories with quite small step sizes, and burns a lot of logp evals at that point. This still happens with nuptie a bit, but way less. In addition nutpie tends to choose somewhat shorter trajectories even after tuning though.

I tend to think of the 1/sqrt(diag(Cov(grads))) along some of those lines:

1. I like to think of the mass matrix adaptation as a stochastic optimizer, where we try to learn a normalizing flow that transforms the posterior to something close to a standard normal distribution (only we don’t really care about the mean). nutpie’s mass matrix adaptation then just corresponds to minimizing the fisher divergence between N(0, 1) and the transformed posterior, where we compute the integral over the posterior instead of over N(0, 1) (as we would in ADVI, which I think is arguably the wrong way to do it). In the case of a diagonal mass matrix, the family of transformations is just an affine function for each parameter. In this view the extra 1/sqrt(diag(Cov(grads))) is something that pops out if we do that optimization analytically. This way of thinking about it would also be flexible enough to work with any normalizing flow, although often we probably can’t do the optimization analytically anymore.

2. Less formal, but maybe more intuitive is to look at the distribution of the gradients. Whenever we sample the posterior, we implicitly also sample the probability distribution of the gradients as some kind of dual of the original distribution. The distribution of gradients seems like quite an interesting object in its own right by the way, and can be quite useful when figuring out sampling issues (and if anyone knows of some literature about this, I’d be curious, I never really found a lot about it, but maybe I’m just using the wrong nomenclature to google this :-) ). We can then think of the mass matrix as something that makes the posterior “nice” in some way. We could say that a distribution is “nice” if all marginals have standard deviation one. This is what Stan’s mass matrix adaption is trying to achieve. But we could also say that a distribution is “nice” if its gradients have standard deviation one, and we can show that this would imply using 1/diag(Cov(grads)) as inverse mass matrix. (This actually also seems to work quite well in practice, I didn’t test it much though). Nutpie uses a compromise between those two, namely the geometric mean. We can also generalize this to full mass matrices, we just have to replace the geometric mean by something that works on full covariance matrices, for instance based on the affine-invariant metric on the manifold of symmetric positive matrices. (see for instance https://www.math.fsu.edu/~whuang2/pdf/ASPDM_techrep.pdf).

3. I guess a bit of a special case of the previous one: We can also just look at the case of a multivariate normal posterior, \text{draws} \sim N(\mu, \Sigma). Here we can ask what diagonal matrix best approximates \Sigma. We could for instance just choose D_\Sigma = \text{diag}(\Sigma), but we could also choose D_{\Sigma^{-1}} = \text{diag}(\Sigma^{-1})^{-1}, ie try to approximate the precision matrix using a diagonal matrix. For a normal posterior we can show that \text{grads} \sim N(0, \Sigma^{-1}), so D_{\Sigma^{-1}} corresponds to the diagonal of the gradient covariance matrix. Given a metric on the space of positive definite matrices we can just look for a diagonal matrix D that minimizes the distance to \Sigma. With the affine-invariant metric this is exactly the geometric mean of the two diagonal matrices. This also happens to minimize \lVert \log(\text{eigvals}(D, \Sigma))\rVert, which is closely related to the condition number of the posterior with the diagonal mass matrix.

• provide more specifics for the default values of your alternative warmup phase? and

Coming soon…

• provide the Stan programs corresponding to the models in Figure 2? If the Stan programs are from posteriordb, would you provide model names?

The models are from posteriordb, and I just used the vast majority of the models in there. I filtered out models where no sampler converges (ie when there always were divergences or if the effective sample size was too small with all samplers), and I excluded a handful of models because they just took too long, and I needed my computer for something else as well (stan was usually way slower in terms of wall time though, so hopefully that didn’t bias the results much). Also, models using sundials weren’t working with nutpie when I did those benchmarks, so they are excluded as well.

I don’t think either exclusion changed to overall picture much. I’ll rerun them though, also with an up-to-date version of nutpie, there still were a few minor changes since then.

Will it always stabilize stepsize adaptation? Is Cov(grads) guaranteed to exist? Cov(draws) certainly isn’t guaranteed to exist.

It is perfectly possible for \text{Cov}(\text{grads}) not to exist, I really doubt that any non-Riemannian HMC sampler could possibly converge in that case though, no matter which mass matrix we choose…

5 Likes

I ran the models from posteriordb again, here a summary of the results:

The code to run the benchmarks is here, the notebook that makes the plots is here.

I used all posteriors in posteriordb, except the following (either because something failed, or because they took too long to sample. I will probably rerun the slower ones soon as well):

too_slow = {
'ecdc0401-covid19imperial_v2',
'ecdc0401-covid19imperial_v3',
'ecdc0501-covid19imperial_v2',
'ecdc0501-covid19imperial_v3',
'election88-election88_full',
'hmm_gaussian_simulated-hmm_gaussian',
'iohmm_reg_simulated-iohmm_reg',
'mnist-nn_rbm1bJ100',
'mnist_100-nn_rbm1bJ10',
'prideprejudice_chapter-ldaK5',
'prideprejudice_paragraph-ldaK5',
'prostate-logistic_regression_rhs',
}
failed = {
'hudson_lynx_hare-lotka_volterra',
'science_irt-grsm_latent_reg_irt',
'sir-sir'
}


This leaves us 132 posteriors to sample.

All of them were run using stan and nutpie, each with 300, 400, 600 and 1000 warmup steps. All combinations with 10 chains and 1000 draws after warmup, on a Threadripper 1920X with 12 physical cores.

## Comparing the final mass matrix

First, we can have a look at how well the final mass matrix is working, to mostly exclude changes to the adaptation schema of nutpie. So for the most part this should just show the difference between using diag(cov(draws)) and sqrt(diag(cov(draws)) / diag(cov(grads))) as adaptation target.

I’m using ecdf plots to visualize the distribution of different parameters for the sampler settings in the following.

The absolute effective sample size tends to be a bit smaller with nutpie:

So for 1000 samples we tend to get fewer effective samples from nutpie than we would get from stan.
Curiously this mostly happens for models with effective sample sizes a bit above 1000, below and above there doesn’t seem to be a big difference. No clue what that might be about…

However, stan tends to use more leapfrog steps for each draw, so each draw in stan tends to be more expensive to compute. If we correct for this by looking at the number of effective samples per gradient evaluation (or per time, this looks almost identical), we get an improvement:

If we normalize to stan with 1000 warmup steps by dividing by the effective sample size per gradient evaluation in stan, we get the following:

I think a clear and meaningful improvement, but certainly not a game-changer either.

## Comparison including the warmup phase

Nutpie consistently needs fewer gradient evaluations during warmup:

We can again normalize using stan with 1000 warmup steps:

This is the effective sample size per total number of gradient evaluations (so the number of gradient evaluations during warmup plus the number of gradient evaluations during sampling):

If we normalize by stan again:

Nutpie with 400 warmup steps gives us a geometric mean speedup of about 2x compared to stan with 1000 warmup steps (and usually I think those 400 warmup steps are good enough).

The number of effective samples per total sampling time looks quite similar:

It might also be interesting to have a closer look at those models, where there were big differences between stan and nutpie:

# Stan with 1000 tuning steps was doing better
ovarian-logistic_regression_rhs                 -4.856620
seeds_data-seeds_stanified_model                -4.138525
one_comp_mm_elim_abs-one_comp_mm_elim_abs       -3.272695

# Nutpie with 1000 tuning steps was doing better
loss_curves-losscurve_sislob                            2.236644
wells_data-wells_dist                                   2.291067
sblri-blr                                               3.351181
GLMM_Poisson_data-GLMM_Poisson_model                    3.662405
mcycle_gp-accel_gp                                      6.633426


From a quick glance it looks like those were cases where one chain in either sampler is doing something strange.
In general I’m getting the impression that the default of sampling 4 chains might be a bit low, and might hide sampling issues in quite a few models…

6 Likes

Thanks for sharing the results! These are very useful!

As you were obtaining 1e4 draws, ESS close to 1e4 means an easy posterior (close to normal and probably also low dimensional) and ESS much less than 1e3 means a difficult posterior (far from normal with non-linear dependencies / funnels / multimodality / etc and probably also higher dimensional). In case of an easy posterior ESS is not sensitive to the scaling, and in case of difficult posterior the scaling alone is not sufficient to make the posterior easier to sample.

Even for the easy posteriors, having a noisier estimate of the scale means that due to the mismatch in some directions the stepsize adaptation is making the stepsize slightly smaller leading to longer chains (and the dynamic integration ime of NUTS keeps the ESS high with the additional computation cost).

Ovarian has multimodal posterior and with too short adaptation is likely to have noisy scale estimates.

Some of these have low-dimensional and/or close to normal posterior. mcycle_gp-accel_gp has far from normal and high-dimensional posterior, so I’m surprised to see it in this list.

1 Like

As you were obtaining 1e4 draws, ESS close to 1e4 means an easy posterior (close to normal and probably also low dimensional) and ESS much less than 1e3 means a difficult posterior (far from normal with non-linear dependencies / funnels / multimodality / etc and probably also higher dimensional). In case of an easy posterior ESS is not sensitive to the scaling, and in case of difficult posterior the scaling alone is not sufficient to make the posterior easier to sample.

That makes sense…

Had another look at a few of the outliers:

mcycle_gp-accel_gp: There is one chain in stan with only divergences (and a few divergences in all other chains with both samplers). If I drop the completely broken chain I get 0.00041959 ess/posterior_grad_eval in nutpie and 0.00035123 in stan, so it is no longer an outlier.

Ovarian: Here one of the nutpie chains is doing something strange:

If I drop that chain stan still does better in terms of ess/posterior_grad_evals though, and it also looks like it’s switching between the modes a bit more often:

stan posterior of lambda.1491

nutpie

Does someone know if that posterior is itself multimodal, or if only the marginal is? Looks like there’s a horseshoe prior in the model?

seeds_data-seeds_stanified_model: There is again a chain in nutpie that does something strange. Another one has two sections where it looks pretty stuck, but doesn’t report any divergences. If I drop those chains I get somewhat higher ess/posterior_grad_eval with nutpie. The traceplot of the sigma parameter in general doesn’t look too nice, not sure if that’s a model or tuning issue.

one_comp_mm_elim_abs-one_comp_mm_elim_abs: Quite a few divergences with both samplers in all chains. I think this one is just pretty clearly a model problem:

Posterior of V_m in stan:

nutpie:

I don’t think anyone else than I know this posterior well. I estimate there are about 2^9 modes What's the highest dimensional model Stan can fit using NUTS? - #5 by avehtari

Your experiments show how difficult it is to make these comparisons especially when trying really short adapatations

I just found an unfortunate complication for the comparison of the effective sample size.
It seems that nutpie and stan differ systematically in how they choose the step size.

Both should be using a target acceptance rate (or adapt_delta) of 0.8, but the actual average acceptance rate (ie trace.sample_stats.acceptance_rate.mean("draw")) looks like this for 1000 warmup steps:

@Bob_Carpenter or @avehtari: Any idea why stan would have a systematically higher than 0.8 acceptance rate?

2 Likes

It looks to me like the final step size adaptation window in stan is not long enough to overcome the bias build into the dual average method, when the step size adaptation algorithm is restarted. For instance for the “nes1972-nes” model, the warmup trace of the log10 of the step size looks like the gray lines in this plot:

Unfortunately the current best-estimate for the step size (called step_size_bar in the nutpie trace) is not saved in the stan trace, but we do have access to the value at the end of warmup because it is used as the step size later. That value is shown in blue.

The step size value that probably would lead to an acceptance rate of 0.8 should be somewhere in the middle of the narrow part of the gray band just before the last restart, but the step size that is used is in the middle of the wider (and downward-biased) band after the last restart.

This does not happen the same way in nutpie, there we don’t restart the step size adaptation in the last window. The step size and step_size_bar looks like this for the same model:

(The fact that nutpie uses a step size near 10^(-0.5), and stan uses something like 10^(-1) is just because they use different mass matrices, and shouldn’t really play a role here).

There is still a slight bias in the step sizes of nutpie toward smaller values (and correspondingly higher acceptance rates), but not nearly as much as for stan.

The effect of this seems to be that stan with adapt_delta=0.8 chooses steps sizes somewhat like nutpie would with a target of 0.9.

I’m not really sure what this means for the comparison of effecitve_sample_size/posterior_grad_eval. Unfortunately I would expect a change in the same direction as we overall see for nutpie, so this makes it hard to attribute the changes to the different mass matrix.

2 Likes

The dual averaging behavior as it interacts with the term buffer (among other things) has been discussed in some detail here Issue with dual averaging. There’s a lot of good stuff on that thread; the term buffer issue is discussed specifically beginning at post #35.

Additionally, Stan targets a conservative proxy for the acceptance statistic that should lead to negative bias in step sizes and positive bias in actual acceptance rates:

2 Likes

Thank you! Got some reading to do now :-)

1 Like

@jsocolar already pointed to the relevant old thread. That discussion was one of the many motivations to start building posteriordb. Your results using posteriordb are useful.

I think that is sensible, but you can see other arguments in that thread.

I’m still quite certain that your lower variability estimate is helping, especially in case of higher dimensional posteriors.

3 Likes

Additional discussion in the PR https://github.com/stan-dev/stan/pull/3027

4 Likes

I’m still quite certain that your lower variability estimate is helping, especially in case of higher dimensional posteriors.

I don’t mean to say that I think this implies that the different mass matrix isn’t an improvement. I think just alone from the analytical case of a multivariate normal alone that should at least be somewhat convincing (still didn’t write this fully down properly though…). I don’t know how important the lower variance is compared to the effect of the mass matrix after full adaptation. For MVN it works clearly better even if there is no estimation error for either method.

If we look at bfmi as a measure of mass matrix quality we also see an improvement, and that should I think not depend too much on the whole step size adaptation question. (just the subset of posteriordb that happens to be finished in my current run):

And difference to stan-1000

It is a bit annoying that it is so tricky to benchmark this properly. I guess I can’t really blame anyone but myself for this, by changing so many things at the same time I just made it very hard to pinpoint the exact effect of each individual change. Did make it a lot more fun to work on though. Implementing something from scratch gives quite a sense of freedom. :-)

But about the step size adaptation: nutpie is currently targeting the average metropolis acceptance probability, same as stan some years ago before Feature/2789 improved stepsize adapt target by betanalpha · Pull Request #2836 · stan-dev/stan · GitHub. Somehow I completely missed that there even was a change to this (even though pymc did the same thing). I think I’ll implement that for nutpie as well, should be pretty straight forward.

But this means that stan and nutpie target different values in their step size adaptation, and both are not fully successful in achieving it, stan apparently a bit less so. This does make comparisons tricky. In order to isolate the mass matrix target maybe it would be the easiest to either implement it for stan, or implement the stan mass matrix in nutpie, and just work around the other differences that way…

3 Likes

Just a note: that pull that you linked above was merged and then reverted due to complicated circumstances. This change is not currently a part of Stan, but is being tracked here Improve step size adaptation target · Issue #3105 · stan-dev/stan · GitHub

3 Likes