New algorithm: Gradient-based Adaptive Markov Chain Monte Carlo

Red-Portal · June 6, 2020, 1:44pm

Hi everyone,
I know that the Stan forum is spammed by every MCMC algorithms published ever.
But I think this one, “Gradient-based Adaptive Markov Chain Monte Carlo” is quite interesting.
It was published last year at NIPS.
The authors propose a new way to adapt full-rank Gaussian proposals for MALA using variational inference.
A comparison against NUTS is also provided, showing significant improvements in the ESS/min (About x2, x3).
Opinions?

ldeschamps · June 6, 2020, 3:34pm

Hi!

Sorry, my comment is not pertinent, but I wanted to share it.

It makes me laugh that every new algorithm paper try to compete with NUTS, because they point unwillingly that NUTS may be the very best general-purpose algorithm for full bayesian inference out there.

And science is consensus ahah!

Lucas

roualdes · June 6, 2020, 4:15pm

I like the paper. Thanks for sharing. Cool idea. I especially like to see the interplay between the variational inference and MCMC spheres.

Just to help make it more explicit, in terms of Stan’s implementation of HMC, the paper uses NUTS with an identity metric (“unit_e”) as implemented at https://github.com/aki-nishimura/NUTS-matlab. They also use a different calculation for ESS than Stan does, hence their max ESS tops out at the number of samples, see Table 1.

roualdes · June 6, 2020, 5:28pm

Here’s an attempt at a more formal (but not formal) comparison to Stan, using their example labelled Neal’s Gaussian target. This is a zero-mean multivariate Gaussian with diagonal covariance matrix \Sigma = diag(s^2_1,...,s^2_{100}) where s_i = i/100 for i = 1:100.

In an attempt to replicate their implementation of NUTS, I ran Stan with metric = “unit_e” and then all the numbers they report in the paper: num_warmup = 500, num_samples = 2*10^4, and chain = 1 (a guess). Replicating their implementation of NUTS should help compare across machines. It’s not clear what hardware they’re running. I’ll report my system details below. They report the average of 10 runs, and I just have 1 run. Then I ran Stan with the same iteration and chain numbers, but the default metric = “diag_e”.

Borrowing parts of their Table 1 and appending two rows, I got

Method	Time	Min ESS	Min ESS / second
gadMALAf	8.7	1413.4	161.7
NUTS	360.5	18479.6	51.28
Stan (unit_e)	669	18929	28.3
Stan (diag_e)	29.5	15758	534

I’m running macOS Catalina 10.15.5 with a 2.7 GHz 12-Core Intel Xeon E5 processor and 64 GB 1866 MHz DDR3 memory. I used R v4.0, CmdStan v2.23.0, and CmdStanR v0.0.0.9000.

Red-Portal · June 7, 2020, 1:29pm

That evaluation is really nice, thank you very much for sharing.
Since their implementation is based on MATLAB (which is horribly slow), I think it wouldn’t be fair to directly compare with Stan.
I think the C++ advantage would give Stan about x5 ~ x10 speedup?
Considering that, I guess the ESS/second of the proposed method could still be comparable with NUTS.

ahartikainen · June 7, 2020, 3:26pm

It depends on what parts of calculations take time. For loops are probably jitted and linear algebra runs on C/C++/Fortran.

anon75146577 · June 7, 2020, 5:59pm

Lol. It’s a bit cheeky to compare an adaptive MCMC method with a non-adaptive variant of NUTS on a problem that’s going to require some scaling.

Red-Portal · June 8, 2020, 6:45am

I think NUTS can be seen as an adaptive method by itself (Hence the word “adaptively setting paths” in the original paper’s title).
But yes they should have compared with a non-unit preconditioner to be more fair.
However, judging from the results, I think it will still be quite competetive.

anon75146577 · June 8, 2020, 1:02pm

NUTS is not an adaptive MCMC method in the usual sense (there isn’t a family of transition kernels that are being selected from at each step)

Bob_Carpenter · July 7, 2020, 7:40pm

That’s why we like to compare on number of log density and gradient evaluations. There’s no rule of thumb to measure computational differences. I can get a 5 - 10 times speedup of naive C++ code just (OK, it’s not quite trivial) by exploiting SSE or AVX friendly code, for example.

Unless the variables are all unit scale, not preconditioning makes a huge difference for NUTS. The factor depends on the condition, of course.

NUTS estimates the metric and step size during warmup. Then it selects the integration time (nubmer of steps) during sampling on an iteration-by-iteration basis. This was all in the original Hoffman and Gelman paper, but we’ve moved on from the specifics of that paper in a number of ways.

xiehw · July 7, 2020, 7:44pm

Is Stan’s current sampler algorithm still the XMC described in Betancourt 2016 (https://arxiv.org/pdf/1601.00225.pdf), or has it been further updated already?

Bob_Carpenter · July 8, 2020, 8:36pm

I don’t think Stan ever used the exhaustive HMC (XHMC) from @betanalpha’s paper. I think he evaluated it and wound up modifying NUTS instead. The key point in the conclusion I believe is still true:

The mixed performance of the two algorithms shows that neither criteria is able to robustly identify optimal integration times in all cases and suggests that better termination criteria can still be developed.

betanalpha · July 9, 2020, 2:32am

The current implementation of Hamiltonian Monte Carlo in Stan, modulo the current adaptation procedures, is largely described in Appendix A.5 of https://arxiv.org/abs/1701.02434. The only major change since then has been additional termination checks to avoid holes that can sometimes arise in discrete trajectories.

Red-Portal · February 16, 2022, 9:41pm

For anybody interested, this idea has been extended to Hamiltonian Monte Carlo.
link:

Topic		Replies	Views
Reference for current version of NUTS/HMC in stan? Algorithms	10	1766	March 15, 2022
Comparing Stan's adaptation phase to that of nuts-rs? Algorithms	20	1647	August 11, 2023
HMC (jittered) vs. NUTS on 1000-dimensional standard normal Algorithms mcmc	9	3894	April 29, 2019
Comparing implementations of mixed logit Bayesian inference General	3	1047	July 14, 2020
New Algorithm: Newtonian Monte Carlo paper Algorithms mcmc	7	2133	February 7, 2020

New algorithm: Gradient-based Adaptive Markov Chain Monte Carlo

Related topics