Hi everyone,
I know the Stan forum gets spammed with every MCMC algorithm ever published.
But I think this one, “Gradient-based Adaptive Markov Chain Monte Carlo”, is quite interesting.
It was published last year at NeurIPS.
The authors propose a new way to adapt full-rank Gaussian proposals for MALA using variational inference.
A comparison against NUTS is also provided, showing significant improvements in ESS per minute (about 2x to 3x).
Opinions?
Hi!
Sorry, my comment is not pertinent, but I wanted to share it.
It makes me laugh that every new algorithm paper tries to compete with NUTS, because they unwittingly point out that NUTS may be the very best general-purpose algorithm for full Bayesian inference out there.
And science is consensus ahah!
Lucas
I like the paper. Thanks for sharing. Cool idea. I especially like to see the interplay between the variational inference and MCMC spheres.
Just to make it more explicit in terms of Stan’s implementation of HMC: the paper uses NUTS with an identity metric (“unit_e”), via the implementation at https://github.com/aki-nishimura/NUTS-matlab. They also use a different ESS calculation than Stan does, which is why their max ESS tops out at the number of samples (see Table 1).
Here’s an attempt at a more direct (though still informal) comparison to Stan, using their example labelled Neal’s Gaussian target. This is a zero-mean multivariate Gaussian with diagonal covariance matrix \Sigma = diag(s^2_1, ..., s^2_{100}), where s_i = i/100 for i = 1, ..., 100.
To replicate their implementation of NUTS, I ran Stan with metric = “unit_e” and otherwise matched the numbers they report in the paper: num_warmup = 500, num_samples = 2*10^4, and 1 chain (a guess). Replicating their implementation of NUTS should help compare across machines, since it’s not clear what hardware they’re running; I’ll report my system details below. They report the average of 10 runs, while I have just 1 run. Then I ran Stan with the same iteration and chain numbers, but with the default metric = “diag_e”.
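In case it helps anyone replicate this, here’s roughly what the setup looks like in R with cmdstanr. This is a minimal sketch rather than exactly what I ran: the Stan program for Neal’s Gaussian target is written out by hand, the object names and seed are arbitrary, and I’m using current cmdstanr argument names (`iter_warmup`, `iter_sampling`, `metric`), which may differ slightly in older versions.

```r
library(cmdstanr)

# Neal's Gaussian target: zero-mean 100-dimensional normal with
# diagonal covariance, s_i = i / 100.
stan_code <- "
data {
  int<lower=1> D;
}
transformed data {
  vector[D] s;
  for (d in 1:D) s[d] = d * 1.0 / D;
}
parameters {
  vector[D] x;
}
model {
  x ~ normal(0, s);
}
"

mod <- cmdstan_model(write_stan_file(stan_code))

# The reported settings: 500 warmup iterations, 2e4 sampling iterations, 1 chain.
fit_unit <- mod$sample(data = list(D = 100), chains = 1,
                       iter_warmup = 500, iter_sampling = 20000,
                       metric = "unit_e", seed = 1)

# Same again with Stan's default diagonal metric.
fit_diag <- mod$sample(data = list(D = 100), chains = 1,
                       iter_warmup = 500, iter_sampling = 20000,
                       metric = "diag_e", seed = 1)

# Minimum bulk ESS across the 100 coordinates (via the posterior package).
min(fit_unit$summary(variables = "x")$ess_bulk)
min(fit_diag$summary(variables = "x")$ess_bulk)
```

Since it’s a single chain and a single run, expect some run-to-run noise in the numbers below.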
Borrowing parts of their Table 1 and appending two rows, I got
Method | Time (s) | Min ESS | Min ESS / second |
---|---|---|---|
gadMALAf | 8.7 | 1413.4 | 161.7 |
NUTS | 360.5 | 18479.6 | 51.28 |
Stan (unit_e) | 669 | 18929 | 28.3 |
Stan (diag_e) | 29.5 | 15758 | 534 |
I’m running macOS Catalina 10.15.5 with a 2.7 GHz 12-Core Intel Xeon E5 processor and 64 GB 1866 MHz DDR3 memory. I used R v4.0, CmdStan v2.23.0, and CmdStanR v0.0.0.9000.
That evaluation is really nice, thank you very much for sharing.
Since their implementation is in MATLAB (which is horribly slow), I don’t think it’s fair to compare run times directly with Stan.
I’d guess the C++ advantage gives Stan about a 5x to 10x speedup?
Taking that into account, I guess the ESS/second of the proposed method could still be comparable with NUTS.
It depends on which parts of the calculation take the time. In MATLAB, for loops are probably JIT-compiled and the linear algebra runs in C/C++/Fortran anyway.
Lol. It’s a bit cheeky to compare an adaptive MCMC method with a non-adaptive variant of NUTS on a problem that’s going to require some scaling.
I think NUTS can be seen as an adaptive method by itself (hence the phrase “adaptively setting path lengths” in the original paper’s title).
But yes, they should have compared against a non-unit preconditioner to be fairer.
However, judging from the results, I think it would still be quite competitive.
NUTS is not an adaptive MCMC method in the usual sense (there isn’t a family of transition kernels being selected from at each step).
That’s why we like to compare on the number of log density and gradient evaluations. There’s no reliable rule of thumb for converting computational differences across implementations: I can get a 5x to 10x speedup over naive C++ code just by writing SSE- or AVX-friendly code (OK, it’s not quite trivial), for example.
Unless the variables are all on unit scale, not preconditioning makes a huge difference for NUTS. The factor depends on the conditioning, of course.
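As a concrete sketch of the evaluation-count comparison (assuming a cmdstanr fit like the hypothetical `fit_unit` from the snippet above): each leapfrog step costs one gradient evaluation, and Stan records the per-iteration count in `n_leapfrog__`.

```r
library(posterior)

# Sampler diagnostics include n_leapfrog__, the number of leapfrog steps
# per iteration; each step is one gradient (and log density) evaluation.
diag_draws <- fit_unit$sampler_diagnostics()
n_grad <- sum(extract_variable(diag_draws, "n_leapfrog__"))

# A hardware-independent efficiency measure: ESS per gradient evaluation.
min_ess <- min(fit_unit$summary(variables = "x")$ess_bulk)
min_ess / n_grad
```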
NUTS estimates the metric and step size during warmup. Then it selects the integration time (number of steps) during sampling on an iteration-by-iteration basis. This was all in the original Hoffman and Gelman paper, but we’ve moved on from the specifics of that paper in a number of ways.
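For reference, those warmup adaptation knobs are exposed through the usual CmdStan arguments. A hedged sketch in cmdstanr (same hypothetical `mod` and data as above; the values shown are just CmdStan’s defaults):

```r
# Step size is tuned to hit adapt_delta; the metric is estimated in the
# "slow" windows between the initial and terminal fast buffers.
fit <- mod$sample(data = list(D = 100), chains = 1,
                  iter_warmup = 500, iter_sampling = 20000,
                  metric = "diag_e",
                  adapt_delta = 0.8,  # target acceptance statistic
                  init_buffer = 75,   # initial fast phase: step size only
                  window = 25,        # first metric-estimation window (then doubles)
                  term_buffer = 50)   # final fast phase: re-tune step size to the new metric
```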
Is Stan’s current sampler algorithm still the XHMC described in Betancourt 2016 (https://arxiv.org/pdf/1601.00225.pdf), or has it been updated further since then?
I don’t think Stan ever used the exhaustive HMC (XHMC) from @betanalpha’s paper. I think he evaluated it and wound up modifying NUTS instead. The key point in the conclusion I believe is still true:
> The mixed performance of the two algorithms shows that neither criteria is able to robustly identify optimal integration times in all cases and suggests that better termination criteria can still be developed.
The current implementation of Hamiltonian Monte Carlo in Stan, modulo the current adaptation procedures, is largely described in Appendix A.5 of https://arxiv.org/abs/1701.02434. The only major change since then has been additional termination checks to avoid holes that can sometimes arise in discrete trajectories.
For anybody interested, this idea has been extended to Hamiltonian Monte Carlo.
link: