New Algorithm: Newtonian Monte Carlo paper

stevebronder · December 17, 2019, 5:16pm

Facebook just put out a paper on “Newtonian Monte Carlo” posting it here if anyone has an opinion on it. From the abstract:

We present Newtonian Monte Carlo (NMC), a method to improve Markov Chain Monte Carlo (MCMC) convergence by analyzing the first and second order gradients of the target density to determine a suitable proposal density at each point. Existing first order gradient-based methods suffer from the problem of determining an appropriate step size. Too small a step size and it will take a large number of steps to converge, while a very large step size will cause it to overshoot the high density region. NMC is similar to the Newton-Raphson update in optimization where the second order gradient is used to automatically scale the step size in each dimension. However, our objective is not to find a maxima but instead to find a parameterized density that can best match the local curvature of the target density. This parameterized density is then used as a single-site Metropolis-Hastings proposal.
As a further improvement on first order methods, we show that random variables with constrained supports don’t need to be transformed before taking a gradient step. NMC directly matches constrained random variables to a proposal density with the same support thus keeping the curvature of the target density intact.
We demonstrate the efficiency of NMC on a number of different domains. For statistical models where the prior is conjugate to the likelihood, our method recovers the posterior quite trivially in one step. However, we also show results on fairly large non-conjugate models, where NMC performs better than adaptive first order methods such as NUTS or other inexact scalable inference methods such as Stochastic Variational Inference or bootstrapping.

https://openreview.net/pdf?id=SklKcJ3EFH

I’m bad at math, but it looks like they need the analytical derivatives for each proposal distribution but can still just use autodiff for the models log prob calculation?

daniel_h · December 17, 2019, 5:33pm

I’m not competent enough to have an opinion on this, but I thought it was better to use something like Neff/s for benchmarks?

maedoc · December 18, 2019, 8:56am

better to use something like Neff/s

Additionally, they are likely running PyTorch on GPU, so for fat models with lots of datapoints, comparing to single core Stan, walltime comparison is a disengenuous benchmark. There’s probably ~100x difference just in evaluating the gradient of lp, especially if they use single precision. If they were to normalize for hardware differences the Neff/s may not be impressive enough for a publication. It would also seem pertinent to compare to a Stan RHMC implementation?

avehtari · December 20, 2019, 7:44pm

With quick reading I wasn’t yet convinced.
There is also some overlap with https://epubs.siam.org/doi/abs/10.1137/110845598

Bob_Carpenter · January 23, 2020, 10:42pm

Thanks for the refs. @arya and @bbbales2 were thinking about better stiffness handling in integrators for Stan and it’s always enlightening to hear from @betanalpha on these topics.

P.S. In general, you can use autodiff anywhere you would otherwise have an analytic gradient if you have a rich enough autodiff system. It might be expensive, especially if it’s nested inside other autodiff, but I think it’s always possible.

betanalpha · January 24, 2020, 5:02pm

Stripping away their unfortunate notation what they propose is a Metropolis-Hastings within block Gibbs sampler using a regularized Hessian as a local Gaussian approximation. The only novel contribution is using the conditional mean of that Gaussian approximation to define the step (similar to “Riemannian” Langevin methods) instead of taking a random Gaussian sample. They don’t go into many details, but because they define a deterministic proposal they would need to do some very careful corrections in order to get a well-defined Metropolis-Hastings correction. For comparison the “Riemannian” Langevin methods have noise around the gradient update and still requires a nasty Hastings ratio in the acceptance probability.

Regardless of those details, the previous comments about the poor evaluations metrics are spot on. To avoid underlying implementation differences they would need to compare something like effective sample size per equivalent gradient evaluation, where equivalent gradient evaluation would be something like N * iterations for this new method to account for the cost of computing the Hessian. The fact that they are using GPUs on very parallelizeable examples and only getting a factor of 5 improvement in effective sample size per walltime does not bode well for the actual performance of the method.

On a side note, I’m not sure what the default “Pyro” method is, but the fact that it’s that slow compared to Stan-on-CPU is honestly pretty shocking.

Michalis_Titsias · February 2, 2020, 6:13pm

Related work: https://tminka.github.io/papers/qi-minka-HMH-AMIT-02.pdf

Bob_Carpenter · February 7, 2020, 11:56pm

Thanks for posting. I fear that approach is still going to have the diffusion problems in high dimensions that are typical of Metropolis.

Topic		Replies	Views
New algorithm: Gradient-based Adaptive Markov Chain Monte Carlo Algorithms mcmc	13	2050	February 16, 2022
Tried to Speed Up Stan (HMC) by Rewriting Dynamics, failed Algorithms	2	664	June 17, 2020
Standalone HMC sampler Algorithms mcmc	7	2259	May 3, 2018
Adaptive MC w/ normalizing flows Algorithms techniques	1	628	November 22, 2022
NUTS inside a Sequential Monte Carlo sampler Algorithms	12	1834	August 10, 2021

New Algorithm: Newtonian Monte Carlo paper

Related topics