Facebook just put out a paper on “Newtonian Monte Carlo” posting it here if anyone has an opinion on it. From the abstract:

We present Newtonian Monte Carlo (NMC), a method to improve Markov Chain Monte Carlo (MCMC) convergence by analyzing the first and second order gradients of the target density to determine a suitable proposal density at each point. Existing first order gradient-based methods suffer from the problem of determining an appropriate step size. Too small a step size and it will take a large number of steps to converge, while a very large step size will cause it to overshoot the high density region. NMC is similar to the Newton-Raphson update in optimization where the second order gradient is used to automatically scale the step size in each dimension. However, our objective is not to find a maxima but instead to find a parameterized density that can best match the local curvature of the target density. This parameterized density is then used as a single-site Metropolis-Hastings proposal.
As a further improvement on first order methods, we show that random variables with constrained supports don’t need to be transformed before taking a gradient step. NMC directly matches constrained random variables to a proposal density with the same support thus keeping the curvature of the target density intact.
We demonstrate the efficiency of NMC on a number of different domains. For statistical models where the prior is conjugate to the likelihood, our method recovers the posterior quite trivially in one step. However, we also show results on fairly large non-conjugate models, where NMC performs better than adaptive first order methods such as NUTS or other inexact scalable inference methods such as Stochastic Variational Inference or bootstrapping.

I’m bad at math, but it looks like they need the analytical derivatives for each proposal distribution but can still just use autodiff for the models log prob calculation?

Additionally, they are likely running PyTorch on GPU, so for fat models with lots of datapoints, comparing to single core Stan, walltime comparison is a disengenuous benchmark. There’s probably ~100x difference just in evaluating the gradient of lp, especially if they use single precision. If they were to normalize for hardware differences the Neff/s may not be impressive enough for a publication. It would also seem pertinent to compare to a Stan RHMC implementation?

Thanks for the refs. @arya and @bbbales2 were thinking about better stiffness handling in integrators for Stan and it’s always enlightening to hear from @betanalpha on these topics.

P.S. In general, you can use autodiff anywhere you would otherwise have an analytic gradient if you have a rich enough autodiff system. It might be expensive, especially if it’s nested inside other autodiff, but I think it’s always possible.

Stripping away their unfortunate notation what they propose is a Metropolis-Hastings within block Gibbs sampler using a regularized Hessian as a local Gaussian approximation. The only novel contribution is using the conditional mean of that Gaussian approximation to define the step (similar to “Riemannian” Langevin methods) instead of taking a random Gaussian sample. They don’t go into many details, but because they define a deterministic proposal they would need to do some very careful corrections in order to get a well-defined Metropolis-Hastings correction. For comparison the “Riemannian” Langevin methods have noise around the gradient update and still requires a nasty Hastings ratio in the acceptance probability.

Regardless of those details, the previous comments about the poor evaluations metrics are spot on. To avoid underlying implementation differences they would need to compare something like effective sample size per equivalent gradient evaluation, where equivalent gradient evaluation would be something like N * iterations for this new method to account for the cost of computing the Hessian. The fact that they are using GPUs on very parallelizeable examples and only getting a factor of 5 improvement in effective sample size per walltime does not bode well for the actual performance of the method.

On a side note, I’m not sure what the default “Pyro” method is, but the fact that it’s that slow compared to Stan-on-CPU is honestly pretty shocking.