I have a question: how should we be evaluating samplers to convince ourselves they're better than our current default sampler (multinomial NUTS), either across the board or for specific problem types?
The reason this came up is that there are a bunch of proposals on the table with various goals that I want to evaluate:
- Nutpie: This is Adrian Seyboldt's sampler that is now the default in PyMC. It uses scores (gradients of the log density) in addition to values, so it potentially adapts faster than Stan.
- Microcanonical HMC: This is Jakob Robnik and Uroš Seljak's sampler. It is biased, but appears to have lower error than Stan up to a few thousand iterations and in fewer gradient evaluations. They, along with Ruben Cohn-Gordon, are looking into the Metropolis-adjusted version as well.
- Gibbs self-tuning: This is the sampler that Nawaf Bou-Rabee, Milo Marsden, Tore Kleppe, Sifan Liu, Chirag Modi, and I have been working on in various forms. It does automatic step size tuning the same way that NUTS does number-of-steps tuning.
- RealNVP normalizing flows: This is a general variational technique, but the tweaks from Abhinav Agrawal, Justin Domke, and Dan Sheldon make it competitive with or better than MCMC in the evaluations Justin and I did while he was visiting.
Plus, whatever the next person sends us saying it should go into Stan.
Gradient evals to ESS > 100 for all parameters
This is the main thing I personally care about. From a cold start, how many gradient evaluations does it take to reach ESS > 100 for every parameter?
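Here's a minimal sketch of how this could be measured, assuming post-warmup draws and per-iteration leapfrog counts are available as arrays (the variable names are hypothetical) and that ArviZ's `ess` returns a scalar when handed a single parameter's (chains, draws) array:

```python
import numpy as np
import arviz as az

def grad_evals_to_target_ess(draws, n_leapfrog, warmup_grad_evals, target_ess=100):
    """Gradient evaluations (warmup included) until the minimum bulk ESS
    across all parameters first exceeds target_ess.

    draws: (chains, iterations, parameters) post-warmup draws.
    n_leapfrog: (chains, iterations) leapfrog steps (= gradient evals) per iteration.
    warmup_grad_evals: total gradient evaluations spent during warmup.
    """
    n_chains, n_iter, n_params = draws.shape
    for t in range(100, n_iter + 1, 100):  # coarse grid; ESS is noisy for small t
        min_ess = min(
            float(az.ess(draws[:, :t, k], method="bulk")) for k in range(n_params)
        )
        if min_ess >= target_ess:
            return warmup_grad_evals + int(n_leapfrog[:, :t].sum())
    return None  # target ESS not reached with the available draws
```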
ESS per gradient eval after warmup
After we finish warmup and start sampling, we can measure ESS per gradient evaluation during the sampling phase.
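The post-warmup version of the same computation is just the minimum bulk ESS divided by the total number of leapfrog steps (Stan reports these as n_leapfrog__); again a sketch with hypothetical array names:

```python
import numpy as np
import arviz as az

def ess_per_grad_eval(draws, n_leapfrog):
    """Minimum bulk ESS across parameters per post-warmup gradient evaluation.

    draws: (chains, iterations, parameters) post-warmup draws.
    n_leapfrog: (chains, iterations) leapfrog steps per post-warmup iteration.
    """
    n_params = draws.shape[-1]
    min_ess = min(float(az.ess(draws[:, :, k], method="bulk")) for k in range(n_params))
    return min_ess / float(np.sum(n_leapfrog))
```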
Cost to warmup in gradient evals
I'm posing this as a question in the Developers list, but I'm happy to hear from anyone. There's also the issue of how to measure convergence of warmup here.
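One possible operationalization, assuming warmup iterations are saved along with per-iteration leapfrog counts (array names hypothetical): count gradient evaluations until the cross-chain split R-hat of lp__ over a trailing window drops below a threshold. This is just one way to define "warmup has converged," not a settled answer:

```python
import numpy as np
import arviz as az

def warmup_grad_evals_to_stationarity(warmup_lp, warmup_n_leapfrog,
                                      rhat_target=1.01, window=100):
    """Gradient evals into warmup until split R-hat of lp__ over a trailing
    window of iterations (across chains) falls below rhat_target.

    warmup_lp, warmup_n_leapfrog: (chains, warmup_iterations) arrays.
    """
    n_chains, n_warm = warmup_lp.shape
    for t in range(window, n_warm + 1):
        rhat = float(az.rhat(warmup_lp[:, t - window:t]))
        if rhat < rhat_target:
            return int(warmup_n_leapfrog[:, :t].sum())
    return None  # never "converged" by this criterion
```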
Squared error on \theta, \theta^2, \theta^\top \theta
Given a target density p(\theta) and either reference draws or reference moments, evaluate the squared error in estimating the expectations of \theta, \theta^2, and \theta^\top \theta. This may be the best way to back out ESS, because direct ESS estimates are noisy.
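A sketch of this evaluation, assuming reference means and variances for each quantity are available (e.g., from very long reference runs or analytic results); the squared error also implies an ESS through the CLT relation err^2 \approx \mathrm{var} / \mathrm{ESS}:

```python
import numpy as np

def moment_sq_error_and_implied_ess(draws, ref_mean, ref_var):
    """Squared error of Monte Carlo estimates of E[theta], E[theta^2], and
    E[theta' theta], plus the ESS each error implies via err^2 ~= var / ESS.

    draws: (num_draws, num_params) pooled post-warmup draws.
    ref_mean, ref_var: reference means and variances for the concatenated
    quantities (theta_1..theta_D, theta_1^2..theta_D^2, theta' theta),
    so both have length 2 * num_params + 1.
    """
    theta_hat = draws.mean(axis=0)                   # estimate of E[theta]
    theta_sq_hat = (draws ** 2).mean(axis=0)         # estimate of E[theta^2]
    theta_dot_hat = (draws ** 2).sum(axis=1).mean()  # estimate of E[theta' theta]

    est = np.concatenate([theta_hat, theta_sq_hat, [theta_dot_hat]])
    sq_err = (est - np.asarray(ref_mean)) ** 2
    # A single squared error is a very noisy ESS estimate; in practice,
    # average sq_err over many independent runs (seeds) before dividing.
    implied_ess = np.asarray(ref_var) / sq_err
    return sq_err, implied_ess
```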
Proper \mathcal{O}(1 / \sqrt{n}) scaling
I’m always setting up code to take a simple known distribution and make sure that standard errors drop at a rate of 1 / \sqrt{n} in the number of draws n. This is more of a validation that the algorithm works. Biased algorithms clearly fail: they drop at roughly that rate for a while, then asymptote at their bias.
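Here's the kind of check I mean, with i.i.d. standard normal draws standing in for the sampler on a known target (swap in draws from the sampler under test); for an unbiased algorithm the slope of log RMSE against log n should be close to -1/2, while a biased algorithm's curve flattens at its bias:

```python
import numpy as np

rng = np.random.default_rng(1234)

def rmse_of_mean_estimate(sample_sizes, num_reps=200):
    """RMSE of the sample-mean estimate of E[theta] = 0 for a standard
    normal target, as a function of the number of draws n."""
    rmses = []
    for n in sample_sizes:
        draws = rng.standard_normal((num_reps, n))  # stand-in for sampler draws
        rmses.append(np.sqrt(np.mean(draws.mean(axis=1) ** 2)))
    return np.array(rmses)

sizes = np.array([100, 400, 1600, 6400, 25600])
rmses = rmse_of_mean_estimate(sizes)

# Least-squares slope of log RMSE vs. log n; expect roughly -0.5 if unbiased.
slope = np.polyfit(np.log(sizes), np.log(rmses), 1)[0]
print(rmses, slope)
```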
Your evaluation here
…