How to evaluate samplers for inclusion in Stan?

I have a question: how should we evaluate samplers to convince ourselves they’re better than our current default sampler (multinomial NUTS), either across the board or for specific problem types?

The reason this came up is that there are a bunch of proposals on the table with various goals that I want to evaluate:

  • Nutpie: This is Adrian Seyboldt’s sampler that is now the default sampler in PyMC. It uses scores (gradients of log density) in addition to values, so potentially adapts faster than Stan.

  • Micro-canonical HMC: This is Jakob Robnik and Uroš Seljak’s sampler, which is biased but appears to have lower error than Stan up to a few thousand iterations, and in fewer gradient evaluations. They, along with Ruben Cohn-Gordon, are looking into the Metropolis-adjusted version as well.

  • Gibbs self tuning: This is the sampler that Nawaf Bou-Rabee, Milo Marsden, Tore Kleppe, Sifan Liu, Chirag Modi, and I have been working on in various forms. It does automatic step size tuning the same way that NUTS does number-of-steps tuning.

  • RealNVP normalizing flows: This is a general variational technique, but the tweaks from Abhinav Agrawal, Justin Domke, and Dan Sheldon make it competitive with or better than MCMC in the evaluations Justin and I did while he was visiting.

Plus, whatever the next person sends us saying it should go into Stan.

Gradient evals to ESS > 100 for all parameters

This is the main thing I personally care about. From a cold start, how many gradient evaluations does it take to reach ESS > 100 for all parameters?
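A rough sketch of how this could be measured, assuming ArviZ for ESS; `draws` (chains × iterations × parameters, from the very first iteration) and `n_grad_evals` (chains × iterations) are placeholder arrays you would pull from your sampler, not Stan output fields.

    import numpy as np
    import arviz as az

    def grad_evals_to_target_ess(draws, n_grad_evals, target=100, step=100):
        """Total gradient evals from a cold start until min ESS across parameters > target.

        One crude interpretation: recompute ESS on the prefix of all draws so far
        (warmup included) every `step` iterations.
        """
        n_iter = draws.shape[1]
        for n in range(step, n_iter + 1, step):
            min_ess = float(az.ess(draws[:, :n, :]).to_array().min())
            if min_ess > target:
                return int(n_grad_evals[:, :n].sum())
        return None  # target ESS never reached in this run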

ESS per gradient eval after warmup

After we finish warmup and start sampling, we can measure ESS per gradient evaluation.
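For instance, with post-warmup draws and per-iteration leapfrog counts (placeholder names; ArviZ for ESS):

    import arviz as az

    # post_draws: (chains, draws, params); n_leapfrog: (chains, draws)
    ess_bulk = az.ess(post_draws, method="bulk").to_array()
    ess_per_grad = float(ess_bulk.min()) / float(n_leapfrog.sum())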

Cost to warmup in gradient evals

This topic is a question: I’m posing it in the Developers list, but I’m happy to hear from anyone. There’s also the issue here of how to measure convergence of warmup.

Squared error on \theta, \theta^2, \theta^\top \theta

Given a target density p(\theta) and either reference draws or reference moments, evaluate the squared error in estimating the expectations of \theta, \theta^2, and \theta^\top \theta. This may be the best way to back out ESS, because direct ESS estimates are noisy.
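A minimal sketch of that error computation, with `draws` as an (iterations × dimension) array and the `ref_*` quantities taken from reference draws or analytic moments (all placeholder names). Standardizing by the reference posterior standard deviations would make the errors comparable across parameters.

    import numpy as np

    def moment_sq_errors(draws, ref_mean, ref_second_moment, ref_sq_norm):
        # squared error of the estimate of E[theta], componentwise, then averaged
        err_theta = np.mean((draws.mean(axis=0) - ref_mean) ** 2)
        # same for E[theta^2]
        err_theta2 = np.mean(((draws ** 2).mean(axis=0) - ref_second_moment) ** 2)
        # scalar E[theta^T theta]
        err_sq_norm = ((draws ** 2).sum(axis=1).mean() - ref_sq_norm) ** 2
        return err_theta, err_theta2, err_sq_norm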

Proper \mathcal{O}(1 / \sqrt{n}) scaling

I’m always setting up code to take a simple known distribution and make sure that standard errors drop at a rate of 1 / \sqrt{n} in the number of draws n. This is more a validation that the algorithm works at all. Biased algorithms fail in a clear way: they drop at roughly the right rate, then asymptote at their bias.
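A sketch of the kind of check I mean, using i.i.d. standard-normal draws as a stand-in for sampler output; you would substitute your sampler’s draws on a known target.

    import numpy as np

    rng = np.random.default_rng(0)
    ns = np.logspace(2, 4, 7).astype(int)
    reps = 200

    rmse = []
    for n in ns:
        est = rng.standard_normal((reps, n)).mean(axis=1)  # estimates of E[theta] = 0
        rmse.append(np.sqrt(np.mean(est ** 2)))

    # For an unbiased sampler the log-log slope should be about -1/2;
    # a biased sampler flattens out at its bias instead.
    slope = np.polyfit(np.log(ns), np.log(rmse), 1)[0]
    print(f"log-log slope: {slope:.2f}")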

Your evaluation here

6 Likes

Be more flexible than just “inclusion in Stan”

Beyond the technical considerations, I think a structural change could also help make the process of evaluation a bit lower stakes. If CmdStan and the interfaces built on top of it had a bit more modularity, new samplers could be “plugins” that users could optionally add. We could then have more tiers than just “in Stan” and “not in Stan”.

A simple version would be that each new sampler starts as a purely experimental plugin, not listed anywhere in official Stan materials and with no explicit guarantees beyond running and integrating with the main Stan/CmdStan codebase. Some of those would then be checked for some level of correctness and promoted to “beta”, listed in a separate section of the site/docs, and still requiring users to actively install them in addition to core Stan. Beta samplers would hopefully be tested by a small but non-negligible fraction of Stan users on a wide variety of models, and if the experience is positive we could move to inclusion in core Stan. Obviously, there could be more tiers if that made more sense.

SBC

On a separate note, I also think a thorough evaluation with SBC (simulation-based calibration) on a bunch of models within the target problem class should be a criterion.
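For concreteness, the core of SBC is just rank statistics of prior draws within posterior draws; here is a minimal sketch where `draw_prior`, `simulate_data`, and `fit_posterior` are placeholders for the model and the sampler under test (and the posterior draws should be thinned to be roughly independent).

    import numpy as np

    def sbc_ranks(num_sims=1000, num_post_draws=99):
        ranks = []
        for _ in range(num_sims):
            theta_true = draw_prior()                     # theta ~ p(theta)
            y = simulate_data(theta_true)                 # y ~ p(y | theta)
            post = fit_posterior(y, num_post_draws)       # posterior draws of a scalar quantity
            ranks.append(int(np.sum(post < theta_true)))  # rank in {0, ..., num_post_draws}
        # a calibrated sampler gives ranks that are uniform on {0, ..., num_post_draws}
        return np.array(ranks)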

4 Likes

Hi, @martinmodrak and thanks for the suggestions.

I’m interested in the problem in general, so happy to get more general feedback. I only framed it this way to try to focus the discussion here around what’s relevant for Stan decision making.

I’m not sure how we’d do this, but then I’m not the best person to talk to about C++ linking. Did you have something specific in mind? The way to get this kind of feature into Stan is to write a design doc. It’d have to somehow connect to how you’d specify a sampler in the interfaces, in addition to the very brittle way it’s done in CmdStan now.

Right now, the easiest way to use the Stan language with a new sampler is through BridgeStan. Nutpie provides a Stan interface this way. This is also what we’re doing around GIST. You can still use posterior in R or ArviZ in Python.
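For example, the BridgeStan Python interface exposes roughly this; paths are placeholders, and if your BridgeStan version doesn’t compile .stan files directly, point it at the compiled shared library instead.

    import numpy as np
    import bridgestan as bs

    model = bs.StanModel("model.stan", data="data.json")
    dim = model.param_unc_num()          # size of the unconstrained parameter vector

    theta = np.zeros(dim)                # wherever your sampler's state currently is
    lp, grad = model.log_density_gradient(theta, propto=True, jacobian=True)

    # ... your sampler updates theta using lp and grad ...

    theta_constrained = model.param_constrain(theta)   # constrained scale for reporting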

The advantage of getting them into Stan is an easier-to-use, uniform interface.

I think we’d need the original form from Cook that had a hypothesis test for uniformity. I don’t see how we could do the Talts et al. version, as it’s all subjective, by-eye evaluation on a dimension-by-dimension basis.

In some sense, I’m hoping we don’t need this if we have reference problems with reference answers. I am also interested in measuring samplers that may have bias, which are going to fail SBC (e.g., various forms of VI or unadjusted MCMC methods).

1 Like

One of my favorite things about Stan is the emphasis on verifying (to the extent possible) the correctness of the inference. In this sense, I’d love to see Stan continue to prioritize samplers that furnish useful convergence diagnostics like divergences in HMC. I don’t know if this is of core relevance for deciding whether to include a given sampler in Stan, but I do think it’s worth injecting into the conversation as an important desideratum for samplers in general, irrespective of their performance on specific reference problems. When a sampler fails, how likely is it to fail noisily?

3 Likes

I like the list you proposed to evaluate samplers.

I think an important additional question is which models should be used to evaluate the samplers. PosteriorDB is great, but I think it could use a couple more large models. Most models in there don’t have that many parameters. There are also a few models (those involving horseshoe priors, for instance) that I don’t think converge with any sampler I’ve seen. Adding several larger converging brms models would be quite helpful.

If Stan were to add more samplers, I think it would be important to ensure old results can still be reproduced. Perhaps a “sampler preset” could be introduced? Each sampler could have a name and a version. For each update to the algorithm that changes the results, the version could be incremented, but the old sampler would still be available. If all sampler output always includes the name and version of the sampler, users would not suddenly end up with broken chains if they rerun old code.
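Purely to illustrate the idea (nothing like this exists in Stan today), a preset could be as simple as a name/version pair mapped to frozen defaults:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SamplerPreset:
        name: str        # e.g. "nuts"
        version: int     # bumped on any change that alters sampling results
        settings: dict   # adaptation defaults frozen for this version

    PRESETS = {
        ("nuts", 1): SamplerPreset("nuts", 1, {"adapt_delta": 0.8, "max_depth": 10}),
        # ("nuts", 2): added when adaptation changes; version 1 stays available
    }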

I had considered adding something like this to Nutpie but hadn’t done so because it would have involved significant work to keep old versions consistently available. However, given the relatively small number of sampler changes in Stan over the years, maybe this would be doable?

A small comment from a very novice user perspective: if Stan were to include new samplers, I think it would be nice to have clear guidelines on when to choose each specific sampler.

As a user with limited Bayesian knowledge, I have to admit that it can already be confusing to choose between the different algorithms offered in Stan (MCMC Sampling), and I don’t think there are clear guidelines on when to use Pathfinder instead of MCMC or the Laplace approximation.

1 Like

I would like to second @jsocolar’s comment about failing loudly. Over the past year, I’ve really discovered how amazingly useful the loud diagnostics of HMC are in practice. Maybe you have a new algorithm like Gibbs self tuning that explores some pinched geometry better than current NUTS, without the divergences and without any warnings. I wouldn’t consider this necessarily better in real-world practice. It all depends. In practice, I may want to be alerted that these pinches exist and make changes to my model.

5 Likes

Just to add two more things:

  • When benchmarking nutpie, I noticed that the variance between runs was pretty high with both nutpie and Stan, even with 10 chains. So maybe another criterion would be variance between chains? A good HMC sampler, for instance, should have little variance in step size, trajectory length, and mass matrix between chains after warmup.

  • ESS / grad eval is usually a good measure, but the new normalizing-flow nutpie sampler, for instance, often spends most of its time outside gradient evaluations. So ESS / time should also clearly be on the list…
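Both are easy to track in the same harness, for example (placeholder arrays: `post_draws`, `n_leapfrog`, per-chain `step_sizes`, and `wall_time_s`; ArviZ for ESS):

    import numpy as np
    import arviz as az

    min_ess = float(az.ess(post_draws).to_array().min())
    print("ESS / grad eval:", min_ess / n_leapfrog.sum())
    print("ESS / second:   ", min_ess / wall_time_s)

    # between-chain adaptation spread, relative since step size is scale-dependent
    print("step size CV across chains:", np.std(step_sizes) / np.mean(step_sizes))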

I’ll reply to a bunch of folks at once.

We keep all of our old releases around, which feels like a much easier way to ensure reproducibility of old results. Even if we keep the samplers the same, we can’t get bit-level reproducibility without using exactly the same systems.

Absolutely, and I should’ve said that too; I’m open to suggestions on specifics. One of the things we’re going to do is build a large JAX-based evaluation set and also update posteriordb. I have the ability through Flatiron to host large files if we need somewhere to store bigger sets of reference draws.

I agree. This is one of the reasons we haven’t gone down the rabbit hole of introducing a wide choice of samplers. In my experience, if we have N samplers, we multiply a lot of people’s work by N, because they’re going to try them all.

We could definitely improve along these lines. Part of the problem is that we don’t have a crisp characterization of when the different systems work. For these, I can say:

  • Pathfinder will work well on log-concave distributions where the posterior is roughly normal; the main goal for Pathfinder was to get reasonable inits for harder problems, not to be a replacement for MCMC.
  • The Laplace approximation will also work well if the posterior is roughly normal. It will fail if the posterior doesn’t have a mode, as in a hierarchical model. Pathfinder can handle that situation, as we saw in the paper, but it doesn’t give a great posterior approximation (though it does find the bulk of the probability mass).
  • MCMC is what you should be using whenever you have the compute for it.

There are two things going on. When we fail, we return errors. What we’re often doing is raising warnings, like divergences, which don’t necessarily imply that your answers are wrong.

Our goal in building a more robust HMC sampler is to eliminate the divergences. You’re right that reparameterizing will be a more robust solution when possible, and it should also be faster. The problem is that it’s really, really hard in complicated models, and it’s too hard for most of our users. It’s also a trial-and-error problem: for example, the centered parameterization is better with well-identified parameters (strong data plus prior), and the non-centered parameterization is better with poorly identified ones (weaker prior, less data). We have lots of hierarchical models with collections of varying effects where the centered parameterization is better for some and non-centered for others, so it’s not just about the overall amount of data.
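To make the terminology concrete, here is an illustration on Neal’s funnel, written as plain Python log densities rather than Stan code: the two parameterizations define the same model, but the sampler sees very different geometry.

    import numpy as np
    from scipy.stats import norm

    def lp_centered(y, x):
        # y ~ normal(0, 3); x[i] ~ normal(0, exp(y / 2)): the pinched funnel geometry
        return norm.logpdf(y, 0, 3) + norm.logpdf(x, 0, np.exp(y / 2)).sum()

    def lp_noncentered(y, x_raw):
        # y ~ normal(0, 3); x_raw[i] ~ normal(0, 1), with x = exp(y / 2) * x_raw
        # same model, but the sampler sees an isotropic normal
        return norm.logpdf(y, 0, 3) + norm.logpdf(x_raw, 0, 1).sum()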

Perhaps we can look at something like the condition number? But that’s expensive, as it requires a quadratic Hessian construction and a cubic solve. You can also diagnose areas of high curvature, which is what I assume you mean by “pinches”, by looking at the variation in log density. In the funnel with the centered parameterization, you get huge swings in log density over the posterior. We know this is bad because the change in log density in each iteration is bounded through the random momentum assignment.
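Roughly what I have in mind, assuming an ArviZ InferenceData with lp and energy in sample_stats (`idata` is a placeholder); E-BFMI is the standard diagnostic along these lines, using energy rather than log density.

    import numpy as np
    import arviz as az

    lp = idata.sample_stats["lp"].values           # (chains, draws)
    energy = idata.sample_stats["energy"].values

    print("lp__ spread across draws:", lp.std())
    print("per-iteration energy-change spread:", np.diff(energy, axis=1).std())
    print("E-BFMI per chain:", az.bfmi(idata))     # small values flag the problem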

I’m also not sure where to draw the line. Drawing it at what vanilla HMC can sample feels arbitrary. If your model has really pleasant geometry, Metropolis will work. If it’s slightly worse, but with axis aligned scale variation, Gibbs can work.

2 Likes

Robustness to highly correlated posteriors, like nested random effects and Gaussian process priors, things like that.

Nothing is more annoying than working with a sampler that was evaluated on some cute simulated data.

I probably focused too narrowly on the technical part. Since this is tangential, I made an outline at Idea: A simple plugin system for CmdStan. But whatever the tech, I think a goal should be that a sampler in beta is implemented and documented in such a way that a substantial fraction of Stan users could (if they wanted) try the sampler for their problem with a reasonably short setup (say <30 minutes). This could totally be achievable with a BridgeStan implementation.

That’s not true; there are many known tests for discrete uniformity. The gamma statistic from [2103.10522] Graphical Test for Discrete Uniformity and its Applications in Goodness of Fit Evaluation and Multiple Sample Comparison is particularly useful IME (and the tail quantiles of the null distribution can be evaluated to use it in a formal test).
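Even without the gamma statistic, something as basic as a chi-square goodness-of-fit test on the binned ranks gives a formal check of discrete uniformity. A sketch, with `ranks` a placeholder array of SBC ranks in {0, ..., max_rank}:

    import numpy as np
    from scipy.stats import chisquare

    max_rank = 99
    counts = np.bincount(ranks, minlength=max_rank + 1)
    stat, p_value = chisquare(counts)   # null hypothesis: all ranks equally likely
    print(f"chi-square p-value: {p_value:.3f}")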

True, but even for the approximate ones, it would IMHO be good to know the extent of the miscalibration. I don’t want to push SBC too strongly. It is definitely not a hard requirement IMHO.

1 Like

Talts et al. (https://arxiv.org/pdf/1804.06788) intentionally avoid framing the calibration tests as hypothesis tests and rather suggest inspecting output by eye. From their paper:

There are many ways of testing the uniformity of the rank statistics, but the SBC procedure, outlined in Algorithm 1, exploits a histogram of rank statistics for a given random variable to enable visual inspection of uniformity (Figure 3).

From talking to @andrewgelman, I think this is partly due to his dislike of hypothesis tests, and partly due to the point that @martinmodrak brings up:

From the paper,

Fortunately, the structure of the Bayesian joint distribution allows for the validation of any Bayesian computational method capable of producing samples from the posterior distribution, or an approximation thereof. This includes not only Monte Carlo methods but also deterministic methods that yield approximate posterior distributions amenable to exact sampling, such as integrated nested Laplace approximation (INLA) (Rue, Martino and Chopin, 2009; Rue et al., 2017) and automatic differentiation variational inference (ADVI)

You have to go back to Cook’s paper (https://www.tandfonline.com/doi/abs/10.1198/106186006X136976) to find any mention at all of hypothesis testing. She tests for uniformity by transforming with an inverse CDF and testing for normality, but there are a lot of alternative tests that could be used.
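Something in the spirit of that check can be sketched as follows: posterior quantiles of the true values should be uniform, so their inverse-normal transforms should be standard normal. Here `quantiles` is a placeholder array with one posterior quantile per simulated data set, and this is only a one-sided version of the test.

    import numpy as np
    from scipy.stats import norm, chi2

    eps = 1e-10                                   # guard against quantiles of exactly 0 or 1
    z = norm.ppf(np.clip(quantiles, eps, 1 - eps))
    stat = np.sum(z ** 2)
    p_value = chi2.sf(stat, df=len(quantiles))    # small p-value suggests miscalibration
    print(f"p-value: {p_value:.3f}")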

I have a slightly different but relevant situation. I’m working on a new algorithm outside of Stan (proof of concept coded in another language) and want to test it on some big, hard models. Where did folks end up with this discussion? I’m also curious whether there are models that would make good test cases for performance with increasing dimensionality. Something like a binomial GLMM, maybe? If there are any good ones around, I’d be curious to hear. Thanks!

1 Like