How to compare different algorithms for Bayesian inference in terms of speed (and subject to effectiveness)

This has been discussed before as part of other topics, most recently in Blog: A gentle Stan to INLA comparison, but I think it would be good to have it all in one place.

Let’s say we have different types of algorithms for Bayesian inference: MCMC-based (HMC, M-H, Gibbs…), structural approximation (INLA…) and hardware-accelerated (multicore/GPU) variants.

We want to compare them (in terms of computation speed, but we can’t do that without also looking at “accuracy”) on a specific model and fixed input size. We are interested in how this changes as input size grows, but I assume that won’t be the difficult part.

It’s a “soft” question and there is no single “best” solution, so the goal is maybe just to come up with a procedure that most people wouldn’t have any major objections to if they saw it used in a scientific publication.

For now, just a few observations/questions to start the discussion:

  • Assuming that you have a good procedure for comparing algorithms, hardware-accelerated variants are in most cases trivial to deal with - the only difference will be that they take (hopefully) less time to compute. Running independent MCMC chains and pooling the samples is an exception, because the results will be different (and possibly much worse than running a single chain). In this context I see it as a different type of MCMC-based inference and it should be compared as such.

  • For comparing different MCMC-based algorithms we seem to be closest to reaching a consensus: run them for the same amount of time and compare effective sample size (ESS), or run them until they reach a particular ESS and compare the times.

  • How do we deal with tunable parameters such as the proportion of warmup samples? Technically, we could find the settings that maximize ESS (minimize time), but I don’t think that’s how people would do it in practice (= run with defaults until something goes wrong), and it would probably take even longer than just running the initial chain for a long time.

  • How reliable is ESS for this purpose? There are different ways of estimating ESS, but aren’t they all based on the assumption that the chain has converged? Intuitively (and from practical experience), I’d say it can fail miserably in some cases.

  • ESS can’t be used if something like INLA is included in the comparison. As some have done, we can replace ESS with how well the model estimates the parameters (mean squared error of the estimated means, for example), but that can only be done if the true values of the parameters are known. Or, we can measure overall goodness-of-fit with (approximate) cross-validation or test data (but this becomes a problem with temporal and spatial data). A minimal sketch of both ideas follows this list.
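Something like the sketch below is what I have in mind, assuming hypothetical `run_mcmc()` and `run_approx()` functions for the two methods being compared, and a crude autocorrelation-based ESS estimator (not Stan's):

```python
import time
import numpy as np

def ess_autocorr(draws):
    """Crude single-chain ESS estimate from empirical autocorrelations,
    truncated at the first negative autocorrelation."""
    x = np.asarray(draws, dtype=float)
    n = len(x)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    rho_sum = 0.0
    for rho in acf[1:]:
        if rho < 0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

def compare_once(run_mcmc, run_approx, true_mean):
    """run_mcmc() -> 1-D array of posterior draws; run_approx() -> scalar
    posterior-mean estimate (e.g. from an INLA-like method). Both are
    placeholders for whatever is actually being compared."""
    t0 = time.perf_counter()
    draws = run_mcmc()
    t_mcmc = time.perf_counter() - t0
    print("MCMC   ESS/sec:", ess_autocorr(draws) / t_mcmc,
          " squared error:", (draws.mean() - true_mean) ** 2)

    t0 = time.perf_counter()
    estimate = run_approx()
    t_approx = time.perf_counter() - t0
    print("Approx time:", t_approx, " squared error:", (estimate - true_mean) ** 2)
```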

  • I wrote a little about this here: http://andrewgelman.com/2017/10/20/barry-gibb-came-fourth-barry-gibb-look-alike-contest/

  • If software has defaults, I would use them for a comparison. It might be worth having an “optimally tuned” example as well, but given that most people will run software with the default settings, the performance under those conditions is important.

  • For MCMC, ESS/time is a good metric.

  • For general algorithms, you need to pick a metric. In this case, I’m not sure normalizing by time is useful (at least, I can’t see how you’d do it). It makes more sense to report performance under the chosen metric(s), wall time, and convergence diagnostics (e.g., for MCMC I’d report ESS, R-hat, divergences for HMC, BFMI, etc.).

  • The link above covers comparing spatial problems (same for time).

  • It’s important to remember that if you’re running under simulated data, the timings (and all other quantities derived from the algorithm and the posterior) are all random and should be reported with an interval.


Given that ESS is an estimate and there are many estimators, I’d recommend our version that discounts for non-converged multiple chains.

This has been discussed in another thread. It makes me want to change our defaults. We wind up looking slow on easy problems, which are usually what people use for comparison, as they’re all you can implement across a bunch of different systems.

That’s why we’ve been conservative in warmup size and number of iterations. Most simple problems could run on a quarter as many quite reliably. So we look bad on simple models in comparisons as a result, since people often just compare wall time, not even ESS/time.

The answer also depends on why you’re doing the comparison. If it’s “how much performance can I squeeze out of Stan” then you are going to need to tune. If it’s “how fast can I run this standard model”, then maybe.

I just wish the conclusions weren’t “Stan is faster than Y” (or vice-versa), but rather “for the data sizes and models we tested, using settings X, Stan is faster than Y as measured by Z”.


Yes! The only really meaningful timing comparisons are “order of magnitude” comparisons. (Is it instant, a cup of coffee, a lunch break, overnight, or longer?)

The other question that’s more Stan-centric is if the compilation time should be included in the timings. It’s a large fixed cost and if you’re only running a model once, it’s a relevant number.


It’s also just a huge qualitative drag on model development. We need an interpreter!

For small models, we don’t need all the optimization, so we could compile about twice as fast if we really wanted to optimize compile-plus-run time. We went for robustness and targeted bigger models.

I’d agree that this works most of the time, but I think it breaks down at the extremes: for very time-sensitive applications, where every percent counts (although that’s probably not the area where you would apply something like MCMC anyway), and for very long computation times, where a 50% improvement can save you days or even weeks of computation.

I was thinking about something like this (assume we have some metric and that we can simulate as many datasets of a certain size as we need): We pick some time t and run the algorithm many times for exactly (or approximately) time t, which gives us an estimate of the expectation of metric(t) (without most of the uncertainty that comes from the randomness in the algorithm and data).

Then we repeat this for different (increasing) t, and metric(t) should converge to something (at least for these kinds of algorithms I would expect that they don’t deteriorate with more computation time; and if one does, the model has bigger problems and/or no limiting distribution). If we plot this curve for each algorithm, we can at least compare them visually. For example, INLA has a more or less constant computation time, so we can see how much time MCMC needs to reach similar quality, how much worse it is at the same computation time, or how much better/worse it is at “convergence”.
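A minimal sketch of that procedure, assuming a hypothetical run_for_time(t) that simulates a fresh dataset, runs one algorithm for roughly t seconds, and returns the metric value (say, squared error of the posterior mean):

```python
import numpy as np

def metric_curve(run_for_time, time_budgets, n_reps=30):
    """For each time budget t, run the algorithm n_reps times and average the
    metric, so the curve mostly reflects t rather than run-to-run noise.
    Returns a list of (t, mean metric, standard error) tuples to plot, one
    curve per algorithm."""
    curve = []
    for t in time_budgets:
        values = [run_for_time(t) for _ in range(n_reps)]
        curve.append((t, np.mean(values), np.std(values) / np.sqrt(n_reps)))
    return curve

# e.g. metric_curve(run_for_time, time_budgets=[1, 2, 5, 10, 30, 60])
```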

I think ESS/time is a good general metric, but at the same time something that you would never be directly interested in (or trying to maximize).

This implies that multiple chains have to be run? You mean for empirical evaluation purposes only - run more chains for each algorithm to get a better estimate of the ESS of a single chain?

Whenever I hear running multiple chains (or taking advantage of multiple cores to speed up MCMC computation), I always remember Geyer’s rant: One Long Run (of course, it’s still better to have 8 chains of length n than one chain of length n, if it takes the same amount of time).

How often would you run a model only once? In model development as @Bob_Carpenter mentioned, but I don’t need empirical evaluation to conclude that it can be very annoying that Stan models take so much time to compile. :)

But it doesn’t take much effort to include compilation times, so why not.

Just a warning – effective sample size will not, in general, be meaningful unless your Markov chains are well behaved. For hard problems your Markov chain Monte Carlo estimators may be biased, and that bias won’t in general manifest in the effective sample size (unless you use the heuristic cross-chain effective sample size estimator in Stan).

Remember that deep down all we’re doing is estimating the expectations with respect to a given target distribution. That means there are only two numbers that can enter into a comparison metric – the error of an estimator relative to the true expectation and some measure of computational effort.

The error will, in general, be very hard to quantify for many algorithms. For MCMC, if we can verify that a central limit theorem holds, then the effective sample size (or rather its inverse square root) quantifies how the error of the resulting MCMC estimators scales. But verifying a central limit theorem theoretically for a given problem is notoriously hard. The dynamic Hamiltonian Monte Carlo methods that we use in Stan are exceptional in that regard, as their unique diagnostics are very powerful for identifying obstructions to the desired central limit theorem.

Consequently, in general there are two main strategies for quantifying error. First, you can use models with analytic expectations so that you can compute the error directly; second, you can use an algorithm like dynamic Hamiltonian Monte Carlo to construct empirical baselines. In the latter case you want to carefully check all diagnostics, run many chains, and run them for as long as possible to maximize the probability of seeing any problems that are there and to minimize the error of the empirical expectation estimates.

The measure of computational effort can consider wall time (if you can restrict comparisons to specific hardware, compilers, operating systems, and the like), number of density or gradient evaluations (which quantifies how efficient the algorithm is but not its specific implementation), or even something like wattage if you want to quantify how literally expensive the computation might be.
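As a rough sketch of the first strategy (the sampler function and its reported gradient-evaluation count are hypothetical placeholders):

```python
import numpy as np

def error_vs_effort(sampler, analytic_mean, n_reps=50):
    """sampler() -> (draws, n_gradient_evals) for the method under test.
    Measures estimator error against a known analytic expectation, and effort
    in gradient evaluations, which doesn't depend on hardware or compiler."""
    errors, evals = [], []
    for _ in range(n_reps):
        draws, n_grad = sampler()
        errors.append(np.mean(draws) - analytic_mean)
        evals.append(n_grad)
    print("RMSE:", np.sqrt(np.mean(np.square(errors))),
          " mean gradient evaluations:", np.mean(evals))
```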

Yep. He’s just wrong. Wrong wrong wrong wrongity wrong. If your chain is behaving well (ie is geometrically ergodic) then what he said makes sense. If it’s not, you just get one long bad chain.

The reason you want multiple chains started from diverse points is to check if your chain is behaving well. (This is what Rhat does) The (very small) efficiency loss is worth it to check you’re not computing rubbish.


Yes, that is the source of my doubts about using ESS, especially in this context.

Question(s): In the limit, does the expected ESS per sample converge to some non-negative constant (assuming a well-behaved chain)? So, if I’m generating the data and I know the “correct” value of a parameter, could I estimate ESS from, for example, how the squared error (correct vs. model estimate) changes with the number of samples? And if the chain does not behave, the squared error can detect this even in cases where the estimated ESS wouldn’t indicate any problems. In other words, if I can, why not measure variance and bias? And if this is all true, what is the point of using ESS in cases where we know the data-generating process?
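Something like this minimal sketch is what I have in mind (implied_ess and the replicated chain_means are hypothetical, and the relation MSE = posterior_var / ESS only holds for an unbiased chain obeying a CLT):

```python
import numpy as np

def implied_ess(chain_means, posterior_var, reference_mean):
    """Back out an 'implied' ESS from replicated chains: if the MSE of the
    per-chain posterior-mean estimates equals posterior_var / ESS, then
    ESS = posterior_var / MSE. Bias is folded into the MSE, so a biased
    chain shows up as a small implied ESS."""
    mse = np.mean((np.asarray(chain_means, dtype=float) - reference_mean) ** 2)
    return posterior_var / mse
```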

Because if a central limit theorem holds then the ESS quantifies the precision of the MCMC estimator without having to know what the true expectation is. Having a well-posed measure of error built into the algorithm (at least under the circumstances required for a CLT) is one of the things that makes Markov chain Monte Carlo so powerful.

Yes, but if I have the true expectation and thus a strictly better metric (one that is equivalent to ESS when CLT holds and better when it doesn’t), there is no point in reporting ESS.

I would not go that far. In some sense you are empirically calibrating the ESS in experiments like this so while the naive ESS estimator need not always be relevant, including it is useful to understand how the estimator behaves in various circumstances where we don’t know the true expectation values.

It would be good to be careful with definitions. Effective sample size is a theoretical concept that can be estimated with or without knowing the truth (some of the figures in this thread, N_eff BDA3 vs. Stan - #21 by avehtari, show both). Thus, please be more specific about which estimate of ESS you are referring to.

As discussed e.g. in BDA3 chapter 10, if we want to estimate the posterior distribution of theta and we get S independent draws, then the scale of uncertainty in theta is sqrt(1 + 1/S) * sigma, so it really doesn’t help us to have a very large S, and correspondingly, with dependent draws, a very large ESS. This is useful to remember when comparing to distributional approximations, which in the best cases have very small RMSE, which would correspond to a really high ESS (and an analytic solution would have infinite ESS). Naturally this makes generic comparison more complicated, as we really should also report how these estimates are going to be used and which error matters.
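To illustrate that sqrt(1 + 1/S) factor numerically:

```python
# The total posterior scale sqrt(1 + 1/S) * sigma is essentially flat once S
# (independent draws, or ESS for dependent draws) reaches a few hundred.
for S in [10, 100, 1_000, 10_000]:
    print(S, round((1 + 1 / S) ** 0.5, 5))
# 10 -> 1.04881, 100 -> 1.00499, 1000 -> 1.0005, 10000 -> 1.00005
```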

I was referring to all estimates of ESS, even those that use the true value.

I think your post summarizes very nicely exactly what I’ve been trying to say. ESS measures how the estimate varies around what it is trying to estimate, with complete disregard for whether the thing being estimated is close to or far from the true value of the parameter. Laplace approximation/INLA/some degenerate MCMC basically have infinite ESS… So, if I have the true value of the parameter, why not just measure the squared distance (between estimates and the true value), which we can still decompose into a bias term and a term that corresponds to ESS?

But does it really matter, in the case of using means? I’d say the bias and variance are all that matters, regardless of how the estimated mean is going to be used, as they contain practically all the information about the distribution of the true mean. … Assuming we have taken enough samples (in this case I’m referring to the number of chains at a fixed number of MCMC samples; one chain = one sample of the estimated mean) and that the estimates behave well enough for the CLT to apply.

Thanks for pointing me to this discussion! Very useful.

Thanks, but that’s not what I tried to say.

I tried to say that you can relate the squared distance directly to ESS. A deterministic method is just unable to reduce it given more time. Although in many cases you can use the deterministic approximation as an importance sampling distribution, and then you can sometimes improve ESS given more time (see https://arxiv.org/abs/1507.02646 and [1802.02538] Yes, but Did It Work?: Evaluating Variational Inference).

But why investigate only the mean? What about the variance or the posterior intervals we like to use to quantify our uncertainties? Of course you can say these are means of some functions, but that is my point: it’s not just the means of the parameters but also some functions of those parameters, and that requires more thought as well.

I apologize. It still seems to me like we are talking about the same thing, so something is obviously not clear to me. I’ll try to be more specific.

An MCMC chain (of some length) can be viewed as an estimator of a parameter’s mean. It has some variance and, relative to the true value of the parameter, a bias. The squared error of the estimate can be decomposed into two parts, one that comes from the variance and one that comes from the bias.

The part that confuses me now is that ESS can be directly related to squared error (as in, if I know the ESS, I know the squared error). As far as I understand it, ESS relates to the variance part, but not the bias part, of the squared error, so, at least when the true parameter is not known, there is no way of directly relating it to the squared error. It would be easy to construct two MCMC samplers with N_eff = N but different squared errors. Did you implicitly assume that the estimator is unbiased or that we are able to take out the bias?

And yes, deterministic approximation cannot reduce the variance part, but that is because it is already 0 (=> infinite ESS). Or am I missing something?
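To make that decomposition concrete, a small sketch over replicated runs (the replicated estimates and the reference mean are hypothetical inputs):

```python
import numpy as np

def bias_variance(estimates, reference_mean):
    """Decompose the MSE of replicated posterior-mean estimates into
    bias^2 + variance. ESS speaks only to the variance term; a deterministic
    approximation has (near-)zero variance but can carry all of its error in
    the bias term."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - reference_mean
    variance = estimates.var()
    return {"mse": bias ** 2 + variance, "bias_sq": bias ** 2, "variance": variance}
```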

(thanks for the papers! I’ve browsed through the PSIS one before, but I need to read it more thoroughly)

The exact same argument applies to ESS.

Yes, we all use posterior intervals in our Bayesian analyses (it would be quite ironic not to), but I haven’t seen many papers that go beyond “my ESS for the means is good, my SEs of the means are negligible, and now I can freely interpret any function of my posterior”.

The point isn’t that 50% performance doesn’t matter - that’s a huge improvement by anyone’s standards. The point is that it’s hard to measure accurately within a binary order of magnitude because of factors varying from system hardware to OS to compiler to load to CPU to everything else that impinges on computation these days.

The one place this really matters, though, is when you have one computation you want to do and can tune it. Then everything else is fixed for you, and you can just tune the thing you really care about, which is your wall time (or watt-hours used, or whatever your metric is).

What do you care about? ESS/time gives you accuracy of your posterior expectation calculations, which is usually the only thing we care about for MCMC.

Sure, if you have a provably geometrically ergodic chain and you know roughly what the answers are and you’ve debugged your software, by all means get up on the high wire without a net.

Agreed. We’re trying to figure out how to bring this down.

But if any of those things Michael mentions are ill-behaved, then all of these comparisons are meaningless, because none of the systems are going to be getting the right answer. At that point, you have to back off to some kind of approximate measure like prediction in some task or something.

No. The problem is that you’re fitting a sample, so the “correct” answer for the sampler isn’t the true value, it’s the posterior relative to the same y you drew. Consider a simulation y1, y2, y3 ~ normal(0, 1). I might get -1.2, -1.9, and 0.1 in a simulation, at which point the mean is -1, not 0. The same thing holds with other processes and more data.
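A tiny sketch of that point (plain NumPy, nothing Stan-specific; the flat-prior, known-sigma setup is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=0.0, scale=1.0, size=3)   # simulated data; true mean is 0
# With a flat prior on mu and known sigma = 1, the posterior mean of mu is the
# sample mean ybar, so ybar (not 0) is the value a correct sampler should recover.
print("true parameter:", 0.0, " sampler target (ybar):", y.mean())
```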
