Excellent paper on Comparing JAGS, NIMBLE, and Stan

@Bob_Carpenter @betanalpha @andrewgelman @avehtari @paul.buerkner (kindly tag others for whom this may be relevant)

I came across this paper that compares JAGS, NIMBLE, and Stan using a detailed and consistent framework across 4 different classes of models (Linear, Logistic, Mixture and Accelerated Failure Time models). I currently use all of the three tools in my work and I thought this would be of interest to Stan developers and end users alike. Here is the link to the paper:

2107.09357v1.pdf (arxiv.org).

The main conclusion is as follows:

Taking into account all of our simulations, Stan appears the go-to solution if one wants to learn only one between JAGS, Stan and NIMBLE programs. In fact, the NUTS HMC sampler employed by Stan provides always high-quality chain (measured by the index E). Since HMC relies on the gradient of the log posterior density, considering non-informative priors such as the uniform distribution, for which the gradient is computationally cheap to compute, provides a speed-up for the sampling time required by Stan.

Additional caveats and scenarios where JAGS and NIMBLE perform better than Stan are also identified:

In general, conjugate or semi-conjugate models (i.e., when the full-conditionals are in closed form) are very well fitted by JAGS, which is usually faster than Stan and it is able to generate chains with high effective sample size. In fact, since in conjugate or semi-conjugate models full conditional distributions are obtained in closed form, the Gibbs sampler used by JAGS samples from them easily and quickly. As a result, JAGS is faster than Stan, which is penalized by the computation of the gradient that requires some time.

NIMBLE has proved to be the best software to fit mixture models, where the Gibbs sampler was used, in this case being the fastest software and the one with highest quality of the chains, especially in the case of four components mixtures.

Sree

3 Likes

I’d like to know two things:

  1. When they say Jags is faster than Stan, is that per effective sample size? Per-iteration comparisons would not be appropriate when comparing different algorithms.
  2. When they fit mixture models, did they use the mixture formulation (not the discrete latent-data formulation) as described in the Stan manual?

I’ve found the GitHub repo with the Stan models that they used here: GitHub - daniele-falco/software_comparison

Unfortunately, it looks like their models are using the less efficient constructions for the various likelihoods (e.g., not using the _glm distributions, and using log_sum_exp instead of log_mix for the mixture distributions)

1 Like

@andrewgelman - they report on 3 different metrics for each of the 4 model types and for different priors. These are the metrics they compare on:

  1. Average effective sample size for the regression coefficients,
  2. compilation time in seconds and
  3. iterations per second for the model under different priors.

They have a github page (linked in the paper) and see here ( GitHub - daniele-falco/software_comparison ) where all the synthetic data and model specifications are available.

It appears @andrjohns .has already examined the specs of the Mixture models.

Sree

@andrjohns It would be nice to change these specs to the more efficient Mixture model setup and re-test the model results and subsequent comparisons.

Hi, yes, I saw that. I don’t think iterations per second is the right thing to use. It might not make much of a difference in this case–I have no idea–but I know that users coming from Bugs/Jags often get into the habit of running 100,000 iterations, and when they come to Stan we have to train them to run fewer.

Here is a link to one of the authors page with greater in depth details on how they compare across different software / tools / algorithms:

MCMC_comparisons

They acknowledge that their code for Stan might not be the most efficient as follows:

To put that another way, the other package developers might be able to get big improvements if they do some customization for each model as well. For example, the Stan team has pointed out that their code for some of the classic BUGS examples could be re-written for more efficient performance. When they translated those examples to Stan code, they roughly followed the template of the original BUGS code rather than trying to squeeze more performance out of Stan.

This goes counter to everything I thought to know. Is this really true?

2 Likes

The log_mix documentation still states that it only works with real values ( 3.14 Composed functions | Stan Functions Reference (mc-stan.org)).

In the SUG it shows that one should use log_sum_exp ( 5.3 Summing out the responsibility parameter | Stan User’s Guide (mc-stan.org)).

We need to update the docs if we want people to use these functions. I’m actually a bit confused on how to use the updated log_mix myself!

1 Like

I suppose it marginally decreases the cost of each gradient evaluation, but frequently at the expense of more evaluations per iteration and more iterations per effective sample. I imagine that in virtually all important classes of model, the cost of evaluating the gradient of the prior pales in comparison to the other considerations.

Just for a demo:

parameters {
  vector[10000] x;
}

model {
  target += normal_lpdf(x | 0, 10);  //suppose this is the likelihood
  target += std_normal_lpdf(x);      //and suppose this is the prior.
//Removing the final line above yields marginal speed-up despite that doing 
//so makes the posterior match worse to the initial guess for the inverse metric 
// (which is also the thing that the inverse metric gets regularized towards).
//Presumably, the reason for this speed-up is the cost of the extra gradient evals.
}
2 Likes

Can anyone ascertain how they chose the number of warmup iterations in Stan, and how they chose the number of adaptation and burn-in iterations in JAGS?

1 Like

Thanks for sharing this. I notice a couple things about their metrics that are interesting.

Their measures are (as @sree_d noted):

Average effective sample size for the regression coefficients (\mathcal E_\beta), compilation time in seconds (tc) and iterations per second (N_it/t_s)

Their raw measure of sampling “quality” is ESS as a fraction of total samples, \mathcal E = ess / N_s.

What they report is the average of \mathcal E across all parameters; in practice, at least as I’ve been seeing it, its the slowest sampling parameter that governs efficiency since I want to keep going until I see that there aren’t any parameters with low ESS.

Before calculating \mathcal E, they thin the chains:

In some cases, it might happen that ess > N_s; chains where this phenomenon occurs are called antithetic. In order to have \mathcal E \in [0, 1], we always assume Nthin equal to 2, that is we discard one every two iterations of the Markov chain, thus avoiding the antithetic behavior.

Wouldn’t this bias the results against an algorithm that is designed to avoid random walk behavior? Trying to remove negative autocorrelation to measure ess/N_s doesn’t make sense to me.

3 Likes

I have to respectfully disagree with @sree_d: if anything this paper demonstrates many common mistakes that are made when trying to compare Markov chain Monte Carlo methods.

Every probabilisitic computation method approximates expectation values with respect to a given target distribution; in the Bayesian context that target distribution is a posterior distribution of interest. The only well-defined performance metric across all possible methods is the cost needed to achieve a given approximation error for a given expectation value.

Markov chain Monte Carlo does not, in general, have any guaranteed error quantification and so there’s no way to quantify that performance without considering functions whose expectation values are known exactly already. In some cases, however, a Markov chain Monte Carlo central limit theorem holds which quantifies the approximation error probabilistically. Assuming that the central limit theorem holds then the scale of the error for the expectation value of the function f is given by the Markov chain Monte Carlo standard error,

\text{MCMC-SE}[f] = \sqrt{ \frac{ \text{Var}_{\pi}[f] }{ \text{ESS}_{\pi}[f] } },

where \text{ESS}_{\pi}[f] is the effective sample size for that function. For much more see Markov Chain Monte Carlo in Practice.

If a Markov chain Monte Carlo central limit theorem holds for all of the samplers being considered then the sampler performance can be ranked by \text{MCMC-SE}[f] per unit computational cost. The (wall clock) run time is often used as a proxy for computational cost. If only the ranking is of interest, and not the actual distance between performance quantifications, then the same ranking is given by comparing \text{ESS}_{\pi}[f] per computational cost. Be careful, however, because \text{ESS}_{\pi}[f] and \text{MCMC-SE}[f] are nonlinearly related the differences in performance will not scale in the same way.

If a Markov chain Monte Carlo central limit theorem doesn’t hold then the effective sample size \text{ESS}_{\pi}[f] is meaningless. An empirical effective sample size can be evaluated but it won’t correspond to any meaningful quantity let alone quantify performance.

In this ideal circumstance where all of the central limit theorems hold then \text{ESS}_{\pi}[f] / \text{Cost} can be further decomposed into \text{ESS}_{\pi}[f] / N_{\text{iterations}} and N_{\text{iterations}} / \text{Cost}. In other words the incremental effective sample size and the cost to run each iteration of the Markov chain. Care has to be taken, however, because neither number means much by itself; it’s only the product that matters for quantifying sampler performance.

With all of that said, in order to properly compare performance between Markov chain Monte Carlo algorithms we would have to

  1. Fix a target distribution \pi

  2. Verify that all samplers satisfy Markov chain Monte Carlo central limit theorems for \pi.

  3. Fix a well-behaved function of interest f.

  4. Run each sampler and compute \text{MCMC-SE}[f] / \text{Cost} or \text{ESS}_{\pi}[f] / \text{Cost}.

Step 2 is very difficult for many samplers. In practice often the best we can do is verify it for one sampler and then use the estimated expectation value from that sampler as a ground truth when quantifying the errors for the other samplers. At the very least the sampler estimates have to be compared to ensure that they’re all compatible with each other. This isn’t perfect but often the best we can do.

The NIMBLE comparison paper doesn’t verify Markov chain Monte Carlo central limit theorems, it doesn’t compare the actual estimates for consistency, and it doesn’t make comparisons in the context of a single, fixed function. There are multiple other misunderstandings throughout the paper – for example with automatic differentiation the cost of evaluating the gradient is not much higher than evaluating the target density itself and, as @jsocolar mentioned, the cost per gradient is an incomplete number by itself as it doesn’t take into account the effective sample size per gradient evaluation. Both of these numbers are affected by the choice of prior model but in very different ways.

Markov chain Monte Carlo, and probabilistic computation in general, is a very much both conceptually subtle and mathematically sophisticated. It’s complex and difficult and misunderstandings are inevitable. But sloppy comparisons like this propagate those misunderstandings, which then confuse users who just want to quantify their posterior distributions.

7 Likes

@betanalpha Thank you so much for your detailed note. Your note and some of the observations others have posted above is precisely why I posted the link to this paper. It is beyond my pay grade to critically examine such articles, as I consider myself a novice in this space. So I was hoping the more experienced practitioners of Bayesian modeling will examine and share their thoughts on the paper.

Searching on the web, such comparisons have been extremely hard to find. This is the first paper I found that did a JAGS vs. NIMBLE vs. STAN at some depth utilizing different types of models. I hope there will be more comparisons undertaken which will expand our Bayesian understanding.

Sree.

Why is this “Bayesian understanding”? Solving Bayesian problems requires usually MCMC… but in all honesty, I would be really happy to take something else which just runs faster and spits out the posterior as a sample, a density or whatever… as long as it is correct.

Don’t want to derail … but MCMC is only a computational method to solve a problem. Quite important for Bayes, for sure.

1 Like

@wds15 I agree with you that MCMC is a computational method that need not be burdened with Bayesian or other labels.
I mentioned Bayesian understanding because, personally, I associate probabilistic way of operationalizing, measuring, estimating, and predicting with Bayesian approaches rather than frequentist approaches. However, I do agree, a better phrasing would be something like “improving our understanding of different MCMC samplers and techniques”

Thanks for taking the time to read and share your feedback!

Sree