To satisfy my own curiosity, I’ve been doing some comparisons of Stan and JAGS in terms of sampling efficiency (second per effective draw) for some simple hierarchical models. I thought it also might be of interest to others, so I decided to write it up in a blog post, which can be found here: http://www.boelstad.net/post/stan_vs_jags_speed/

If anyone has suggestions for making the comparison more fair or relevant, I’d be happy to update accordingly.

I may also consider turning this into some kind of research note, and if anyone from the development team would be interesting in joining, just let me know.

@joeboe
I read your research note just now and found it to be very interesting and informative. I’m between a beginner and an intermediate in my exposure and understanding of anything Bayesian. I’m planning to take your code and also replicate the results you have shared.

I’m doing similar comparisons but between Stan vs. pymc3 vs WinBugs with straight forward multiple regression models on real-world datasets. Once done I hope to share these results.

When using effective sample size to compare samplers you have to be very careful to first verify that the effective sample size is a meaningful quantity. Unfortunately effective sample size is meaningful only in the ideal conditions where a Markov chain Monte Carlo central limit theorem holds – see Sections 1 and 2 of https://arxiv.org/abs/1701.02434 and https://betanalpha.github.io/assets/case_studies/rstan_workflow.html for more details.

So far, I’ve been running three parallel chains, setting the warm-up length for each model to yield low split Rhats. I don’t think the Stan models are displaying problematic behavior (such as divergences), but I take your point that it is worth checking the models further (perhaps especially for JAGS). I haven’t meant to imply that non-centering is only about speed. I’ll try to be clearer about this.

One of the problems with the Gibbs samplers used in JAGS is that they don’t admit much diagnostic capabilities beyond R-hat and R-hat methods are much less sensitive then you might expect, especially in the context of hierarchical models. For example, the Gibbs samplers can explore the hyperparameters so slowly that it takes hundreds of thousands of iterations before you’ll see the pathological sticking behavior that identifies the bias and causes bad R-hats. More chains helps, although there are always practical limitations.

In many cases it’s only with comparisons to HMC methods that we’ve realized that earlier fits were in fact biased. Consequently in these comparisons you have to very carefully check the posteriors for consistency, which is discussed in more depth in the “Hamiltonian Monte Carl for Hierarchical Models” paper.

Again, the most important factor when comparing MCMC methods is that effective sample size is meaningless unless a central limit theorem holds and so one has to be careful to not compare effective sample sizes before diligently verifying the conditions for a central limit theorem as much as possible.