Just did my first science blog. There was an interesting INLA vs JAGS comparison at https://www.precision-analytics.ca/blog-1/inla, so I also added Stan to the mix. Surprised how well INLA works (at least in the case I tested). Any feedback welcome:
Nice blog post!
Did you test if Stan converges better for the largest sample size if you use centered parameterization? My understanding is that non-centered parameterization is not necessarily better than centered if one has lots of data.
Obviously I’m not surprised INLA works ;p (My experience is that most people who think it doesn’t haven’t tried it.). We’ve tested it pretty thoroughly on the types of models it applies to and it these results basically always hold. A few random thoughts:
- The transformation that takes the hyper parameters from the internal scale to the natural scale can introduce more error than necessary. You’d get even better results if you compared the log-precision, which you can get from
- Your instinct that you should do more runs is correct. Because the data is random, all the comparison statistics are random variables and they might have some quite heavy tails! I wrote a little about that here: http://andrewgelman.com/2017/10/20/barry-gibb-came-fourth-barry-gibb-look-alike-contest/
- We are putting some work into speeding up Stan for some of these models by incorporating some INLA-like ideas, so the speed difference might go down.
And while Daniel mention only INLA, I can tell that in independent experiments with GPstuff software (and based on many papers using other code) using LA, INLA, and EP distributional approximations which use Gaussian approximation to integrate over the latent values work very well for specific set of Gaussian latent variable models (with some extension to near-Gaussian) even in high dimensional cases (it just happens that in these cases the conditional posterior is so close to Gaussian that things are easy).
On the example in the post, centered parametrization has lots of divergent transitions for the smaller sample sizes. So I would not bet on it being successful in larger dataset. Also I didn’t really mess a lot with the larger data as these took quite some time to fit.
And thanks everybody for the kind words!
Guido’s point was that non-centered is good for small sample sizes, but it can be bad for larger datasets.
Enough theory, you nagged me into actually running the centered parametrization :-) On the model + seed + data in the blog, I get low BFMI warnings for all N and for N = 25000 I also get 330 transitions exceeding max treedepth. The inferred param distributions are almost identical except for some deviations in tau_nu (on the order of few percent for both mean and sd). The execution times are comparable with centered slightly faster for N <= 5000 and slightly slower for N = 25000.
For this particular case I therefore see no advantage in using centered parametrization.
I just realized, that I forgot to mention that if the number of parameters increase with increasing number of the data (as in your model), then likelihood will not dominate and non-centered stays better than centered.
@avehtari: I wasn’t aware of the role of the number of parameters. Thanks for pointing this out.
The theory’s pretty clean here and holds up in practice. Betancourt and Girolami explain ho the parameterization depends on how constrained the posterior is by a combination of the data and the prior. It predicts exactly the same behavior we see in practice, which is not surprising given that it’s based on simulations.
It’s missing the most important part of a blog—space for comments. So I’ll comment here.
It looks like you just used the default number of iterations and chains in Stan. After warmup has converged, it’s basically a matter of how much precision you want in the posterior expectations—that is, what’s the posterior MCMC standard error? You might also not need all that warmup if you want to be running Stan at ideal settings. The same thing goes for BUGS.
Did you run Stan in four cores in parallel on a machine with four physical cores? Was there any multi-threading or multi-process stuff engaged for INLA?
Plots are much easier to read than tables. I’d suggest fewer digits of precision in your tables if you must use tables as you note there’s variation due to seeds.
The problem with those diffuse gamma priors is getting convergence with either Stan or JAGS (the sampler may not be geometrically ergodic) and with overly heavy tails of the priors pulling harder than the data. Also, we prefer to put priors directly on scale (std. dev) as it’s on the same scale as the mean and thus much easier to understand. Rather than
sqrt( / tau_nu)
you wantinv_sqrt(tau_nu)`—it has specialized gradients. Not that it’ll matter here as you only do this calculation once.
Good point - it already is getting way more attention than I expected, so I have setup Disqus on the site. And thanks for all the input!
As for your other points: I was running both programs at default settings: Stan at 4 chains, INLA used all 8 cores of the machine. I understand that this (and other choices in the analysis) are not exactly a fair comparison of algorithms, but I wanted to keep the perspective of a normal user (which happens to be my perspective mostly). And from the user’s perspective it is IMHO more important how long does the program run with default settings. Users are also IMHO more interested in the order of magnitude than exact timing (so I think it is OK-ish to run just once - more would obviously be better). I also deliberately kept the exact model as in the original INLA post. I did however check that using the preferred prior on std. dev or other minor changes do not affect the runtime much (only tested for smaller datasets).
I’ll look into the formatting/presentation issues - currently it is just the default (and simplest) output I could get :-).
Yes. I think this is correct. Comparing the interfaces (tweaked to make the model the same) as used (which is almost always with defaults unless something goes wrong) will more accurately reflect user experience.
I think the number one priority should be getting the right answer. The problem is that how close you get to the right answer depends on number of iterations.
Changing number of iterations from 2000 to 200 would change the run time by an order of magnitude. Running in parallel on 8 cores rather than on a single core (Stan’s default) would be another order of magnitude. These matter.
In other words, I think knowing how long it takes to run the default settings is meaningless without a quantification of the accuracy of the answer.
People do this all the time and report that JAGS finishes the same number of iterations faster than Stan. What they forget to do is adjust for effective sample size (i.e., precision).
I don’t know how to solve this problem. Maybe we need to implement the timed runs that Andrew is advocating and run them until we get to an effective sample size of 20 by default.
How to compare different algorithms for Bayesian inference in terms of speed (and subject to effectiveness)
I guess, I am moving towards metaphysical plane :-), but if you don’t mind, I’ll explain my views a bit more.
I agree with your points, but I don’t believe they reflect experience of Stan users like me (which, I guess is a large fraction). In particular, I think that tweaking no. of iterations/chains is a power user feature. I, as a user of Stan (or INLA) operate under the assumption that the algorithm runs until it gives me a “reasonable” precision and that it will tell me if something broke badly. Note that JAGS does not have such guarantee, which IMHO disqualifies it from measuring the experience of a non-power user as JAGS basically requires you to be a power user.
If nothing breaks, I keep the defaults. It therefore IMHO is important how long the model runs with default settings. If you believe that runs with default settings are overly slow, it might be a reason to reevaluate the defaults - for example it might make sense to add an early stopping criterion when n_eff for all parameters goes beyond a certain value. Or find a way to figure out that you are already exploring the typical set and cut the warmup short (or count some warmup samples retroactively as sampling). Currently, the only simple way to determine that you need fewer iterations is to fit the model with more iterations first - and then it kind of loses the point.
Also, once you have one chain per core, changing the number of chains or number of samples you collect after warmup cannot cut the execution time below 50% (and frequently even less as the warmup iterations tend to be slower than sampling). And determining how many warmup iterations you need is even harder then determining what n_eff you should aim for - once again, IMHO something only power users can do confidently. If INLA can effectively use as many cores as you give it, than it is IMHO a benefit of INLA over Stan and - for user purposes - it makes little sense to measure speed of INLA restricted to fewer cores.
Running on a single core is a default, but since the package startup message tells you to enable multicore, I guess multicore is “almost default” (BTW - why is single core a default? Makes little sense to me, but I didn’t observe the history of Stan).
I believe it is important to think about n_eff/time and other more precise metrics, but I also think it is important to think about the experience of an average user.