Just did my first science blog. There was an interesting INLA vs JAGS comparison at https://www.precision-analytics.ca/blog-1/inla, so I also added Stan to the mix. Surprised how well INLA works (at least in the case I tested). Any feedback welcome:
Nice blog post!
Did you test if Stan converges better for the largest sample size if you use centered parameterization? My understanding is that non-centered parameterization is not necessarily better than centered if one has lots of data.
Obviously I’m not surprised INLA works ;p (My experience is that most people who think it doesn’t haven’t tried it.). We’ve tested it pretty thoroughly on the types of models it applies to and it these results basically always hold. A few random thoughts:
- The transformation that takes the hyper parameters from the internal scale to the natural scale can introduce more error than necessary. You’d get even better results if you compared the log-precision, which you can get from
- Your instinct that you should do more runs is correct. Because the data is random, all the comparison statistics are random variables and they might have some quite heavy tails! I wrote a little about that here: http://andrewgelman.com/2017/10/20/barry-gibb-came-fourth-barry-gibb-look-alike-contest/
- We are putting some work into speeding up Stan for some of these models by incorporating some INLA-like ideas, so the speed difference might go down.
And while Daniel mention only INLA, I can tell that in independent experiments with GPstuff software (and based on many papers using other code) using LA, INLA, and EP distributional approximations which use Gaussian approximation to integrate over the latent values work very well for specific set of Gaussian latent variable models (with some extension to near-Gaussian) even in high dimensional cases (it just happens that in these cases the conditional posterior is so close to Gaussian that things are easy).
On the example in the post, centered parametrization has lots of divergent transitions for the smaller sample sizes. So I would not bet on it being successful in larger dataset. Also I didn’t really mess a lot with the larger data as these took quite some time to fit.
And thanks everybody for the kind words!
Guido’s point was that non-centered is good for small sample sizes, but it can be bad for larger datasets.
Enough theory, you nagged me into actually running the centered parametrization :-) On the model + seed + data in the blog, I get low BFMI warnings for all N and for N = 25000 I also get 330 transitions exceeding max treedepth. The inferred param distributions are almost identical except for some deviations in tau_nu (on the order of few percent for both mean and sd). The execution times are comparable with centered slightly faster for N <= 5000 and slightly slower for N = 25000.
For this particular case I therefore see no advantage in using centered parametrization.
I just realized, that I forgot to mention that if the number of parameters increase with increasing number of the data (as in your model), then likelihood will not dominate and non-centered stays better than centered.
@avehtari: I wasn’t aware of the role of the number of parameters. Thanks for pointing this out.
The theory’s pretty clean here and holds up in practice. Betancourt and Girolami explain ho the parameterization depends on how constrained the posterior is by a combination of the data and the prior. It predicts exactly the same behavior we see in practice, which is not surprising given that it’s based on simulations.
It’s missing the most important part of a blog—space for comments. So I’ll comment here.
It looks like you just used the default number of iterations and chains in Stan. After warmup has converged, it’s basically a matter of how much precision you want in the posterior expectations—that is, what’s the posterior MCMC standard error? You might also not need all that warmup if you want to be running Stan at ideal settings. The same thing goes for BUGS.
Did you run Stan in four cores in parallel on a machine with four physical cores? Was there any multi-threading or multi-process stuff engaged for INLA?
Plots are much easier to read than tables. I’d suggest fewer digits of precision in your tables if you must use tables as you note there’s variation due to seeds.
The problem with those diffuse gamma priors is getting convergence with either Stan or JAGS (the sampler may not be geometrically ergodic) and with overly heavy tails of the priors pulling harder than the data. Also, we prefer to put priors directly on scale (std. dev) as it’s on the same scale as the mean and thus much easier to understand. Rather than
sqrt( / tau_nu)
you wantinv_sqrt(tau_nu)`—it has specialized gradients. Not that it’ll matter here as you only do this calculation once.
Good point - it already is getting way more attention than I expected, so I have setup Disqus on the site. And thanks for all the input!
As for your other points: I was running both programs at default settings: Stan at 4 chains, INLA used all 8 cores of the machine. I understand that this (and other choices in the analysis) are not exactly a fair comparison of algorithms, but I wanted to keep the perspective of a normal user (which happens to be my perspective mostly). And from the user’s perspective it is IMHO more important how long does the program run with default settings. Users are also IMHO more interested in the order of magnitude than exact timing (so I think it is OK-ish to run just once - more would obviously be better). I also deliberately kept the exact model as in the original INLA post. I did however check that using the preferred prior on std. dev or other minor changes do not affect the runtime much (only tested for smaller datasets).
I’ll look into the formatting/presentation issues - currently it is just the default (and simplest) output I could get :-).
Yes. I think this is correct. Comparing the interfaces (tweaked to make the model the same) as used (which is almost always with defaults unless something goes wrong) will more accurately reflect user experience.
I think the number one priority should be getting the right answer. The problem is that how close you get to the right answer depends on number of iterations.
Changing number of iterations from 2000 to 200 would change the run time by an order of magnitude. Running in parallel on 8 cores rather than on a single core (Stan’s default) would be another order of magnitude. These matter.
In other words, I think knowing how long it takes to run the default settings is meaningless without a quantification of the accuracy of the answer.
People do this all the time and report that JAGS finishes the same number of iterations faster than Stan. What they forget to do is adjust for effective sample size (i.e., precision).
I don’t know how to solve this problem. Maybe we need to implement the timed runs that Andrew is advocating and run them until we get to an effective sample size of 20 by default.
How to compare different algorithms for Bayesian inference in terms of speed (and subject to effectiveness)
I guess, I am moving towards metaphysical plane :-), but if you don’t mind, I’ll explain my views a bit more.
I agree with your points, but I don’t believe they reflect experience of Stan users like me (which, I guess is a large fraction). In particular, I think that tweaking no. of iterations/chains is a power user feature. I, as a user of Stan (or INLA) operate under the assumption that the algorithm runs until it gives me a “reasonable” precision and that it will tell me if something broke badly. Note that JAGS does not have such guarantee, which IMHO disqualifies it from measuring the experience of a non-power user as JAGS basically requires you to be a power user.
If nothing breaks, I keep the defaults. It therefore IMHO is important how long the model runs with default settings. If you believe that runs with default settings are overly slow, it might be a reason to reevaluate the defaults - for example it might make sense to add an early stopping criterion when n_eff for all parameters goes beyond a certain value. Or find a way to figure out that you are already exploring the typical set and cut the warmup short (or count some warmup samples retroactively as sampling). Currently, the only simple way to determine that you need fewer iterations is to fit the model with more iterations first - and then it kind of loses the point.
Also, once you have one chain per core, changing the number of chains or number of samples you collect after warmup cannot cut the execution time below 50% (and frequently even less as the warmup iterations tend to be slower than sampling). And determining how many warmup iterations you need is even harder then determining what n_eff you should aim for - once again, IMHO something only power users can do confidently. If INLA can effectively use as many cores as you give it, than it is IMHO a benefit of INLA over Stan and - for user purposes - it makes little sense to measure speed of INLA restricted to fewer cores.
Running on a single core is a default, but since the package startup message tells you to enable multicore, I guess multicore is “almost default” (BTW - why is single core a default? Makes little sense to me, but I didn’t observe the history of Stan).
I believe it is important to think about n_eff/time and other more precise metrics, but I also think it is important to think about the experience of an average user.
I think it’s smarter to do this backwards - if it’s taking a long time, see if you can fit the model with just a few iterations and see how many effective samples you get. If it’s enough, stop; otherwise, do some back of the napkin math to guess how many you might need to get enough.
We don’t teach it as a power user feature. I actually think Stan is basically for power users in very many ways. By the time you get to Stan you have decided that there is no pre-packaged Bayesian version of the model you’d like to fit and you need to build your own. I think the suggestions you’ve made for cutting down on time violate necessary MCMC conditions (not that that stops some software), but @betanalpha can tell you more about that.
I think there is an important meta point here - everyone assumes that they personally represent the average Stan user (myself included, if we project out to “what would the average Stan user look like if I worked hard enough”). We should probably actually measure that. @lauren and @breckbaldwin are working on a survey to figure that out; I think you might have insight into some questions it would be good to ask that we didn’t think of yet (e.g. “do you know what an effective sample is,” “do you care about getting enough of them,” etc).
It think pretty much all of our users are in some minority or other as it’s a very heterogeneous group. But I take your point that a lot of our users would think of changing settings a a power-user feature.
This indicates we need to rethink what we’re doing in the way Andrew’s been talking about to monitor convergence online and only run until necessary. The problem is that we don’t know how much precision to give as a result as it will depend on the application. Our inclination would be to run until n_eff = 50 or so, which probably only means a few dozen post warmup iterations in each chain, which then isn’t enough to actually estimate n_eff, so we have to run a bit longer until we can reliably estimate we have n_eff > 50 (the n_eff calculations themselves are expensive as they involve an FFT on each column of draws).
Therein lies the problem we face. Nothing will break in traditional low-dimensioanl examples if we crank down the settings to 200 warmup and 200 post-warmup iterations (1/10 of what we use now). We’d look a lot better in all these comparisons. Like this one to choose another example that just came up in another thread:
But those settings aren’t as robust to harder problems. So we decided to make easier problems slower in order to make harder problems more robust. So it hurts every time someone compares easy problem settings.
I’m not sure what you mean here. Warmup doesn’t have to be 50% of the runtime. If you need n_eff=1e5 and only need 1e3 warmup iterations, then most of your time’s spent sampling, not warming up.
You’ll need to ask Ben or Jonah. CRAN places a lot of restrictions on packages, so that might be it. Or it might just require fewer dependencies. At one point (maybe still), error messages were getting swallowed with multicore. Getting all this stuff running under R is a huge problem.
I agree with your points, just a minor thing:
Yes, but that is IMHO rather infrequent, when running with defaults, warmup is in my expereince well over 60% as the iterations tend to take longer before adaptation kicks in. And you usually don’t need that much
n_eff while making sure the warmup succeeded is a high priority (and harder to check).
Also I don’t think any of this is important when comparing Stan to INLA on large models as the speed difference is an order of magnitude. I however have a model I am interested in where INLA is biased, so I hope to write about it as well sometimes. Will have to use INLA though as Stan is a no-go due to speed and the bias of INLA might be acceptable - I need to fit thousands of hierarchical models with thousands of parameters each.
The warmup iterations usually take a bit longer than the sampling iterations, especially for complex models where the step size needs to be set low to learn the scales.
I was suggesting it would make sense to override the defaults for a factor of two speedup in the case where you need a lot of posterior draws relative to how long warmup takes in the default settings.
Right. That’s why we took a rather conservatively long warmup period. It takes a few iterations to estimate variance reliably anyway. But then after convergence it also takes quite a few iterations to estimate n_eff and R-hat.
Unless you have Google amounts of power to burn, INLA’s a far better bet if it will fit the models you care about. We’re working with Dan Simpson on adding some of INLA’s marginalization tricks to Stan. But like INLA, they’ll be restricted to certain flavors of models.