Number of iterations

Hello everybody,

This might be a question that newbies like me ask over and over, and it might even be quite annoying. However, here I go.

In papers on Bayesian inference I have seen anything from a few thousand iterations for warm-up and sampling up to huge numbers (>100000).

In this post from the old Google group, Bob suggested checking n_eff to decide whether it is necessary to increase the number of iterations (i.e., by doubling it).

Using shinystan, I have noticed that one can choose the n_eff / N warning threshold (ranging from 0 to 100%, with a default of 10% if I am not mistaken).

So, I have a few questions:

1. Is there any guidance on how to pick this threshold?
2. In the post mentioned above, someone asked whether it is possible to start a new simulation from the output of a previous one, and the answer at the time was that it wasn’t possible. Is it possible now?
3. I have some models where the Monte Carlo SE / posterior SD and the Rhat statistic show no warnings at all, while the n_eff / N warning is present for some estimates. Is this a sign that I should increase the number of iterations even though the chains seem to have converged?

1. Specifically for the Stan implementation of HMC, for most well-parameterized models you can expect to get more than one effective sample per 10 iterations. So the 10% threshold is a good one. If you’re getting fewer than that you should consider other options for parameterization (unless you’re getting a big enough effective sample size anyway, in which case carry on).

2. I think it’s either close or already possible, not sure if it’s made it into the interfaces yet. It’s not much help though.

3. Nah, if your sample is big enough and you meet the convergence criteria don’t worry about it. OTOH if you’re running Stan for hours/days to get your samples you are a) wasting your time; and b) likely using a terrible parameterization (or a very hard model) and you could do much better.

4. All that stuff about running a million iterations and thinning by 10k is irrelevant for Stan/HMC, don’t do that.
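
To make the 10% rule from point 1 concrete, here is a minimal Python sketch of the check. The function name, the parameter names, and the n_eff values are all made up for illustration; this is not how shinystan implements its warning, just the same ratio computed by hand.

```python
def low_ess_ratio(n_eff_by_param, n_draws, threshold=0.10):
    """Return the parameters whose n_eff / N falls below the threshold.

    n_eff_by_param: dict mapping a parameter name to its n_eff
    n_draws: total number of post-warmup draws (N)
    threshold: warning threshold on n_eff / N (0.10 mirrors the 10% default
               discussed in this thread; it is an assumption, not a rule)
    """
    return [name for name, n_eff in n_eff_by_param.items()
            if n_eff / n_draws < threshold]


# Hypothetical summary: 4 chains x 1000 post-warmup draws.
example = {"mu": 2100.0, "tau": 310.0, "theta[1]": 950.0}
print(low_ess_ratio(example, n_draws=4000))        # flags "tau" (310 / 4000 < 0.10)
print(low_ess_ratio(example, 4000, threshold=0.05))  # nothing flagged
```

A flagged parameter is a hint to look at the parameterization, not proof of a problem: as noted above, if the absolute n_eff is already big enough for your purpose you can carry on.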

This varies tremendously with the geometry of the problem. What you should be seeing is that if you run it for twice as many iterations, you get twice as large an n_eff. It may be mixing slowly, but that will show you if it’s mixing.
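
The doubling check can be written down as a one-liner. This is only a sketch: the function name and the n_eff numbers are hypothetical, and what counts as "close enough to 1" is a judgment call, since n_eff estimates are themselves noisy.

```python
def ess_scaling(n_eff_short, n_eff_long, iter_ratio=2.0):
    """Observed growth in n_eff divided by the growth in iterations.

    A value near 1 means n_eff grew in proportion to the extra iterations;
    a value well below 1 suggests the chain is not mixing properly.
    """
    return (n_eff_long / n_eff_short) / iter_ratio


# Hypothetical n_eff for the same parameter from a run and a doubled run.
print(ess_scaling(300.0, 610.0))  # close to 1: n_eff roughly doubled
print(ess_scaling(300.0, 320.0))  # well below 1: extra iterations barely helped
```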

In the works for Stan 3. Mitzi coded up the basic I/O, but we’re waiting on refactoring some of the interface code so we don’t have to code this all up twice.

This is only going to matter if (a) you have really long autocorrelation times in an otherwise well-behaved model, and (b) you don’t have enough memory. You always lose information by thinning.
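
A toy calculation shows the trade-off. The sketch below assumes the chain behaves like a stationary AR(1) process with lag-1 autocorrelation rho, which is an assumption made purely for illustration (Stan's HMC chains are not AR(1)). Under that assumption the asymptotic effective sample size is N (1 - rho) / (1 + rho), and keeping every k-th draw gives another AR(1) chain with autocorrelation rho^k.

```python
def ar1_ess(n_draws, rho):
    """Asymptotic ESS of n_draws from a stationary AR(1) chain (toy model)."""
    return n_draws * (1.0 - rho) / (1.0 + rho)


def thinned_ar1_ess(n_draws, rho, thin):
    """ESS after keeping every `thin`-th draw of the same AR(1) chain."""
    # The thinned chain is AR(1) with autocorrelation rho**thin.
    return ar1_ess(n_draws // thin, rho ** thin)


full = ar1_ess(10_000, rho=0.9)                      # about 526
thinned = thinned_ar1_ess(10_000, rho=0.9, thin=10)  # about 483
print(f"full chain: ESS ~ {full:.0f} from 10000 stored draws")
print(f"thinned:    ESS ~ {thinned:.0f} from 1000 stored draws")
```

In this toy case thinning by 10 stores a tenth of the draws and gives up under a tenth of the effective sample size, so it only pays off when memory is the constraint; the thinned ESS is never larger than the full-chain ESS.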

Give users a chance not to learn everything at once: for lots of people doing fairly standard models it’ll be a loooooong time before they run into something that’s well behaved and gets few n_eff / 100 iterations… :)

Depends on what they’re doing. Lots of users jump in with really complex models. I think it’d benefit most of these users to start more gradually, as there are indeed lots of different things to learn when dealing with a new programming language, and a really big program isn’t the best way to do that (Andrew disagrees here: he hates the “hello world” approach as much as I love it).

Unfortunately, some of the relatively simple looking models like stochastic volatility time series or even just a high-dimensional multi-normal will require quite a high iteration to effective sample size ratio.

That’s a good point. I guess my bias is towards leading people in the right direction a step at a time, especially if they haven’t demonstrated that they’re unreasonable to begin with. OTOH if I tell somebody to start with small components and they can’t let their giant model go for a while, well, you can’t save everyone. It’s like the tough guy in the zombie movie who wants to go off on their own. ¯\_(ツ)_/¯

Or the dumb kid in the horror movie. I like the analogy :-)

I had to search that on Google—not exactly clear from looking at it that it’s supposed to be a shrug emoji! Very appropriate, as I’m on vacation in France, the land of the perfect shrug (I can’t quite coordinate the pursed lips, exhalation and slight shrug that’s so expressive). Mitzi has an awesome book of French gestures, my favorite of which is the bullshit gesture, though I’ve never seen it used in France.

They’re sometimes called emojicons or kaomoji. See, for example, http://japaneseemoticons.me.

Some of the table flipping emojicons are works of art.

Flipping Table: (╯°□°）╯︵ ┻━┻

Putting Table Back: ┬──┬◡ﾉ(° -°ﾉ)

Fantastic link, thank you for making my life better.

I lived in France from 6 till 9 (?) I think, I can never keep the dates straight. Long enough to start acting like a little French child by the time we returned to the wrong side of the Berlin wall. Maybe this is why I feel like I should be able to gesture instead of talking sometimes (and why nobody here understands!) :)

What is a ‘big enough effective sample size’?

For example, I have 10,000 data points and ESS of ~300 for a few estimates - these would be picked up by the 10% threshold but not a 5% one. Otherwise, the chains converge, MCSE looks okay, pp_checks are reasonable, and estimates look okay. When I increase the iters, the ESS scales upwards proportionally (or more) but otherwise no changes to model estimates.

What are the pros and cons of selecting either n_eff/N threshold? I’d be keen to save time!

This got posted on the blog a few days ago: https://statmodeling.stat.columbia.edu/2019/03/19/maybe-its-time-to-let-the-old-ways-die-or-we-broke-r-hat-so-now-we-have-to-fix-it/

The Arxiv version: https://arxiv.org/pdf/1903.08008.pdf

The notebook version: https://avehtari.github.io/rhat_ess/rhat_ess.html

If you’re wondering about N_effs and especially quantiles, you’ll probably find that stuff interesting.

If your Markov chain behaves well enough then effective sample size controls the error of your MCMC estimators, such as the mean (see https://betanalpha.github.io/assets/case_studies/rstan_workflow.html for more details). So you want to generate enough effective samples for your application. In general there is no unique answer.

For example, if all I want to do is crudely locate the mean within the marginal posterior distribution, then error[f] ~ \sqrt{ Var[f] } / 3 should be sufficient, and that implies

\sqrt{ Var[f] / ESS[f] } = error[f] ~ \sqrt{ Var[f] } / 3
\sqrt{ ESS[f] } ~ 3
ESS[f] ~ 9, or roughly 10

Typically 100 effective samples allow for a more precise evaluation of the posterior properties, with an error of about one tenth of the posterior standard deviation.
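
The arithmetic above is easy to turn around for any target precision. Here is a small Python sketch; the function names are made up, and it relies only on the relation error[f] = sqrt(Var[f] / ESS[f]), which in turn assumes a central limit theorem holds for the chain.

```python
import math


def mcse_over_sd(ess):
    """Monte Carlo standard error of the mean, in units of the posterior SD."""
    return 1.0 / math.sqrt(ess)


def ess_for_relative_error(rel_error):
    """ESS needed so that the MCSE is rel_error times the posterior SD."""
    return 1.0 / rel_error ** 2


print(ess_for_relative_error(1 / 3))  # about 9: the crude "roughly 10" case
print(mcse_over_sd(100))              # 0.1: error is a tenth of the posterior SD
print(round(mcse_over_sd(300), 3))    # ESS of 300 gives an error near 0.058 SD
```

So an ESS of around 300 already pins down a posterior mean to within about 6% of the posterior standard deviation; whether that is "big enough" depends on what the estimate is used for.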

Just keep in mind that all of this holds only if you have a central limit theorem, which means also checking for divergences, Rhats, and the like.

Thank you very much for the information.