Improve warnings for low ESS

ADMIN EDIT: I moved this post to its own topic, it is not the poster’s fault that it refers to a previous (but mostly unrelated) discussion.

Diverting the thread somewhat, but this is something for which I think there should definitely be a good guidance discussion somewhere. In my experience of learning Stan, it was very common to see a model produce the classic "Running the chains for more iterations may help." message, but as far as I can see the linked page at Runtime warnings and convergence problems doesn’t have much of a discussion around when you’re likely to have taken things too far. In my long, slow learning process I’ve definitely run models with 64,000 iterations just because things seemed to be getting slowly better and the “more iterations may help” messages kept coming. From reading this forum, it seems that I’m far from the only one who has kept throwing iterations at a problem beyond what is reasonable.

I know that there is unlikely to be a hard and fast rule, but I’ve fairly frequently seen comments telling people that they should never need to run more than n-thousand iterations. It seems that there’s folk knowledge out there that should be distilled into guidance for beginners who run into this common problem. I’d certainly appreciate that guidance now, and would have loved it a year or two ago.

Alternatively, maybe this is spelled out somewhere and I’ve just missed it!

7 Likes

Well, Stan spits out a lot of warnings, that’s true.

If you are just concerned about the mean of a parameter, then you are good with an ESS of about 100; if more precision is needed, then 200-300, but really not more. Should you worry about the precision of some quantiles, then the tail ESS measure is a good one to consider.
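For checking these numbers directly, here is a minimal R sketch using the posterior package (assuming an existing rstan fit called `fit`; "mu" is a placeholder parameter name):

```r
library(posterior)

# `fit` is assumed to be an existing rstan stanfit object; "mu" is a placeholder
draws <- as.array(fit)          # iterations x chains x parameters
mu_draws <- draws[, , "mu"]     # iterations x chains matrix for one parameter

ess_bulk(mu_draws)   # ~100 is enough for a stable posterior mean
ess_tail(mu_draws)   # check this if you care about quantiles / interval endpoints

# or bulk and tail ESS for every parameter at once
summarise_draws(as_draws_array(draws))
```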

2 Likes

Sure, this is great advice. My issue is not that the information isn’t out there, but that it seems to be distributed across a fair number of forum posts and accreted know-how (and maybe academic papers) rather than in a centralised resource page somewhere that is accessible to someone relatively new to Stan.

From my experience, I’d say that “more iterations may help” is one of the most common warnings I’ve seen, but it would seem very valuable for the linked page about that warning to have a line somewhere saying: “If you find yourself running more than n-thousand iterations, you may have deeper problems. See the discussion here: …”, and link to a fairly concise set of guidelines that can point people in the right direction.

My concern right now, based on my own experience, is that people just end up running more and more iterations, then eventually find their way to the forum to be told that that was a bad idea all along.

EDIT: Just to be clear: these forums are amazingly friendly and helpful. I more mean that this seems a common enough issue that it deserves a dedicated page somewhere that can save people the trouble of finding their own way to the solution.

7 Likes

Just moved this to its own topic as I think it deserves some discussion. It is kind of similar to the advice to change adapt_delta for divergences, which also rarely helps; many people end up visiting the forums with calls like adapt_delta = 0.9999. I think that both the warning and the warnings page should probably switch the focus a bit and say something like “The most likely case is that the model is problematic, but for some models running for more iterations can also help.” (ditto for the divergences message)

8 Likes

This and improving many other diagnostic messages and corresponding web pages is on my todo list, but unfortunately that list is quite crowded. If someone wants to help let me know. I can also review any proposals for improved texts.

2 Likes

I’ve found it to help a lot with hierarchical models. So much so I might suggest we just increase the adaptation target. But that’s a different thread :-)

We need to make clear that we only need more iterations until the effective sample size is high enough. If we aren’t mixing, we probably need to reparameterize rather than run more iterations.

This is all going to be relevant if @andrewgelman manages to persuade people to push the defaults down to 200 warmup and 200 sampling iterations, or whatever he wants. Then running more iterations will often be necessary.

2 Likes

I will try to write something up soon on our general recommendation to improve these geometries by adding relevant prior information to soft-constrain the model.

2 Likes

Great thread! Maybe some sort of list of things to do could help new users? Something like:

"Your model converged low efficient sample size for some parameters, indicating that posterior estimates may be unreliable. Try this:

  1. Check if the parameters in your model are correctly specified and make sense.
  2. Tighten up the priors. Does sampling improve? Your sampler might have been lost in a weird region of the parameter space.
  3. Check correlations between parameters using the “pairs()” plot.
  4. Increase the number of iterations using “iter =”. Be wary of diminishing returns."
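For steps 3 and 4 of a checklist like that, a hedged R sketch of what it might look like (assuming an existing rstan fit called `fit`; the parameter names and `stan_data` are placeholders):

```r
library(rstan)

# step 3: look for strong correlations or funnel shapes between parameters
pairs(fit, pars = c("mu", "tau"))   # "mu" and "tau" are placeholder names

# step 4: rerun with more iterations, but treat this as a last resort
fit_longer <- stan(
  fit = fit,          # reuse the already-compiled model
  data = stan_data,   # placeholder for your data list
  iter = 4000,
  chains = 4
)
```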
2 Likes

Somewhat related: should we add a warning when users request ridiculously high iterations? I’ve occasionally seen folks using very high values out of the gate, presumably because they’re used to older MCMC algorithms that necessitated many iterations and heavy thinning. Something along the lines of:

“You have requested a large number of iterations. This is usually not necessary (tail quantities typically need only X effective samples for reliable posterior estimation), but it can be required in some rare cases. If you have already run this model with a standard number of iterations, obtained a low ESS, and subsequently exhausted your options to reparameterize the model for better geometry, you can ignore this warning.”

9 Likes

I do hope that the user will be able to use a switch to turn off some or all of the warnings. If you have a replication package and a reviewer/reader runs it and gets warnings, they will become suspicious, even though everything has been checked. I think this is even more important in new fields where Bayesian analysis is starting to be used more and more.

4 Likes

Thanks for all the comments, they are very useful. We (@paul.buerkner and I) are writing down all the ideas and we’ll contact you when we have something. We’ll think about how to organize the ideas and various related feedback so that they don’t get lost in Discourse.

2 Likes

This is really dangerous. Lots of statisticians have said this to us before, starting with @andrewgelman. The programmers among us have prevailed so far to keep the warnings in place.

I would suggest instead that the problem is warnings that are just suggestions (don’t use too few or too many iterations) or that are subject to false positives (missing Jacobian on transform; integer division, etc.). I suggest we move these to a linter (or “pedantic mode” in the @andrewgelman lexicon, my favorite of which is his term for unit and regression testing, which, like many idioms, apparently fails when translated (to Japanese)). Then when we get warnings or errors in default mode, we’ll know they’re something we should pay attention to.

We’ve come a long way since the days of FUD around open source. So much so that you now get FUD for closed source. So just hang on a few decades and everything should work out with all this anti-Bayes sentiment.

Agree completely. If there are things one must care about then they shouldn’t be optional. Many new users (read: reviewers) will be afraid to see replications spit out things that seem to be serious, but are actually completely normal.

I like what Martin wrote, “The most likely case is that the model is problematic, but for some models running for more iterations can also help.” But, to elaborate, often what is happening is not that the model is problematic but that we can put in some prior information and all will work fine. This has happened to me again and again: I fit a model and start with flat or near-flat priors, then I get some convergence issues which are immediately resolved by just including reasonable priors.
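To make this concrete, “including reasonable priors” is often only a line or two added to the model block. A hypothetical sketch (the regression, parameter names, and prior scales are placeholders, not recommendations for any particular problem):

```r
library(rstan)

model_code <- "
data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] x;
  vector[N] y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  // without the next two lines the priors are implicitly flat,
  // which is often what triggers the convergence warnings
  beta ~ normal(0, 2.5);
  sigma ~ normal(0, 5);     // half-normal via the lower bound
  y ~ normal(x * beta, sigma);
}
"

fit <- stan(model_code = model_code, data = stan_data)  # stan_data is a placeholder
```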

I agree with everyone in the above thread who said that we should avoid giving users the default advice to increase the number of iterations and increase adapt_delta. As the folk theorem tells us, poor convergence is itself a model diagnostic!

Allow me to suggest Markov Chain Monte Carlo in Practice.

The problem with responding to diagnostics is that sampler failure is many-to-one – a nearly infinite number of pathologies will result in the same failure message. In general one has to hypothesize possible pathologies and investigate (using, for example, the spatial location of divergences), but ideally one would be using well-understood modeling techniques where the possible pathologies and their manifestations are documented, so you can quickly identify the problems. See for example

https://betanalpha.github.io/assets/case_studies/divergences_and_bias.html
https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html
https://betanalpha.github.io/assets/case_studies/underdetermined_linear_regression.html
https://betanalpha.github.io/assets/case_studies/ordinal_regression.html

$100 your parameterization is wrong or your population location/scale prior is too diffuse.

2 Likes

The prior’s part of the model, so putting in prior information is changing the model. Maybe not the likelihood. I have the same issue with @betanalpha’s comments. Sure, we can put in “prior information” and change the model, but it doesn’t change the fact that Stan fails to fit the model we originally specified, despite it having a proper posterior.

Where do you think that’s happening? It’s when there are divergences and it’s a suggestion for how to fix them. That very often works for me.

This isn’t poor convergence so much as poor initialization of sampling parameters. It will converge.

Lowering the step size and increasing adapt_delta often gets rid of the handful of divergence warnings I get.
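For reference, a minimal rstan sketch of those settings (`fit0` and `stan_data` are placeholders for an already-compiled model and its data):

```r
library(rstan)

fit <- stan(
  fit = fit0,           # placeholder: an already-compiled model
  data = stan_data,     # placeholder: the corresponding data list
  control = list(
    adapt_delta = 0.99,   # default is 0.8; higher targets force smaller steps
    stepsize = 0.01,      # smaller initial step size for adaptation
    max_treedepth = 12    # optional, if treedepth warnings also appear
  )
)
```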

I’m not sure what you even mean by a prior being “wrong”. Do the prior, likelihood, and data cause divergence warnings with default inits, step size, and adapt_delta? Yes, they do. But that’s a problem with Stan not being able to sample from the model I specified—the priors and posteriors are all proper. The prior is only “wrong” in the sense that it won’t work with Stan for a given likelihood and data set.

Yes, that’s right. What I should’ve said was, “Often what is happening is that the model is problematic but the problematic part can be fixed by including prior information that is already available but which had not yet been included in the model.”

I’d like a recommendation saying something like, “This problem often arises when parameters are poorly identified. Often if you include some moderately informative priors, the problems will go away.”

But at the expense of increased computation. Users pushing adapt_delta = 0.999, or interfaces defaulting to adapt_delta = 0.99, are just trying to hide degeneracy problems behind increased computation that at best makes it hard for users to identify the underlying problems and at worst burns a load of carbon only to turn up more divergences.

Having a proper posterior density doesn’t imply that we should be able to fit the model accurately, let alone efficiently. The infamous inv_gamma(epsilon, epsilon) priors from the early BUGS era are proper but cause no end of problems in practice. Do you expect Stan to fit models with those proper priors?

Divergences aren’t just indicating that Stan can’t accurately quantify a given posterior density function. The pathologies that mess up Hamiltonian Monte Carlo and induce divergences are almost always (I have yet to see an exception) due to degeneracies in the posterior density function which indicate that the chosen parameters are not independently well-informed by the data or the prior model. The model is not wrong in any philosophical sense but rather in the sense that it conflicts with the tacit assumption that the parameters were sufficiently well informed to be able to learn anything useful about them.

In particular, you can burn more compute to push Hamiltonian Monte Carlo to explore the degenerate posterior more accurately but will it be worthwhile if the final inferences aren’t all that useful because of the limited information?

Reparameterizations, more careful priors, and the like are all ways of adding information or concentrating available information in more effective ways. We know to consider these approaches, however, only when we follow up on the divergences. Conveniently, I just published a case study on this topic, Identity Crisis.

For example, in hierarchical models, do you have unbalanced data? Have you centered and non-centered each individual parameter separately, based on the sparsity of the data (as a proxy for the width of the likelihood function)? Do you have enough individuals, or enough individuals with enough data, to learn the population scale well enough to break the population location/population scale degeneracy?
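For concreteness, a hypothetical non-centered hierarchical sketch (variable names are placeholders; in practice you might non-center only the data-sparse groups and keep the data-rich ones centered):

```r
library(rstan)

model_code <- "
data {
  int<lower=1> N;
  int<lower=1> J;
  int<lower=1, upper=J> group[N];   // pre-2.33 array syntax, as accepted by rstan
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta_raw;              // standardized group offsets
  real<lower=0> sigma;
}
transformed parameters {
  vector[J] theta = mu + tau * theta_raw;   // implies theta ~ normal(mu, tau)
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 1);
  theta_raw ~ std_normal();
  sigma ~ normal(0, 1);
  y ~ normal(theta[group], sigma);
}
"

fit <- stan(model_code = model_code, data = stan_data)  # stan_data is a placeholder
```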

2 Likes

Thanks. The whole notion of identifiability is super confused in the Bayesian world. I wish we’d come up with a different word so as not to confuse everyone coming to this from the frequentist world, where it’s a property of only the likelihood function and penalty.

I’m particularly curious about where we potentially have to throw out useful information because it’s just too weak for our limited MCMC-based inference to extract. For instance, two highly correlated parameters that aren’t perfectly correlated should be manageable but often are not (as in the hierarchical model example).

Even in the best of circumstances, MCMC is terrible for the environment compared to approximate methods. But hey, at least we’re not burning up the planet as fast as the neural network folks.

Does that mean that the model’s still bad after I get rid of divergences? How do I know? How do I know other models aren’t also bad that don’t throw divergences? Is SBC enough? I suppose I should stop writing and read the case study!

Nope. Stan still can’t do everything.

That’s not a degeneracy unless you somehow conflate inference with tight estimates of parameters. Bayes should be able to average over that uncertainty and still give me reasonable inferences. Things break down because of computation, not because the parameters aren’t tightly concentrated.

For example, I’m working on a model of gene splice variant expression now. That’s technically not identified in some circumstances, because there may be two alternative explanations that even a ton of data won’t resolve in favor of one or the other. I’d argue this is a feature, not a bug. The classical methods just break because matrices can’t be inverted, but we can deal with the uncertainty in Bayes by just propagating it. For some splice variants we’ll get tight inference; others might be very diffuse.

I agree that’s the question we should be asking before we even break out MCMC in the first place.

2 Likes

I would guess that when users do something like increase an HMC tuning parameter such as adapt_delta, it is in part because, at least in the documentation I’ve seen, these are a bit of an arbitrary number and the advice has been: bump it up and see what happens. Turn the issue around and ask: what is special about adapt_delta = 0.8 that makes it the default? Does 0.8 hide degeneracy problems in comparison to 0.7? I suspect the answer is that we’ve run a bunch of models and empirically determined that it’s a good number, but that doesn’t feel very satisfactory for understanding how to think about models.
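One way to probe that question for a specific model is to rerun it at a few acceptance targets and count divergences. A hedged sketch, assuming rstan (`fit0` and `stan_data` are placeholders):

```r
library(rstan)

# rerun the same model at several acceptance targets and count divergences
deltas <- c(0.7, 0.8, 0.9, 0.99)
divergences <- sapply(deltas, function(d) {
  fit <- stan(fit = fit0, data = stan_data,    # placeholders
              control = list(adapt_delta = d), refresh = 0)
  get_num_divergent(fit)
})
setNames(divergences, deltas)
```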