Text for warning message

andrewgelman · July 23, 2020, 2:35pm

Hey, I just fit a model and got the following warning message:

1: There were 18 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: Examine the pairs() plot to diagnose sampling problems

I have no problem with the warning message–if my chains have divergent transitions, I’d like to know. I’m not really sure if 18 out of 4000 is a little or a lot, so it could be useful to have guidance on that. But that’s not my main concern here.

Also I like the pointer to the documentation–that’s great–and it makes sense to suggest the pairs() plot.

The thing that’s bothering me is the suggestion, “Increasing adapt_delta above 0.8 may help.” I’m kinda worried that this is encouraging people to sweep problems under the rug, also that it pushes people toward a slower version of Stan where, to be safe, they set adapt_delta to 0.999 or whatever.

Maybe the warning message could say something like, “Your model may be poorly identified. Consider using stronger priors.”

Or something like that? I’m not sure of the exact wording. I just know that in practice it can help to use stronger priors, and typically this prior info is available.

Not always–I recognize that sometimes you’re trying to fit the model you want to fit, and it’s just weakly identified, and you want Stan to explore the damn posterior distribution–but often your computational problems can be fixed with just a bit of regularization.

It would be good if the warning message were to say this. Alternatively, the warning message could not say this, and it could also not mention adapt_delta at all. It could just point to the documentation page which would have all this discussion. But I don’t think it’s good that right now we privilege the adapt_delta suggestion.

maxbiostat · July 23, 2020, 2:43pm

@betanalpha

andrewgelman · July 23, 2020, 3:13pm

Yes, this relates to Michael’s remarks in this other thread: “Improve warnings for low ESS”

bgoodri · July 25, 2020, 3:35pm

I think we need to convey that 1 divergent transition is a lot. Possibly 0 divergent transitions is a lot if the sampler just never encountered the part of the parameter space where it would diverge in 4000 post-warmup iterations.

It is true that divergent transitions are often associated with heavy-tailed priors or using a centered parameterization of a hierarchical model. The difficulty is that the message can’t say “reduce the prior standard deviation on this normal prior” or “switch this to a non-centered parameterization”, so it would end up saying something vague like “reevaluate your priors and other modeling choices”. And many users of Stan aren’t up for that or are using rstanarm, brms, or another package where the user doesn’t have any control over the parameterizations.
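To make the centered versus non-centered distinction concrete, here is a minimal sketch in plain Python (not Stan code, and not something posted in the thread). It only checks that both parameterizations describe the same distribution for theta; the names `mu`, `tau`, `z`, and both helper functions are illustrative assumptions. The practical difference is that in the non-centered form the sampler works with `z`, whose scale does not depend on `tau`.

```python
import random
import statistics


def draw_centered(mu, tau, rng):
    # Centered form: theta is drawn directly from Normal(mu, tau).
    return rng.gauss(mu, tau)


def draw_non_centered(mu, tau, rng):
    # Non-centered form: draw a standard normal z, then shift and scale it.
    z = rng.gauss(0.0, 1.0)
    return mu + tau * z


rng = random.Random(1)
mu, tau = 2.0, 0.5  # arbitrary illustrative values

centered = [draw_centered(mu, tau, rng) for _ in range(20000)]
non_centered = [draw_non_centered(mu, tau, rng) for _ in range(20000)]

# Both sets of draws should have roughly the same mean and standard deviation.
print(statistics.mean(centered), statistics.stdev(centered))
print(statistics.mean(non_centered), statistics.stdev(non_centered))
```

In a hierarchical model the same substitution is applied to each group-level parameter, which is the change the warning message has no way to spell out for a specific model.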

Increasing adapt_delta doesn’t always work, and even when it does work, there could very well be something else that would work better. But increasing adapt_delta does reduce the expected step size and does not increase the expected number of divergences. So, it is something simple that Stan users can do irrespective of what the model is and it might suffice.

More broadly, there is a movement toward thinking about priors as just being tuning parameters for an MCMC algorithm. You see that line of thought in a lot of places, but I don’t think we want Stan users doing things like choosing the largest value of a prior standard deviation that achieves zero divergences because the divergences are a product of everything, including the data you are conditioning on. So, a prior chosen in that way isn’t prior.

maxbiostat · July 25, 2020, 4:20pm

I for one am not very keen on this take. Modelling and fitting should be kept separate as far as possible, as a matter of principle. That’s my position, at least.

andrewgelman · July 25, 2020, 9:44pm

It’s fine that Stan reports divergent transitions. I just don’t think it’s a good idea for it to say, “Increasing adapt_delta above 0.8 may help,” in the warning message. We already point people toward the webpage. That should be enough. They can read the webpage and see the recommendations.

To put it another way: I bet that in a lot of places where people increase adapt_delta to 0.999, it just makes the program take longer and get to a bad answer. In other cases, it probably just takes longer and gives the same answer. I’m not at all clear that increasing adapt_delta is good general advice. It can be in the toolkit, fine, but I think it’s a mistake to have it as the single piece of advice we give to people. Not a tragic mistake, but a mistake nonetheless.

Max_Mantei · July 25, 2020, 10:47pm

I saw this quite a few times here in the forum: People seek help because their model runs slowly. Then you see they have something like adapt_delta = 0.9999 and still a couple of hundred divergent transitions. Really, people should get used to reparameterizing, but this can be a big hurdle (especially when you’re new to Stan). I guess a lot of people then do the one thing they can do (cranking up adapt_delta) and hope it’ll be okay. So yeah, it’d probably make sense to remove the adapt_delta “hint”.

andrewgelman · July 25, 2020, 10:52pm

Yes, we can give it as one of several directions to go. I think that sometimes people have the impression that increasing adapt_delta is “safer” or “more conservative” than adding prior information. But, from a statistical perspective, regularization is a very conservative thing to do, whereas increasing adapt_delta (or running 10^5 iterations, which is another thing that people traditionally do) will not necessarily help at all. It will just make the program take longer, thus delaying the inevitable. I think this is similar to the point that Michael B. was making in that other thread.

Just to be clear: there are surely some cases where increasing adapt_delta is a good idea. We should keep it in the toolbox. I just don’t think it should be the first tool, or the only tool, that people reach for.

bgoodri · July 26, 2020, 2:07am

Without a pretty constructive recommendation in the message, I would hypothesize that a lot more people will just report results that have divergent transitions. Someone tell me why I am wrong.

andrewgelman · July 26, 2020, 2:12am

Here are two proposals for a replacement warning message:

(a)

1: There were 18 divergent transitions after warmup. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: Examine the pairs() plot to diagnose sampling problems

(b)

1: There were 18 divergent transitions after warmup. Using a stronger prior may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: Examine the pairs() plot to diagnose sampling problems

bgoodri · July 26, 2020, 5:13am

I wouldn’t call either of those a pretty constructive recommendation. It is like saying “Be better”. Saying “using a stronger prior may help” doesn’t say which parameter the user should consider using a stronger prior on.

mcol · July 26, 2020, 12:49pm

If we really want to teach the importance of divergences, the message should be clearer about that. Perhaps something along the lines of:

There were 18 divergent transitions after warmup, which may compromise the validity of the estimates. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup to find why this is a potential problem and how to remove them.
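As a rough sketch of how such a message could be assembled from the sampler output, the Python below fills in the count and also reports the share of post-warmup draws that diverged. The function names are hypothetical and this is not rstan's implementation; the wording is copied from the proposal above, and the 4000 post-warmup draws come from the first post in the thread.

```python
DOC_URL = "http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup"


def divergent_fraction(n_divergent, n_post_warmup):
    # Share of post-warmup iterations that ended in a divergence.
    return n_divergent / n_post_warmup


def divergence_warning(n_divergent):
    # Hypothetical helper: returns None when there is nothing to report.
    if n_divergent == 0:
        return None
    return (
        f"There were {n_divergent} divergent transitions after warmup, "
        "which may compromise the validity of the estimates. "
        f"See {DOC_URL} to find why this is a potential problem "
        "and how to remove them."
    )


message = divergence_warning(18)
fraction = divergent_fraction(18, 4000)  # 18 of 4000 draws, i.e. 0.45%
print(message)
print(f"{fraction:.2%} of post-warmup draws diverged")
```

A small fraction is not reassurance in itself: as noted earlier in the thread, even a single divergence can matter.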

andrewgelman · July 26, 2020, 2:59pm

Mcol:

I like that warning: clear and concise.

bgoodri · July 26, 2020, 9:02pm

I put that message into the latest rstan, but it’s going to result in a higher percentage of invalid analyses.

andrewgelman · July 26, 2020, 9:07pm

This is something that ultimately we will need to handle in the documentation, to give people a good set of steps to follow when this happens. We could add an example to that webpage if that would help.

bbbales2 · July 26, 2020, 9:45pm

What about a warning message for adapt_delta > 0.99 or max_treedepth > 12 saying it’s time to reparameterize or regularize?

Or the “increase adapt_delta” message could stop appearing once adapt_delta > 0.9.
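The two ideas above can be sketched as simple threshold checks. This is only an illustration in Python: the function names and the hint text are made up, and the cutoffs (0.99, 12, 0.9) are just the values floated in this post, not anything Stan implements.

```python
def reparameterize_hint(adapt_delta, max_treedepth):
    # First idea: once the sampler settings are already extreme,
    # tell the user to change the model rather than the settings.
    if adapt_delta > 0.99 or max_treedepth > 12:
        return "Consider reparameterizing or regularizing the model."
    return None


def show_adapt_delta_suggestion(adapt_delta):
    # Second idea: only suggest raising adapt_delta while it is
    # still at or below 0.9.
    return adapt_delta <= 0.9


print(reparameterize_hint(0.8, 10))       # default-like settings: no hint
print(reparameterize_hint(0.999, 10))     # extreme adapt_delta: hint
print(show_adapt_delta_suggestion(0.95))  # suggestion suppressed
```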

bgoodri · July 26, 2020, 9:53pm

What is the lowest value of adapt_delta that you have seen work with something like a horseshoe prior?

bbbales2 · July 26, 2020, 10:20pm

Fair, I guess I really don’t have that good a sense of the range of adapt_delta values. I do find upping it useful. I don’t think the current message (with adapt_delta = 0.8) is particularly bad or anything.

avehtari · July 27, 2020, 9:21am

With regularized horseshoe, 0.8 has worked more than once, but I didn’t test any lower values (and then sometimes 0.999 is not enough).

jonah · July 27, 2020, 5:53pm

I like this message, but I think we need to improve the page it links to. It’s not that it’s wrong, it’s just a lot of paragraphs that should be rewritten or reorganized. I think users will be more likely to read it and pay attention to it if we can formulate it more as a checklist than as an essay (maybe it’s not possible to provide a checklist but at least something more user friendly than it currently is). Also it currently doesn’t really have an example of adding more prior information, just increasing adapt_delta and reparameterizing.

@andrewgelman Since you care a lot about this and are also a super fast writer, do you want to try writing something up that we can potentially use as the page we link to in the message?
