Text for warning message

Hey, I just fit a model and got the following warning message:

1: There were 18 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: Examine the pairs() plot to diagnose sampling problems

I have no problem with the warning message–if my chains have divergent transitions, I’d like to know. I’m not really sure if 18 out of 4000 is a little or a lot, so it could be useful to have guidance on that. But that’s not my main concern here.

Also I like the pointer to the documentation–that’s great–and it makes sense to suggest the pairs() plot.

The thing that’s bothering me is the suggestion, “Increasing adapt_delta above 0.8 may help.” I’m kinda worried that this is encouraging people to sweep problems under the rug, also that it pushes people toward a slower version of Stan where, to be safe, they set adapt_delta to 0.999 or whatever.

Maybe the warning message could say something like, “Your model may be poorly identified. Consider using stronger priors.”

Or something like that? I’m not sure of the exact wording. I just know that in practice it can help to use stronger priors, and typically this prior info is available.

Not always–I recognize that sometimes you’re trying to fit the model you want to fit, and it’s just weakly identified, and you want Stan to explore the damn posterior distribution–but often your computational problems can be fixed with just a bit of regularization.

It would be good if the warning message were to say this. Alternatively, the warning message could not say this, and it could also not mention adapt_delta at all. It could just point to the documentation page which would have all this discussion. But I don’t think it’s good that right now we privilege the adapt_delta suggestion.

@betanalpha

Yes, this relates to Michael’s remarks in this other thread: Improve warnings for low ESS

I think we need to convey that 1 divergent transition is a lot. Possibly even 0 divergent transitions is a cause for concern, if the sampler simply never encountered the part of the parameter space where it would diverge during the 4000 post-warmup iterations.

It is true that divergent transitions are often associated with heavy-tailed priors or using a centered parameterization of a hierarchical model. The difficulty is that the message can’t say “reduce the prior standard deviation on this normal prior” or “switch this to a non-centered parameterization”, so it would end up saying something vague like “reevaluate your priors and other modeling choices”. And many users of Stan aren’t up for that or are using rstanarm, brms, or another package where the user doesn’t have any control over the parameterizations.
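To make the parameterization point concrete, here is a minimal sketch in plain Python (not Stan code) of what “centered” versus “non-centered” means for a single group-level effect. The function names and the values of mu and tau are made up for illustration. Both versions target the same distribution for theta; the difference is that in the non-centered form the sampler would work with a standard normal z, whose scale does not depend on tau.

```python
import random
import statistics


def centered_draws(mu, tau, n, rng):
    """Centered form: draw theta directly from normal(mu, tau)."""
    return [rng.gauss(mu, tau) for _ in range(n)]


def non_centered_draws(mu, tau, n, rng):
    """Non-centered form: draw z ~ normal(0, 1), then set theta = mu + tau * z."""
    return [mu + tau * rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the comparison is repeatable
    mu, tau = 2.0, 0.5      # hypothetical hyperparameter values
    theta_c = centered_draws(mu, tau, 20000, rng)
    theta_nc = non_centered_draws(mu, tau, 20000, rng)
    # The two sets of draws should agree in mean and spread up to Monte Carlo error.
    print(statistics.mean(theta_c), statistics.stdev(theta_c))
    print(statistics.mean(theta_nc), statistics.stdev(theta_nc))
```

In an actual Stan program the same change would be made in the parameters and transformed parameters blocks, which is exactly the kind of edit a warning message cannot spell out for a specific model.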

Increasing adapt_delta doesn’t always work, and even when it does work, there could very well be something else that would work better. But increasing adapt_delta does reduce the expected step size and does not increase the expected number of divergences. So, it is something simple that Stan users can do irrespective of what the model is and it might suffice.
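For intuition on why a smaller step size helps, here is a rough sketch in plain Python. It is not Stan's implementation: it runs a leapfrog integrator on a one-dimensional Gaussian target with a small scale sigma (standing in for a region of high curvature) and flags the trajectory as divergent when the energy error exceeds a cutoff. The function names, the starting point, and the cutoff of 1000 are assumptions chosen for illustration. For this target the integrator is unstable once the step size exceeds 2 * sigma.

```python
DIVERGENCE_THRESHOLD = 1000.0  # illustrative cutoff on the energy error


def leapfrog_energy_error(step_size, sigma=0.1, n_steps=50):
    """Largest |H - H0| along a leapfrog trajectory for U(q) = q^2 / (2 sigma^2)."""

    def grad_u(q):
        return q / sigma**2

    def hamiltonian(q, p):
        return q**2 / (2 * sigma**2) + p**2 / 2

    q, p = sigma, 1.0  # arbitrary starting position and momentum
    h0 = hamiltonian(q, p)
    max_err = 0.0
    for _ in range(n_steps):
        p -= 0.5 * step_size * grad_u(q)  # half step for momentum
        q += step_size * p                # full step for position
        p -= 0.5 * step_size * grad_u(q)  # second half step for momentum
        max_err = max(max_err, abs(hamiltonian(q, p) - h0))
    return max_err


def is_divergent(step_size, sigma=0.1):
    """Flag a trajectory whose energy error exceeds the illustrative cutoff."""
    return leapfrog_energy_error(step_size, sigma) > DIVERGENCE_THRESHOLD


if __name__ == "__main__":
    for eps in (0.25, 0.05):
        print(eps, is_divergent(eps))
```

The catch, which is the point of this thread, is that the stable step size is set by the narrowest region of the posterior, so a smaller step size makes every trajectory more expensive and does nothing about the geometry that caused the problem.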

More broadly, there is a movement toward thinking about priors as just being tuning parameters for an MCMC algorithm. You see that line of thought in a lot of places, but I don’t think we want Stan users doing things like choosing the largest value of a prior standard deviation that achieves zero divergences because the divergences are a product of everything, including the data you are conditioning on. So, a prior chosen in that way isn’t prior.

I for one am not very keen on this take. Modelling and fitting should be kept separate as far as possible, as a matter of principle. That’s my position, at least.

It’s fine that Stan reports divergent transitions. I just don’t think it’s a good idea for it to say, “Increasing adapt_delta above 0.8 may help,” in the warning message. We already point people toward the webpage. That should be enough. They can read the webpage and see the recommendations.

To put it another way: I bet that in a lot of places where people increase adapt_delta to 0.999, it just makes the program take longer and get to a bad answer. In other cases, it probably just takes longer and gives the same answer. I’m not at all clear that increasing adapt_delta is good general advice. It can be in the toolkit, fine, but I think it’s a mistake to have it as the single piece of advice we give to people. Not a tragic mistake, but a mistake nonetheless.

I saw this quite a few times here in the forum: People seek help because their model runs slowly. Then you see they have something like adapt_delta = 0.9999 and still a couple of hundred divergent transitions. Really people should get used to reparameterizing, but this can be a big hurdle (especially when you’re new to Stan). I guess a lot of people then do the one thing they can do (cranking up adapt_delta) and hope it’ll be okay. So yeah, it’d probably make sense to remove the adapt_delta “hint”.

Yes, we can give it as one of several directions to go. I think that sometimes people have the impression that increasing adapt_delta is “safer” or “more conservative” than adding prior information. But, from a statistical perspective, regularization is a very conservative thing to do, whereas increasing adapt_delta (or running 10^5 iterations, which is another thing that people traditionally do) will not necessarily help at all. It will just make the program take longer, thus delaying the inevitable. I think this is similar to the point that Michael B. was making in that other thread.

Just to be clear: there are surely some cases where increasing adapt_delta is a good idea. We should keep it in the toolbox. I just don’t think it should be the first tool, or the only tool, that people reach for.

Without a pretty constructive recommendation in the message, I would hypothesize that a lot more people will just report results that have divergent transitions. Someone tell me why I am wrong.

Here are two proposals for a replacement warning message:

(a)

1: There were 18 divergent transitions after warmup. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: Examine the pairs() plot to diagnose sampling problems

(b)

1: There were 18 divergent transitions after warmup. Using a stronger prior may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: Examine the pairs() plot to diagnose sampling problems

I wouldn’t call either of those a pretty constructive recommendation. It is like saying “Be better”. Saying “using a stronger prior may help” doesn’t say which parameter the user should consider using a stronger prior on.

If we really want to teach the importance of divergences, the message should be clearer about that. Perhaps something along the lines of:

There were 18 divergent transitions after warmup, which may compromise the validity of the estimates. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup to find why this is a potential problem and how to remove them.

Mcol:

I like that warning: clear and concise.

I put that message into the latest rstan, but it’s going to result in a higher percentage of invalid analyses.

This is something that ultimately we will need to handle in the documentation, to give people a good set of steps to follow when this happens. We could add an example to that webpage if that would help.

What about a warning message for adapt_delta > 0.99 or max_treedepth > 12 saying it’s time to reparameterize or regularize?

Or the “increase adapt_delta” message could stop appearing once adapt_delta > 0.9.
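As a rough sketch of how these two suggestions could fit together, here is some plain Python. This is not rstan's actual code: the function name is hypothetical, the cutoffs (0.99, 12, 0.9) are the ones floated above, and the base wording is the message proposed earlier in the thread. The defaults of adapt_delta = 0.8 and max_treedepth = 10 match Stan's defaults.

```python
DOC_URL = "http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup"


def divergence_warning(n_divergent, adapt_delta=0.8, max_treedepth=10):
    """Return a warning string, or None when there were no divergences."""
    if n_divergent == 0:
        return None
    parts = [
        f"There were {n_divergent} divergent transitions after warmup, "
        "which may compromise the validity of the estimates."
    ]
    if adapt_delta > 0.99 or max_treedepth > 12:
        # Tuning is already aggressive: point at the model instead.
        parts.append("Consider reparameterizing or regularizing the model.")
    elif adapt_delta <= 0.9:
        # Only offer the adapt_delta hint while it is still fairly low.
        parts.append(f"Increasing adapt_delta above {adapt_delta} may help.")
    parts.append(f"See {DOC_URL}")
    return " ".join(parts)


if __name__ == "__main__":
    print(divergence_warning(18))
    print(divergence_warning(18, adapt_delta=0.999))
```

Whether the adapt_delta hint should appear at all is the open question in this thread; the sketch only shows that making it conditional is cheap to implement.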

What is the lowest value of adapt_delta that you have seen work with something like a horseshoe prior?

Fair, I guess I really don’t have that good a sense of the range of adapt_delta values. I do find upping it useful. I don’t think the current message (with adapt_delta = 0.8) is particularly bad or anything.

With regularized horseshoe, 0.8 has worked more than once, but I didn’t test any lower values (and then sometimes 0.999 is not enough).

I like this message, but I think we need to improve the page it links to. It’s not that it’s wrong, it’s just a lot of paragraphs that should be rewritten or reorganized. I think users will be more likely to read it and pay attention to it if we can formulate it more as a checklist than as an essay (maybe it’s not possible to provide a checklist but at least something more user friendly than it currently is). Also it currently doesn’t really have an example of adding more prior information, just increasing adapt_delta and reparameterizing.

@andrewgelman Since you care a lot about this and are also a super fast writer, do you want to try writing something up that we can potentially use as the page we link to in the message?
