Text for warning message

Switching, for example, from a centered parameterization to a non-centered parameterization would be fine (if you are writing the Stan program yourself) because that doesn’t contradict anything you thought before you tried to draw from the posterior distribution and got divergence warnings.
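(For concreteness, a minimal sketch of that switch for an eight-schools-style hierarchical model; the data and variable names are placeholders. Both programs define the same posterior, they just present different geometry to the sampler. First the centered version:)

```stan
// Centered parameterization: theta is declared directly, which
// creates a funnel between theta and tau that HMC can struggle with.
data {
  int<lower=0> J;
  vector[J] y;
  vector<lower=0>[J] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta;
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}
```

And the non-centered version, which replaces theta with a standard-normal theta_raw that gets rescaled, flattening the funnel:

```stan
data {
  int<lower=0> J;
  vector[J] y;
  vector<lower=0>[J] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta_raw;
}
transformed parameters {
  // Same theta as before, but now derived from unit-scale draws.
  vector[J] theta = mu + tau * theta_raw;
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 5);
  theta_raw ~ std_normal();
  y ~ normal(theta, sigma);
}
```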

In contrast, recommending that people tweak their priors in order to eliminate the divergence warnings is going down a dark road. It is implicitly saying, “if there is a funnel, use the prior to close that funnel by defining it to be outside the typical set of the posterior distribution”, which is just as biased as having divergent transitions. As I said before,

So, you end up with draws from a distribution that isn’t a posterior distribution, or rather, it is a posterior distribution for someone else who would have substantively favored the divergence-avoiding “prior” distributions you landed on.

That is the same bad thinking as choosing the prior to maximize predictive performance on the test data or choosing the prior to maximize the Bayes factor. In contrast, adapt_delta actually is a tuning parameter and is intended to be used that way.

Basically, the only way in which prior-hacking to avoid divergent transitions is benign is if the original priors poorly reflected what you believed about the parameters before seeing the data. And if seeing the divergent transition warnings spurs you to specify priors that better correspond to what you believed about the parameters before seeing the data, then you are indirectly better off. But that is something you should have already done before seeing the divergent transition warnings. It is something you should have resolved when you drew from the prior predictive distribution and/or did simulation-based calibration (SBC).
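(For anyone who hasn’t tried this, a minimal sketch of drawing from the prior predictive distribution in Stan; the regression structure and the priors here are placeholders, not a recommendation. Because there is no model block, you would run this with the fixed_param algorithm:)

```stan
// Prior predictive check: no likelihood, so the simulated data
// reflect only what the priors imply about plausible data sets.
data {
  int<lower=0> N;
  vector[N] x;
}
generated quantities {
  // Draw the parameters from their priors...
  real alpha = normal_rng(0, 10);
  real beta = normal_rng(0, 10);
  real sigma = exponential_rng(1);
  // ...and simulate the data set those draws imply.
  vector[N] y_sim;
  for (n in 1:N)
    y_sim[n] = normal_rng(alpha + beta * x[n], sigma);
}
```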

And 90% of Stan users aren’t down for that anyway.

Yes, I think the original priors do typically poorly reflect people’s information.

Anyway, I think the best solution is a warning message that says divergences are a real concern and points people to the webpage, without privileging any particular potential solution, whether it be priors or reparameterization or adapt_delta or anything else.

Right now the webpage privileges two potential solutions: increase adapt_delta and reparameterize. So to not privilege any solution, it either needs an additional section about prior information (so that prior information isn’t the only solution left unmentioned) or the other suggestions need to be removed (I prefer the former, since otherwise we’re not recommending anything).

So the webpage really needs to be edited. Any suggestions for the new content?

All true, but ignoring this and only mentioning adapt_delta just makes it less likely that people will consider doing prior predictive checks. That is, I think we’re responsible for some of that 90%. If we want to increase the percentage of people doing what we recommend, then we should recommend that they do it. Right now we’re just recommending increasing adapt_delta (or reparameterizing), which does nothing to point people to investigating whether they actually included the prior information they should have.

Anyway, I think if we have warning messages with recommendations (or that point to a webpage with recommendations) then those recommendations should include what we actually recommend. In some cases that’s increasing adapt_delta, in others reparameterizing, and in other cases we recommend considering whether you’ve actually encoded the prior information you have. In fact that last one is the only one we always recommend doing and yet it’s also the only one we don’t mention in the warning or linked webpage.

1 Like

Changing adapt_delta and / or reparameterizing to a non-centered approach are compatible with Bayesian analysis, while hacking your prior to push a pocket of the parameter space out of the typical set of the posterior distribution is basically the same as ignoring the divergent transitions in the first place. It may well be the case that the priors people were using were inconsistent with their prior beliefs, in which case they should have specified their actual priors from the outset, irrespective of whether they get divergent transition warnings.

1 Like

I agree with you, Ben. I don’t think we should recommend “hacking your prior to push a pocket of the parameter space out of the typical set.” Maybe in the documentation we can say something like this:

"When there are divergences, this can be an opportunity for you to revisit your model. Often we have found that users have prior information available to them that they have not included in their model. If you have such prior information, you can add this to your model now, and, in addition to the benefits in statistical efficiency, this might alleviate the computational difficulties as well. For example, if you are nearly certain that a parameter will be less than 1 in absolute value, you could assign it a normal(0, 1) prior distribution, which can pull the posterior away from regions of parameter space which are substantively irrelevant but could cause computational problems.

We are not saying that adding prior information will necessarily resolve the computational problems revealed by divergences or poor mixing, nor are we recommending the use of a prior solely for the purpose of stabilizing computation."
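(As an illustration, not part of the proposed text: adding that kind of prior is a one-line change in a Stan program. The regression and the variable names below are made up:)

```stan
// Illustrative regression where we are nearly certain the
// coefficient is modest in magnitude.
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 1);    // informative prior instead of an implicit flat prior
  sigma ~ exponential(1);
  y ~ normal(beta * x, sigma);
}
```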

2 Likes

Yeah I completely agree they should have specified their actual priors from the outset. I’m just saying our warning messages weren’t encouraging them to do that. I don’t think they should hack a prior to avoid divergences, but for people who haven’t learned how to specify priors well (which is hard), it doesn’t help them if we avoid mentioning that and only mention adapt_delta. Then they’ll just have the same problem the next time instead of potentially learning to do a better job specifying their actual priors.

1 Like

I disagree that “they should have specified their actual priors from the outset.” Realistically, we build our models step by step.

People sometimes have a wrong intuition which goes like this: (a) a model without priors, or with very weak priors, is simpler or safer than a model with strong priors, and (b) simpler models are easier to compute. Put these together, and people often mistakenly use too-weak priors because they think it will make the model easier to run. So to learn that strong and scientifically supported priors can actually make computation work better . . . that can be a revelation to people!

That wrong intuition is still prevalent and still wrong even if there are zero divergent transitions after warmup. There are a lot of posterior distributions that Stan can draw from just fine that yield overly imprecise estimates because the priors were garbage.

I’m really skeptical that giving people a wider menu of options and providing more examples in this area is going to improve estimation in the aggregate. It is certainly going to increase the prevalence of prior hacking to some degree. Increasing adapt_delta was simple, applicable to all models, and didn’t make things worse.

1 Like

I think increasing adapt_delta does make things worse! It encourages people to stay with bad models and it makes them spend more time on bad models. That’s bad!

Anyway, I’m ok with including adapt_delta on that webpage. My problem is that the current warning message only mentions adapt_delta. It privileges the sweep-the-problem-under-the-rug approach.

1 Like

Hmm, to me it seems that @andrewgelman and @bgoodri are both right but considering different subsets of our users. To summarize for anyone who is coming late to this, my understanding is that

  • Ben is concerned about people prior-hacking until they have no divergences (in which case it’s hard to call it a prior anymore). There are definitely users who would do this if we recommended changing priors to help with divergences.

  • Andrew is saying that many people wouldn’t be trying to prior hack, they just don’t know that it’s not a good idea to use intentionally weak priors, and so mentioning that it’s good to use informative or weakly informative priors will encourage better modeling practices.

Both of those points of view seem correct when considering particular subsets of the user base. There are definitely users who will tune their priors to avoid divergences. But there are also users who will want to learn that weak priors can be problematic for both inference and computation.

5 Likes

Good point. It would be good to make both points in the documentation that we refer to.

If people have a model that is undermined by priors inconsistent with their own prior beliefs, increasing adapt_delta does not make the results worse. Maybe the divergences go away, which is better because the draws then correspond to the (bad) posterior distribution they defined. Maybe the divergences don’t go away, in which case they are in the same boat as they were before.

The only sense in which increasing adapt_delta makes anything worse is relative to a counterfactual where, instead of increasing adapt_delta, they changed the priors to better correspond to what they actually believed before seeing the data, and maybe ended up with both a better model and no divergent transitions. I would guess that applies to at most 5% of Stan usage, because most of the people who are willing and able to do that are already doing prior predictive checking.

On the other side, I’d venture that about 25% of Stan usage (including usage via rstanarm, brms, and other packages) produces warnings with the Stan defaults but is salvageable for some values of adapt_delta and max_treedepth. Without something simple, constructive, and repeatable in the warning message like “Try increasing adapt_delta”, some chunk of those users are going to devolve into prior hacking, some are going to ignore the warnings and report the results anyway, and another chunk are going to give up. None of those are great outcomes.

You write, “Maybe the divergences go away, which is better because the draws correspond to the (bad) posterior distribution they defined.” I don’t know that this is better.

You write, “Maybe the divergences don’t go away, in which case they are in the same boat as they were before.” No, not the same boat: they’re running a bad model more slowly, which will take up time that they could be using to run a better model.

I think that this adapt_delta thing really is causing a problem in that the Stan community is getting littered with bad models with adapt_delta=0.999. People see the adapt_delta message and they don’t even think of fixing their models. In my experience, lots of computational problems can be fixed or alleviated using prior information that’s already in users’ possession. But they don’t think of adding this prior information, in part because lots of our examples (including my books! sorry!) use weak priors. So if we’re pointing people to things to do here, I very strongly think that priors should be one of the suggested options.

Again, I agree that adapt_delta should be in the toolbox. I just think the current warning message is bad because it gives adapt_delta as the only step forward. Given that we’re pointing to a webpage anyway, I think it’s better to just point to the webpage and then users will see the story. We can change the warning message to this (Text for warning message):

“There were 18 divergent transitions after warmup, which may compromise the validity of the estimates. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup to find out why this is a potential problem and how to eliminate them.”

and add the two paragraphs here (Text for warning message) to the documentation.

I’ve already changed the warning message to remove the part about adapt_delta on CRAN. I just don’t think giving people a second chance to change their prior distributions to correspond to what they really thought — when they already missed their first chance — is going to result in a lot of improved models. And taking adapt_delta out of the message is definitely going to unfix a bigger percentage of cases.

Just so that it doesn’t get lost for anyone updating the webpage, here is a list of recommendations I came up with for my blog post on divergences - maybe this would be a good starting point?

  1. Check your code. Twice. Divergences are almost as likely to be the result of a programming error as of a truly statistical issue. Do all parameters have a prior? Do your array indices and for loops match?
  2. Create a simulated dataset with known true values of all parameters. It is useful for so many things (including checking for coding errors). If the errors disappear on simulated data, your model may be a bad fit for the actual observed data.
  3. Check your priors. If the model is sampling heavily in the very tails of your priors or on the boundaries of parameter constraints, this is a bad sign.
  4. Visualisations: use mcmc_parcoord from the bayesplot package, ShinyStan, and pairs from rstan. See also: Documentation for Stan Warnings (contains a few hints), Case study - diagnosing a multilevel model, Gabry et al. 2017 - Visualization in Bayesian workflow
  5. Make sure your model is identifiable - non-identifiability and/or multimodality (multiple local maxima of the posterior distribution) is a problem. Case study - mixture models, my post on non-identifiable models and how to spot them.
  6. Run Stan with the test_grad option.
  7. Reparametrize your model to make your parameters independent (uncorrelated) and close to N(0,1) (i.e., change the actual parameters and compute your parameters of interest in the transformed parameters block).
  8. Try non-centered parametrization - this is a special case of reparametrization that is so frequently useful that it deserves its own bullet. Case study - diagnosing a multilevel model, Betancourt & Girolami 2015
  9. Move parameters to the data block and set them to their true values (from simulated data). Then return them one by one to the parameters block. Which parameter introduces the problems? (A minimal sketch of this trick follows the list.)
  10. Introduce tight priors centered at true parameter values. How tight do the priors need to be to let the model fit? Useful for identifying multimodality.
  11. Play a bit more with adapt_delta, stepsize, and max_treedepth. Example
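(To make item 9 concrete, a minimal sketch of the move-to-data trick; the tiny model and the names mu and tau are just for illustration, not from any particular example:)

```stan
// Debugging sketch: tau used to live in the parameters block.
// Fixing it at its known true value (from the simulated data)
// removes it from the sampling problem; if the divergences
// disappear, tau's part of the model is implicated.
data {
  int<lower=0> N;
  vector[N] y;
  real<lower=0> tau;  // formerly a parameter, now held fixed
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 5);
  y ~ normal(mu, tau);
}
```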

I would also certainly point to Mike’s case study on identifiability and divergences… Identity Crisis and the vignette on visual diagnostics: Visual MCMC diagnostics using the bayesplot package • bayesplot

For easier maintainability, we might also want to replace the link with a link to Discourse. Discourse supports fixed routes, so we could have something like Divergent transitions - a primer - General - The Stan Forums link to a summary topic (in wiki mode) and be able to redirect it to a new topic, should it need a serious update.

6 Likes

Yeah, I think that’s a good idea.

In fact we need a place to point CmdStanR users to as well, but the current webpage has example code just for RStan. Instead of each interface having a separate page for this, maybe it makes sense to have a conceptual page on Discourse like @martinmodrak suggested, and then that could link to interface-specific directions (for R, Python, Julia, whatever) if we want.

3 Likes

The page being linked to is currently part of the stan-dev.github.io website - cf. this Website page on Guide to Stan’s Warnings

People come to Discourse looking for answers, but this is something that should be in the Stan docs - there are sections in both the Stan User’s Guide (advice) and the Reference Manual (definitions), and also in the CmdStan guide: 17 diagnose: Diagnosing Biased Hamiltonian Monte Carlo Inferences | CmdStan User’s Guide

@mitzimorris Yeah you’re right that this is really a documentation issue.

Also, I didn’t notice this when I was reading the new CmdStan guide before (sorry) but I think this quote from the CmdStan guide (bold added by me)

If the divergent transitions cannot be eliminated by increasing the adapt_delta parameter, we have to find a different way to write the model that is logically equivalent but simplifies the geometry of the posterior distribution.

isn’t quite right because, as @andrewgelman is suggesting in this discussion, we don’t necessarily have to reparameterize if, for example, adding more prior information is sufficient. In the case of Neal’s funnel we have to reparameterize, but not all cases are like that.

1 Like

What Discourse needs is a FAQ.
Is there a hook for “New Issue” which would alert folks to the existence of such a FAQ?
(Sorry - I know that a) this is slightly off topic, and b) I didn’t check the FAQ before suggesting this)

1 Like