Bad chain diagnostic, but good data recovery

nerpa · December 14, 2020, 2:56pm

Hello all,

I have a single chain sampling with 1000 warmup and 1500 sampling iterations that I run with cmdstanpy. Running the diagnose command on the output data shows that the chain performs really bad (high divergence, very low BFMI).
However, my posterior predictive checks look very good and I am able to recover the original data.
What am I missing? How is this even possible?

Thank you!

Trung_Dung_Tran · December 14, 2020, 3:03pm

It may be better if you could provide model details such as R hat or the procedure/plot used for PPC. In general, I think, it is better to have convergence before talking anything else.

nerpa · December 14, 2020, 3:11pm

Thanks! R hats are all bad, but since I have only one chain I am not sure how this is indicative. What do you mean by the procedure used for the PPC?
I use the rng functions, as in the intertemporal choice task, for example:

ev_later   = amount_later / (1 + itc_k * delay_later);
ev_sooner  = amount_sooner;
y_pred = bernoulli_rng(inv_logit(itc_beta* (ev_later - ev_sooner)));

jsocolar · December 14, 2020, 4:09pm

Posterior predictive checking is good for detecting when a model provides poor fit to the data, but it is not a replacement for MCMC diagnostics, which don’t care about the quality of fit to data and instead have to do with whether the MCMC chains have converged to the correct posterior distribution (i.e. the correct distribution given the model, which may or may not be misspecified). Suppose the true posterior distribution for a one-parameter model is normal(0,3). And suppose your MCMC fitting fails and yields and incorrect posterior of normal(1, 0.1). This is terrible inference that must not be trusted, but posterior predictive checking would look fine (as long as the model isn’t badly misspecified)! We know posterior predictive checking would look fine (as long as the model is well specified) because normal(0, 0.1) is entirely consistent with the correct posterior and therefore must be capable of generating data that look comparable to the observed data.

Therefore, it is crucial to first get a model that passes the MCMC diagnostics (R-hat, effective sample size, divergences, BFMI). And then it’s time to do posterior predictive checking to see whether the model that you’ve fit is consistent with (and therefore hopefully a reasonable approximation to) the true data generating process.

Note that PPCs aren’t always very sensitive even for what they are designed to do (i.e. diagnose model misspecification). Posterior predictive checking is a bit of a fishing expedition to search for bad fit. Different PPCs will be sensitive to different forms of misspecification, and no PPC is sensitive to every form of misspecification. And even well-chosen PPCs (tailored by a careful researcher to detect particular relevant forms of misspecification) might be underpowered if the dataset is small.

nerpa · December 14, 2020, 5:46pm

Thank you so much!! I know it is a different topic, but where do I start to understand what is wrong with my model? I have ~25k parameters and a few million data points…
Also does it even make sense to run only 1 chain? I know that R-hat estimates are not indicative with 1 chain, but what if all the other parameters are ok?

martinmodrak · December 14, 2020, 6:46pm

Running a single chain is heavily discouraged precisely because it makes the Rhat much less sensitive. Debugging big models ishard and it is usually sensible to start with a smaller model and only a subset of the data - hard to say anythingmorespecific without seeing your model and/or your data. Some additional pointers can be found in the “Bayesian workflow” preprint by Gelman et al. and at Divergent transitions - a primer

Best of luck with your model!

Trung_Dung_Tran · December 14, 2020, 6:46pm

Because we can run chains parallelly, you should check with 3 or 4 chains. That way when you see the trace plot, it gives you more information than only one chain.

nerpa · December 14, 2020, 7:09pm

Thank you all! My model was sampling fine (with 1 chain) before I added the QR decomposition. So I am trying to figure out what’s up there.
I realize that one chain is suboptimal and that all Stan tutorials use 4 chains. Would it make sense to run 2 chains, or is this heavily discouraged as well?

martinmodrak · December 14, 2020, 7:12pm

The more the better, so 2 better than one. :-) Eventually it is about the trafeoff you are willing to make between compute resources consumed and the risk of getting wrong/nonsensical inferences…

nerpa · December 14, 2020, 7:15pm

I totally get it, but for me it is a difference between 20 days or 5 days of sampling :)
Thanks!

nerpa · December 14, 2020, 7:17pm

But I will run at least 2 chains.

martinmodrak · December 14, 2020, 7:27pm

Still, it is a) quite likely there is a problem with the model, so running with a subset and simpler model first is definitely a good step (because it let’s you iterate much faster). Issues with the model also frequently mean the model samples much slower (a.k.a. the “folk theorem”) and b) you might want to consider using an approximation: INLA is quite a good package for approximate multilevel models if your model can fit this box. But point a) still holds even there - especially testing the smaller model/data works with Stan, checking the results on smaller model data are the same between Stan and INLA and then running INLA for the big model is IMHO a very robust workflow.

Trung_Dung_Tran · December 14, 2020, 7:56pm

I think, because of this, it takes a long computational time. If you have to run with the full dataset, I would still suggest to check the Stan program with a subset first. Maybe you see something already with that subset. To check for problem with codes, convergence (but not final estimates), in my experience, I often check with 100, 500, 1000, or 2000. After that, I often set 5000 for final estimates.

Topic		Replies	Views
Multi-chain vs single-chain Developers	7	2197	March 7, 2023
Prooblem about Model and Distribution General	6	632	April 18, 2019
Model Checking/Validating Best Practices General rstan	3	1256	December 21, 2022
Chains not mixing, exceeding treedepth in epidemiological model Modeling	6	738	December 16, 2020
Last chain hangs up during fitting Modeling fitting-issues	5	2784	July 30, 2019

Bad chain diagnostic, but good data recovery

Related topics