How to adjust model specification based on posterior predictive check results

Before giving the details of my actual model, I would like to collect some general ideas on how to adjust a model specification when a posterior predictive check suggests that the model is misspecified. One thing I know is that, for example, if I model count data with a Poisson distribution and the observed data show more 0s than the predicted data, it is better to use a negative binomial distribution instead. Beyond that, I am not sure what specific guidance a posterior predictive check can give.

In my case, I am modeling count data with a zero-inflated negative binomial distribution and a random effect. The data drawn from the posterior predictive distribution are a lot larger than the observed data. Below is the QQ plot, with the x-axis representing the observed data and the y-axis representing the predicted data. The line is the x=y line.
[Figure: QQ plot of observed vs. predicted data (Screen Shot 2021-06-13 at 7.55.30 PM)]

In addition, comparing the proportion of 0s in the observed data with that in the predicted data, I found that the proportion of 0s in the predicted data is actually higher than in the observed data. So I switched to a negative binomial distribution (without zero inflation), but it didn’t help at all. Is there any general advice for this type of situation? I will provide more details of the model if someone is interested in taking a closer look.

Some additional quick questions:

  1. Will model reparameterization affect the posterior predictive check? Put another way, can reparameterization turn a misspecified model into a more correctly specified one?
  2. I have also seen high autocorrelation during model fitting. Is this related to the specification of the model, or is it an independent issue related to other things like the sampler and the number of iterations?



Sorry for not getting to you earlier, this is a relevant and well-written question!

Unfortunately I am not aware of any such general ideas. In my experience, the problems and solutions differ extremely from case to case and there are few shared principles beyond “understand your model and data” and “think hard” :-/

So I don’t think I can provide much better advice without seeing the model and the data.

That switching to the plain negative binomial did not help is not unexpected - it is quite possible that the zero inflation in the inflated model was well informed by the data and fitted as very low (as the data don’t seem to support much zero inflation), and thus both models could provide almost identical predictions.

I would try to understand why the model predicts such large values. One possible cause is that your prediction code and your model do not match, or that there is some other bug. It is quite rare for linear models to not get at least the overall mean of the data right, but your model seems to have problems even there, hence I would suspect a bug.
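The two summaries discussed in this thread, the overall mean and the proportion of zeros, can be compared between observed and replicated data with a few lines of code. The following is only a minimal sketch: the function names (`ppc_stat`, `prop_zero`, `mean`) are hypothetical helpers, and the numbers in `y_obs` and `y_rep` are made-up toy values, not data from the actual model.

```python
def ppc_stat(y_obs, y_rep, stat):
    """Compare a statistic of the observed data with its replicated values.

    y_obs: list of observed counts.
    y_rep: list of replicated datasets, one list per posterior draw.
    Returns (observed statistic, replicated statistics,
    fraction of replicates whose statistic is >= the observed one).
    """
    t_obs = stat(y_obs)
    t_rep = [stat(rep) for rep in y_rep]
    frac_ge = sum(t >= t_obs for t in t_rep) / len(t_rep)
    return t_obs, t_rep, frac_ge


def prop_zero(y):
    """Proportion of exact zeros in a dataset."""
    return sum(1 for v in y if v == 0) / len(y)


def mean(y):
    """Arithmetic mean of a dataset."""
    return sum(y) / len(y)


# Toy values only (assumption): replicated data with both larger counts
# and more zeros than the observed data, as described in the question.
y_obs = [0, 1, 3, 0, 2, 5, 1, 0]
y_rep = [
    [0, 0, 9, 12, 0, 7, 0, 15],
    [0, 10, 0, 0, 8, 0, 11, 6],
    [4, 0, 0, 13, 0, 0, 9, 0],
]

for name, stat in (("mean", mean), ("prop_zero", prop_zero)):
    t_obs, t_rep, frac_ge = ppc_stat(y_obs, y_rep, stat)
    print(name, t_obs, t_rep, frac_ge)
```

A fraction close to 0 or 1 means the observed statistic sits in the tail of the replicated ones, which points at either misspecification or a mismatch between the model and the prediction code.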

Please do provide more details of the model.

On reparameterization: usually not - unless the original model is badly behaved (you get divergences / high R-hat / …), a reparameterization should result in the same predictions.
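To illustrate why a reparameterization by itself should not change predictions, here is a small sketch using the common centered vs. non-centered forms of a normal random effect. This is an assumption for illustration only, since the thread does not say which reparameterization is meant, and the values of `mu` and `tau` are hypothetical.

```python
import random
import statistics

random.seed(1)
mu, tau = 2.0, 0.5  # hypothetical population mean and scale
n = 20000

# Centered form: draw theta directly from Normal(mu, tau).
centered = [random.gauss(mu, tau) for _ in range(n)]

# Non-centered form: draw a standard normal z, then shift and scale it.
non_centered = [mu + tau * random.gauss(0.0, 1.0) for _ in range(n)]

# Both forms describe the same distribution for theta, so their
# summaries should agree up to Monte Carlo error.
print(statistics.mean(centered), statistics.stdev(centered))
print(statistics.mean(non_centered), statistics.stdev(non_centered))
```

The two forms differ only in the geometry the sampler has to explore, not in the distribution they imply, so a misspecified model stays misspecified after such a change.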

On autocorrelation: could be. Quite often, computational problems (including high autocorrelation) indicate that the model has trouble fitting the data (Andrew Gelman calls this the “folk theorem”), but it is not a hard rule and it could be a separate issue.

Best of luck with your model!


Sorry for the late reply, and thank you for the useful answers! I tested different versions of the model and finally found the problem. The data I am trying to model are highly overdispersed, and the overdispersion depends on another variable. The models that didn’t work all assumed that the dispersion parameter is constant. After modeling it as a random variable that depends on another variable, everything works pretty well. Again, thank you for all your answers! They are very helpful!
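For readers who want to see what a covariate-dependent dispersion looks like, here is a rough simulation sketch. It is not the actual model from this thread: the log-linear form `log(phi) = a + b * x`, the coefficient values, and the helper names (`rpois`, `rnegbin`, `phi_given_x`) are all assumptions made for illustration. The negative binomial uses the mean/dispersion parameterization, with variance `mu + mu**2 / phi`.

```python
import math
import random
import statistics


def rpois(lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1


def rnegbin(mu, phi):
    """Negative binomial with mean mu and dispersion phi,
    drawn as a gamma-Poisson mixture (gamma shape phi, scale mu / phi)."""
    lam = random.gammavariate(phi, mu / phi)
    return rpois(lam)


def phi_given_x(x, a=3.0, b=-3.7):
    """Hypothetical dispersion model: log(phi) = a + b * x."""
    return math.exp(a + b * x)


random.seed(7)
mu = 5.0  # same mean in both groups, so only the dispersion differs
summary = {}
for x in (0.0, 1.0):
    phi = phi_given_x(x)
    draws = [rnegbin(mu, phi) for _ in range(20000)]
    summary[x] = (statistics.mean(draws), statistics.variance(draws))
    print(x, round(phi, 2), summary[x])
```

Both groups share the same mean, but the group with the smaller `phi` has a much larger variance. A model with a single constant dispersion parameter cannot reproduce that pattern, which is consistent with the failed posterior predictive checks described above.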
