There have been many posts on the Stan forum (some of which I have also tried to answer) that go along the lines of “here are my posterior predictive checks of my model; is my model good enough, and if not, how can I improve it?” As I see it, there is never an easy answer to this. In fact, although I have asked myself the same question when doing BDA, it seems that under the hypothetico-deductive view of BDA the question isn’t posed quite right. Shouldn’t it be more along the lines of: “Is my posterior predictive check severe enough? If so, do my model predictions deviate from it sufficiently to ‘falsify’ my model? And if so, how do I improve my model (i.e., adjust my hypothesis)?” (I put ‘falsify’ in quotes because we know a priori that the model is always false.)

This has a different flavor from what some seem to do, which is simply trying to make deviations between the pp check and the data ‘go away’ (something I have also done). Just as many practical users of frequentist models may erroneously use them in a confirmationist way, I wonder how many practical users of BDA are motivated only to make discrepancies between the pp check and the data ‘go away’, without thinking about the falsification that this implies. The motive for making changes to the model seems important, because that motive directs the types of changes made. For example, I wouldn’t think just any change that improves the pp checks necessarily implies a reasonable improvement from a falsification view, but only one where you are actually ‘testing’ a reasonable hypothesis.
So here is the practical BDA question. Let’s say I am a psychologist and I have a few causal models, which I have programmed as generative models in brms or Stan, to analyze some data. Social science is really hard (btw, I am not a psychologist; this is all hypothetical), so I have more than one plausible generative model. Now let’s say these models are multilevel models, so they are pretty flexible. It’s quite possible that their posterior predictive checks (using the typical suite of graphical check types available in brms or bayesplot) would look remarkably similar. In fact, they might all look pretty good in terms of pp checks! Is this a problem with the severity of the check? If so, how do I make it more severe?
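To make the setup concrete, here is a sketch (the formulas, data frame `dat`, and grouping factor `group` are entirely hypothetical):

```r
library(brms)

# Two plausible generative (multilevel) models for the same hypothetical data
fit1 <- brm(y ~ x + (1 | group), data = dat)
fit2 <- brm(y ~ x + (x | group), data = dat)

# The default graphical checks can look nearly identical for flexible
# multilevel models, even when the models encode different causal stories
pp_check(fit1, ndraws = 100)
pp_check(fit2, ndraws = 100)
```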
I could compare my models via LOO, but cross-validation doesn’t seem to be quite the same task as looking for deviations between data and model. CV seems more like model comparison. Can model comparison via CV be falsificationist? Prediction and model checking seem different… but obviously related. What is the role of LOO in hypothetico-deductive, falsificationist BDA?
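For reference, in brms the comparison I have in mind would look something like this (`fit1` and `fit2` are the hypothetical models from above); the pointwise elpd contributions feel somewhat closer in spirit to checking than the single summary difference does, but I’m not sure that makes it falsificationist:

```r
library(brms)

# PSIS-LOO for each candidate model
loo1 <- loo(fit1)
loo2 <- loo(fit2)

# Summary comparison of expected log predictive density
loo_compare(loo1, loo2)

# Pointwise elpd differences, which can at least flag *which* observations
# one model predicts worse than the other
elpd_diff <- loo1$pointwise[, "elpd_loo"] - loo2$pointwise[, "elpd_loo"]
```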
One nice thing about posterior predictive checks is that they can be based on arbitrary functions of the data. Careful posterior predictive checking can be a matter of coming up with well-chosen functions that check specific model assumptions in precise ways. Ideally, you would pick a function that yields a predictive check that, when it fails, suggests an avenue for model refinement and improvement. For example, you might compute the posterior predictive distribution for a measure of spatial autocorrelation and compare it to the observed value to check whether the model is failing to capture excess spatial variation.
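For instance, bayesplot lets you pass any function of the data as the test statistic. A minimal sketch, assuming a fitted brms model `fit`; the lag-1 autocorrelation statistic here is just a crude stand-in for whatever discrepancy actually matters in the application (e.g. Moran’s I in the spatial case):

```r
library(brms)
library(bayesplot)

# Replicated data sets from the posterior predictive distribution
yrep <- posterior_predict(fit, ndraws = 500)  # draws x observations
y    <- standata(fit)$Y                       # observed outcome

# A custom discrepancy function: lag-1 autocorrelation of the outcome
# in its stored order
lag1_cor <- function(x) cor(x[-1], x[-length(x)])

# Compare T(y) to the distribution of T(y_rep)
ppc_stat(y, yrep, stat = lag1_cor)

# The corresponding posterior predictive p-value
mean(apply(yrep, 1, lag1_cor) >= lag1_cor(y))
```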
Here’s a really nice example that develops a suite of functions of the data (here termed “discrepancy functions”) to confront a series of specific model assumptions: https://www.pnas.org/doi/10.1073/pnas.1412301112
Somewhat less well known but also useful is the idea of mixed predictive checking, where the check is based on a function not just of the data but also of the parameters. For example, if your model includes a Gaussian random effect, and you care about making predictions for new groups, you might construct a test based on, say, the sample skewness of the fitted random-effect vector, and ask how that compares to the sample skewness of random-effect vectors simulated from their hyperparameters.
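Here is a rough sketch of what that could look like for a brms model with a Gaussian group-level intercept; the object `fit`, the grouping factor `group`, and the simple skewness function are assumptions for illustration, not any standard recipe:

```r
library(brms)
library(posterior)

skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Realized group-level intercepts: an array of draws x groups
r <- ranef(fit, summary = FALSE)$group[, , "Intercept"]

# Posterior draws of the random-effect standard deviation (same draw order)
sd_draws <- as_draws_df(fit)$sd_group__Intercept

# T(r): skewness of the fitted random-effect vector, per posterior draw
t_realized <- apply(r, 1, skew)

# T(r_rep): skewness of random effects re-simulated from the hyperparameters
t_rep <- sapply(seq_along(sd_draws),
                function(s) skew(rnorm(ncol(r), 0, sd_draws[s])))

# Mixed predictive p-value: extreme values suggest the Gaussian assumption
# on the random effects is doing work that the data resist
mean(t_rep >= t_realized)
```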
Because “all models are wrong”, if you work in settings with lots and lots of data, PPCs should generally fail. To be useful, you need to be able to learn something actionable from the failure. Unfortunately, the pressure in the applied academic literature seems to be to select hyper-generic, insensitive functions in order to be able to claim that the PPCs “pass”.
Indeed. This is a great point. I would love to see more examples of this in the wild, where a check is specifically developed for the model at hand. If one is approaching this from the falsification viewpoint, then it would seem that coming up with a sensitive and severe check is as important as coming up with the model itself. I appreciate the link and will read the article, though from a skim it seems pretty genetics-knowledge intensive (knowledge that I lack).
Nice. That makes a lot of sense. I don’t remember seeing this before.
Yes, exactly. That is really one of the main points I was trying to make to motivate my question about increasing the severity of the pp check.
I understand what you are saying, but isn’t that an indicator of an ill-formed check? It seems like even with big data one could still look at the severity of the check?
Yes, oddly enough, since the goal should be the opposite.
Any idea how, or whether, LOO-CV relates to this? For example, if pp checks don’t work very well for big data, as you seem to indicate, then what do you turn to that is “actionable”?
Just as one particular example of this, the folks who put together the edstan package include examples of using mixed PPCs in the context of item response theory: Two-Parameter Logistic Item Response Model. Otherwise, like you, I don’t really know of any other contexts in which this has been explicitly used, though it certainly seems quite powerful for model checking.