Automated posterior predictive checks


I am testing a hypothesis on the distribution of distances between transcription factor binding sites and genes in the human genome. I have a model that fits well for several binding motifs I picked randomly and now I want to test whether it can fit all motifs I am interested in (several hundreds). During development, I checked the fits by hand by visually inspecting posterior predictive check (PPC) plots for density and few functions (median, max, min, IQR). Now I need to automate these checks.

The model for each motif has 5 parameters, while there are 5e3 - 1e5 datapoints for each motif, so it seems that fitting a hierarchical model would be an overkill and I thus fit each motif separately, which has reasonable running time.

So my idea is that for each motif I do a PPC of ~50 quantiles and than look at the distribution of the actual quantiles of the data within the PPC quantiles. Then I check, if there are
a) quantiles where the distribution of data values across motifs is non-uniform
b) motifs where the distribution of data values across quantiles is non-uniform
I might even calculate some p-values on that (whoa)…

Does that sound reasonable or is it a footgun? Has a similar approach been formalized somewhere? I’ve read a paper (can’t find the reference) where they do a similar thing but to test whether the model works when fitting simulated data (I thing they transform the uniformity test to a normality test). Is there a reason this approach might not be valid for testing model fit to actual data ? Thanks for any hints!

Model selection as a decision (without focus on prediction)

So an update: turns out this approach (at least in its basic form) is not suitable for my case. The distribution of the “quantiles of quantiles” is very frequently non-uniform, even for cases, I would consider a good fit. For example, here is a density PPC:

Looks good to me! (and PPCs for quantities of interest such as min, max or sd are also good]).

Now let’s have a look at the relation between quantiles of replicates and quantiles in the real data:

So the quantiles of the actual data are very likely to lie in the middle of the replicated quantiles and the distribution is non-uniform.

One might argue that the quantiles being concentrated at the center is a good thing and in a sense better than uniform, but I have (so far) trouble formalizing this idea. Will update here if I make further progress.