Hi all. I’m helping some colleagues analyze some count data. However, I can’t get a good model fit and am looking for some advice.
This is a genomic study, but you don’t really need to understand biology to understand the problem. We have sampled a bunch of species, and tallied the number of genes in each gene family. We now want to know the number of genes in a family can be explained by an ecological variable. We are interested in the per-family slopes.
Here’s the variables that actually matter:
- genes: how many genes we found, count data.
- family: to what gene family these genes belong, categorical.
- species: what species this was measured in
- ecology: a continuous predictor variable
The model is straightforward:
genes ~ 1 + ecology + (1 + ecology | family)
(The full model also has parameters for effects of species and phylogeny, but that’s not important here.)
So I expected these counts to be over-dispersed, and fitted a negative binomial model. However, the posterior predictive check looks problematic to me:
(I’ve zoomed in on the lower range here.)
I think the problem here is that our data has a lack of 0 values to confirm to the assumptions of the negative binomial model. This is because a gene family can only be observed if it had at least 1 gene in one of the species. Since the number of genes per family doesn’t vary that much between species, we end up with too few zeros.
- Does the posterior predictive check also look problematic to you?
- Is there way I can account for unobserved zero values? Is there a different family I can use?
Appreciate any help!