Dealing with unobserved groups with count data

Hi all. I’m helping some colleagues analyze some count data. However, I can’t get a good model fit and am looking for some advice.

This is a genomic study, but you don’t really need to understand biology to understand the problem. We have sampled a bunch of species, and tallied the number of genes in each gene family. We now want to know the number of genes in a family can be explained by an ecological variable. We are interested in the per-family slopes.

Here’s the variables that actually matter:

  • genes: how many genes we found, count data.
  • family: to what gene family these genes belong, categorical.
  • species: what species this was measured in
  • ecology: a continuous predictor variable

The model is straightforward:

genes ~ 1 + ecology + (1 + ecology | family)

(The full model also has parameters for effects of species and phylogeny, but that’s not important here.)

So I expected these counts to be over-dispersed, and fitted a negative binomial model. However, the posterior predictive check looks problematic to me:

Rplot

(I’ve zoomed in on the lower range here.)

I think the problem here is that our data has a lack of 0 values to confirm to the assumptions of the negative binomial model. This is because a gene family can only be observed if it had at least 1 gene in one of the species. Since the number of genes per family doesn’t vary that much between species, we end up with too few zeros.

  • Does the posterior predictive check also look problematic to you?
  • Is there way I can account for unobserved zero values? Is there a different family I can use?

Appreciate any help!

I figured out a hurdle Poisson model fixes this issue.

Unfortunately, the negative binomial model already took a long time to fit, and fitting the hurdle Poisson model on the full dataset is not feasible for us.