Appropriate count distribution for false detection in occupancy models

I’ve been having quite a bit of trouble with the model as specified above when testing it on my own data: it almost invariably predicts occupancy as the proportion of samples with reads, regardless of contamination. On reflection, I first thought I was simply predicting the number of contaminants poorly, and improved that part of the model (there’s a geometric element to the prediction due to the combinatorial nature of the problem), but this didn’t help either. However, while working on this I realised that the model above gets the generative process wrong.

Instead of each sample containing some number of contaminant reads, it is more accurate to think of the total read pool for a taxon as fixed. When tag jumping occurs, sequences switch out of their original sample into another, with rates proportional to the number of reads available to switch between any pair of tag combinations. Samples that initially contain zero sequences for a taxon can only gain reads, but samples that initially contain reads can both gain and lose them. The problem therefore only manifests as ‘contamination’ in samples that originally contained no sequences; everywhere else the read count can end up either larger or smaller than it started.
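To make that concrete, here is a toy simulation of one round of tag jumping (everything here — pool sizes, the per-read jump probability, and the uniform choice of destination — is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a fixed read pool for one taxon across S tag combinations.
# Most combinations start empty; the first three hold all of the reads.
S = 12
counts = np.zeros(S, int)
counts[:3] = [500, 200, 50]

jump_prob = 0.01  # per-read probability of switching tag combination

post = counts.copy()
for src in range(S):
    # Each read jumps independently; the destination is uniform over the
    # other combinations (a simplification of the pairwise-rate picture).
    n_jump = rng.binomial(counts[src], jump_prob)
    post[src] -= n_jump
    dests = rng.integers(0, S - 1, size=n_jump)
    dests[dests >= src] += 1  # exclude the source combination
    np.add.at(post, dests, 1)

print(counts)  # originally occupied samples can gain *or* lose reads
print(post)    # originally empty samples can only gain ("contamination")
```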

Considering this, I have implemented a new model. A standard occupancy model assumes that any number of detections (Y_{ij} > 0) guarantees a confirmed presence. Instead, we can model the number of reads required for a confirmed presence with a negative binomial distribution fitted to the reads from unused tag combinations, and marginalise over all plausible outcomes. This model gives much more sensible results when run on real data, and I think it makes intuitive sense: we cannot be confident a taxon is present until its read count rises above the level we see as noise in the data.
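As a sketch of what I mean by marginalising (my own notation; the NB parameters are placeholders rather than fitted values): for a single sample with observed count y_obs, sum over plausible noise counts under the negative binomial fitted to the unused combinations, counting the presence as confirmed only when the observed count beats the noise:

```python
import numpy as np
from scipy.stats import nbinom

# Hypothetical noise model: NB(r, p) fitted to read counts from unused
# tag combinations (the parameter values here are made up).
r, p = 2.0, 0.6
y_obs = 15          # reads observed for the taxon in one sample

# Marginalise over plausible noise counts c: treat the presence as
# confirmed only when the observed count exceeds the noise outcome.
c_max = int(nbinom.ppf(0.9999, r, p))   # truncate the infinite sum
c = np.arange(c_max + 1)
w = nbinom.pmf(c, r, p)                 # P(noise = c)
p_confirmed = np.sum(w * (y_obs > c))
print(p_confirmed)                      # prob. the detection beats the noise
```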

The question I have now is how to extend this model to two layers of occupancy. In my original (no false detection) model, samples are clustered within sites, and the predictors for occupancy at the site level differ from those at the within-site level. The likelihood looks like the following, where Q_{ij} is 1 if any sample at site ij is occupied and 0 otherwise:

[y_{ij\cdot} \mid \psi_{ij}, \theta_{ij\cdot}] = \begin{cases} \psi_{ij} \prod_{k = 1}^{K} \left( \theta_{ijk} [y_{ijk} \mid p_{ijk}, u_{ijk} = 1] + (1 - \theta_{ijk}) I(R_{ijk} = 0) \right) & \text{if } Q_{ij} = 1 \\ \psi_{ij} \prod_{k = 1}^{K} \left( \theta_{ijk} [y_{ijk} \mid p_{ijk}, u_{ijk} = 1] + (1 - \theta_{ijk}) \right) + (1 - \psi_{ij}) & \text{if } Q_{ij} = 0 \end{cases}
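For concreteness, here is a brute-force version of that site likelihood as I compute it on the log scale (function and variable names are my own; the per-sample detection term [y_{ijk} \mid p_{ijk}, u_{ijk} = 1] is passed in precomputed):

```python
import numpy as np

def site_loglik(log_lik_y, theta, psi, R):
    """Log of the site likelihood above, for a single site ij.

    log_lik_y : length-K array, log [y_ijk | p_ijk, u_ijk = 1]
    theta     : length-K array of sample-level occupancy probs theta_ijk
    psi       : scalar site-level occupancy prob psi_ij
    R         : length-K array of read counts per sample

    Names and structure are mine -- a sketch of the formula, not tested code.
    """
    Q = np.any(R > 0)                  # any sample at the site occupied?
    occ = np.log(theta) + log_lik_y    # occupied-sample branch, log scale
    if Q:
        # unoccupied samples are only consistent with R_ijk = 0
        unocc = np.where(R == 0, np.log1p(-theta), -np.inf)
        return np.log(psi) + np.logaddexp(occ, unocc).sum()
    else:
        per_k = np.logaddexp(occ, np.log1p(-theta))
        return np.logaddexp(np.log(psi) + per_k.sum(), np.log1p(-psi))
```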

However, I’m not quite sure how the marginalisation should work. There seem to be three possibilities (sketched in code after this list):

  1. Marginalise each sample individually, and then use only the likelihood for Q = 0. I tested this first and it does not fit well.
  2. Marginalise over the whole site likelihood, using a single loop over all samples.
  3. Marginalise over the whole site likelihood, but with each sample marginalised individually, so that we consider all possible combinations of the threshold across each set of samples.
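To pin down the computational difference between 2 and 3 (w and site_lik are stand-ins for the noise-outcome weights and the site likelihood evaluated at a given vector of per-sample outcomes, not real model components):

```python
import itertools
import numpy as np

# Stand-ins (not real model components):
#   w[c]          -- probability of noise outcome c under the NB model,
#                    truncated to C plausible values
#   site_lik(cs)  -- the site likelihood above, evaluated with per-sample
#                    noise outcomes cs = (c_1, ..., c_K)

def option_2(w, site_lik, K):
    # One shared outcome for the whole site: C terms in total.
    return sum(w[c] * site_lik((c,) * K) for c in range(len(w)))

def option_3(w, site_lik, K):
    # Every sample gets its own outcome: C^K terms.
    total = 0.0
    for cs in itertools.product(range(len(w)), repeat=K):
        weight = np.prod([w[c] for c in cs])
        total += weight * site_lik(cs)
    return total
```

Option 2 needs C evaluations of the site likelihood per site; option 3 needs C^K, which is where the intractability comes from.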

I think option 2 would be workable, if slow, but option 3 is clearly intractable as a direct sum (some sites have 45 samples). My intuition shifts a little depending on the moment, but my suspicion is that option 3 is the correct marginalisation, given that the samples are independent within sites?