Appropriate count distribution for false detection in occupancy models

I’ve been having quite a bit of trouble with the model as specified above when testing it on my own data: it almost invariably predicts occupancy as the proportion of samples with reads, regardless of contamination. On reflection, I first thought I was simply predicting the number of contaminants poorly, and improved that part of the model (there’s a geometric element to the prediction due to the combinatorial nature of the problem), but this didn’t help either. However, while working on this I realised that the model above gets the generative process wrong.

Instead of each sample containing some number of contaminant reads, it is more accurate to think of the total read pool for a taxon as fixed. When tag jumping occurs, sequences switch out of their original sample into another, with rates proportional to the number of reads available to switch between any pair of tag combinations. Samples that initially contain zero sequences for a taxon can only gain reads, but samples that initially contain reads can both gain and lose them. The problem therefore only manifests as ‘contamination’ in samples that originally contained no sequences; everywhere else the read count can end up either larger or smaller than it started.
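To make that concrete, here is a toy simulation of one round of tag jumping (everything here — pool sizes, the per-read jump probability, and the uniform choice of destination — is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a fixed read pool for one taxon across S tag combinations.
# Most combinations start empty; the first three hold all of the reads.
S = 12
counts = np.zeros(S, int)
counts[:3] = [500, 200, 50]

jump_prob = 0.01  # per-read probability of switching tag combination

post = counts.copy()
for src in range(S):
    # Each read jumps independently; the destination is uniform over the
    # other combinations (a simplification of the pairwise-rate picture).
    n_jump = rng.binomial(counts[src], jump_prob)
    post[src] -= n_jump
    dests = rng.integers(0, S - 1, size=n_jump)
    dests[dests >= src] += 1  # exclude the source combination
    np.add.at(post, dests, 1)

print(counts)  # originally occupied samples can gain *or* lose reads
print(post)    # originally empty samples can only gain ("contamination")
```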

Considering this, I have implemented a new model. A standard occupancy model assumes that any number of detections (Y_{ij} > 0) guarantees a confirmed presence. Instead, we can model the number of reads required for a confirmed presence with a negative binomial distribution fitted to the reads from unused tag combinations, and marginalise over all plausible outcomes. This model gives much more sensible results when run on real data, and I think it makes intuitive sense: we cannot be confident a taxon is present until its read count rises above the level we see as noise in the data.
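As a sketch of what I mean by marginalising (my own notation; the NB parameters are placeholders rather than fitted values): for a single sample with observed count y_obs, sum over plausible noise counts under the negative binomial fitted to the unused combinations, counting the presence as confirmed only when the observed count beats the noise:

```python
import numpy as np
from scipy.stats import nbinom

# Hypothetical noise model: NB(r, p) fitted to read counts from unused
# tag combinations (the parameter values here are made up).
r, p = 2.0, 0.6
y_obs = 15          # reads observed for the taxon in one sample

# Marginalise over plausible noise counts c: treat the presence as
# confirmed only when the observed count exceeds the noise outcome.
c_max = int(nbinom.ppf(0.9999, r, p))   # truncate the infinite sum
c = np.arange(c_max + 1)
w = nbinom.pmf(c, r, p)                 # P(noise = c)
p_confirmed = np.sum(w * (y_obs > c))
print(p_confirmed)                      # prob. the detection beats the noise
```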

The question I have now is how to extend this model to two layers of occupancy. In my original (no false detection) model, samples are clustered within sites, and the predictors for occupancy at the site level differ from those at the within-site level. The likelihood looks like the following, where Q_{ij} is 1 if any sample at site ij is occupied and 0 otherwise:

[y_{ij\cdot} \mid \psi_{ij}, \theta_{ij\cdot}] = \begin{cases} \psi_{ij} \prod_{k = 1}^{K} \left( \theta_{ijk} [y_{ijk} \mid p_{ijk}, u_{ijk} = 1] + (1 - \theta_{ijk}) I(R_{ijk} = 0) \right) & \text{if } Q_{ij} = 1 \\ \psi_{ij} \prod_{k = 1}^{K} \left( \theta_{ijk} [y_{ijk} \mid p_{ijk}, u_{ijk} = 1] + (1 - \theta_{ijk}) \right) + (1 - \psi_{ij}) & \text{if } Q_{ij} = 0 \end{cases}
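For concreteness, here is a brute-force version of that site likelihood as I compute it on the log scale (function and variable names are my own; the per-sample detection term [y_{ijk} \mid p_{ijk}, u_{ijk} = 1] is passed in precomputed):

```python
import numpy as np

def site_loglik(log_lik_y, theta, psi, R):
    """Log of the site likelihood above, for a single site ij.

    log_lik_y : length-K array, log [y_ijk | p_ijk, u_ijk = 1]
    theta     : length-K array of sample-level occupancy probs theta_ijk
    psi       : scalar site-level occupancy prob psi_ij
    R         : length-K array of read counts per sample

    Names and structure are mine -- a sketch of the formula, not tested code.
    """
    Q = np.any(R > 0)                  # any sample at the site occupied?
    occ = np.log(theta) + log_lik_y    # occupied-sample branch, log scale
    if Q:
        # unoccupied samples are only consistent with R_ijk = 0
        unocc = np.where(R == 0, np.log1p(-theta), -np.inf)
        return np.log(psi) + np.logaddexp(occ, unocc).sum()
    else:
        per_k = np.logaddexp(occ, np.log1p(-theta))
        return np.logaddexp(np.log(psi) + per_k.sum(), np.log1p(-psi))
```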

However, I’m not quite sure how the marginalisation should work. There seem to be three possibilities (sketched in code after this list):

  1. Marginalise each sample individually, and then use only the likelihood for Q = 0. I tested this first and it does not fit well.
  2. Marginalise over the whole site likelihood, using a single loop over all samples.
  3. Marginalise over the whole site likelihood, but with each sample marginalised individually, so that we consider all possible combinations of the threshold across each set of samples.
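To pin down the computational difference between 2 and 3 (w and site_lik are stand-ins for the noise-outcome weights and the site likelihood evaluated at a given vector of per-sample outcomes, not real model components):

```python
import itertools
import numpy as np

# Stand-ins (not real model components):
#   w[c]          -- probability of noise outcome c under the NB model,
#                    truncated to C plausible values
#   site_lik(cs)  -- the site likelihood above, evaluated with per-sample
#                    noise outcomes cs = (c_1, ..., c_K)

def option_2(w, site_lik, K):
    # One shared outcome for the whole site: C terms in total.
    return sum(w[c] * site_lik((c,) * K) for c in range(len(w)))

def option_3(w, site_lik, K):
    # Every sample gets its own outcome: C^K terms.
    total = 0.0
    for cs in itertools.product(range(len(w)), repeat=K):
        weight = np.prod([w[c] for c in cs])
        total += weight * site_lik(cs)
    return total
```

Option 2 needs C evaluations of the site likelihood per site; option 3 needs C^K, which is where the intractability comes from.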

I think option 2 would be workable, if slow, but option 3 is clearly intractable as a direct sum (some sites have 45 samples). My intuition shifts a little depending on the moment, but my suspicion is that option 3 is the correct marginalisation, given that the samples are independent within sites?