Hi all,
I have a couple of questions about the three-level occupancy model I am building. The aim of the model is to model the number of DNA sequences encountered of species i (out of a total number of DNA molecules sequenced) conditional on its occupancy at site j, in replicate k. The model looks something like the following, using the a = p \times \phi, b = (1 - p) \times \phi) parameterisation of the beta binomial:
After some considerable effort to understand the maths, I have marginalised out the latent discrete parameters, giving:
Where Q_{ijk} is 2 if y_{ijk} > 0, 1 if any y_{ij} > 0, and 0 if all y_{ij} = 0.
First, one of the quantities I would like to work with in the model are the posterior occupancy states (at both the site z_{ij} and sample u_{ijk}) for the sites and samples used in the model. For sites/samples for which we have observed sequences for a species, these are obviously 1, but for samples and sites with no detected sequences, we need to condition these on y_{ijk} = 0 and z_{ij} (for samples), and all y_{ijk} = 0 and u_{ijk} (for sites). Itās reasonably clear how to do this for samples where Q_{ijk} = 1 - applying Bayesā formula for 3 events, we get the following, which recapitulates nicely the equation for z_i in a standard two-level occupancy model:
For sites with no detections (Q_{ijk} = 0), we can predict both z_{ij} for the site, and u_{ijk} for each sample at the site. However, I am not sure what parameters I need to use to estimate these. Do I first have to calculate P(z_{ij} = 1 \mid \mathbf{y}_{ij} = 0, \mathbf{u}_{ij}), and then use this as P(z_{ij}) when calculating P(u_{ijk} = 1 \mid y_{ijk} = 0, z_{ij})? Or do I want to condition this on the unconditional probability of occupancy P(z_{ij}) = \psi_{ij}? The former would seem to condition twice on y = 0 - once for all y at the site level, and then another time for each sample - does that lead to valid inference?
Second, thinking about the data generatively, what we measure is the number of sequences belonging to each species in a sample, out of a total number of sequences obtained. To model this, the original model that I referenced when building this model (https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13732) uses a multinomial likelihood, where the probability of encountering each species in a sample is given by:
Where r_{ijk} represents the relative dominance of a species within a sample, conditional on its presence. This allows the model to account for overdispersion in p_{ijk}, as this parameter is likely to vary depending on the presence or absence of other species in the sample (a species that is present at some very low absolute abundance in all samples would have a relative proportion of 1 if it was the only species in the sample!). Unfortunately the above sum results in combinatorial explosion if you try and marginalise it in this kind of data set with 1000s of species. Thus, I marginalised the above multinomial to a binomial for each species, and then modelled the proportions with a beta distribution (resulting in a beta-binomial) to capture overdispersion in counts resulting from these changes in relative abundance.
Assuming I can estimate the above u_{ijk} and z_{ij} for each sample, would it be valid to fix these in a new model (passing each set of samples as data) to estimate r_{ijk}? The models are not exactly equivalent, which gives me some pause - but it would be very valuable to make these joint predictions of all species at a fixed number of sequences, to account for variation in the number of sequences retrieved (ranging from 1,000 to 100,000 depending on the sample) and the resulting dropout of rare species as a result of under-sampling.