Conditional posterior predictions for three-level occupancy model

Hi all,

I have a couple of questions about the three-level occupancy model I am building. The aim of the model is to model the number of DNA sequences encountered of species i (out of a total number of DNA molecules sequenced) conditional on its occupancy at site j, in replicate k. The model looks something like the following, using the a = p \times \phi, b = (1 - p) \times \phi) parameterisation of the beta binomial:

\begin{align} y_{ijk} &\sim \textrm{Beta-binomial}(N_{k}, u_{ijk}p_{ijk}, \phi_{i}) \\ p_{ijk} &= f(X_{reads_{jk}}\beta_{reads_{i}}) \\ u_{ijk} &\sim \textrm{Bernoulli}(z_{ij}\theta_{ijk}) \\ \theta_{ijk} &= f(X_{rep_{jk}}\beta_{rep_{i}}) \\ z_{ij} &\sim \textrm{Bernoulli}(\psi_{ij}) \\ \psi_{ij} &= f(X_{occ_j}\beta_{occ_{i}}) \end{align}

After some considerable effort to understand the maths, I have marginalised out the latent discrete parameters, giving:

\begin{cases} \log \psi_{ij} + \log \theta_{ijk} + \log [y_{ijk} \mid \beta_i, \phi_i] &\text{if } Q_{ijk} = 2 &\\ \text{log_sum_exp} \left( \log \psi_{ij} + \log \theta_{ijk} + \log [y_{ijk} \mid \beta_i, \phi_i], \log (1 - \theta_{ijk}) \right) &\text{if } Q_{ijk} = 1 &\\ \text{log_sum_exp}\left( \text{log_sum_exp} \left( \log \psi_{ij} + \log \theta_{ijk} + \log [y_{ijk} \mid \beta_i, \phi_i], \\\log (1 - \theta_{ijk}) \right), \log (1 - \psi_{ij}) \right) &\text{if } Q_{ijk} = 0 \\ \end{cases}

Where Q_{ijk} is 2 if y_{ijk} > 0, 1 if any y_{ij} > 0, and 0 if all y_{ij} = 0.

First, one of the quantities I would like to work with in the model are the posterior occupancy states (at both the site z_{ij} and sample u_{ijk}) for the sites and samples used in the model. For sites/samples for which we have observed sequences for a species, these are obviously 1, but for samples and sites with no detected sequences, we need to condition these on y_{ijk} = 0 and z_{ij} (for samples), and all y_{ijk} = 0 and u_{ijk} (for sites). It’s reasonably clear how to do this for samples where Q_{ijk} = 1 - applying Bayes’ formula for 3 events, we get the following, which recapitulates nicely the equation for z_i in a standard two-level occupancy model:

P(u_{ijk} = 1 \mid y_{ijk} = 0, z_{ij} = 1) = \\\frac{\theta_{ijk}\cdot\textrm{beta_binomial_lpmf}(0 \mid N_{ijk}, p_{ijk}, \phi_i)}{(1 - \theta_{ijk}) + \theta_{ijk}\cdot\textrm{beta_binomial_lpmf}(0 \mid N_{ijk}, p_{ijk}, \phi_i)} \\\\

For sites with no detections (Q_{ijk} = 0), we can predict both z_{ij} for the site, and u_{ijk} for each sample at the site. However, I am not sure what parameters I need to use to estimate these. Do I first have to calculate P(z_{ij} = 1 \mid \mathbf{y}_{ij} = 0, \mathbf{u}_{ij}), and then use this as P(z_{ij}) when calculating P(u_{ijk} = 1 \mid y_{ijk} = 0, z_{ij})? Or do I want to condition this on the unconditional probability of occupancy P(z_{ij}) = \psi_{ij}? The former would seem to condition twice on y = 0 - once for all y at the site level, and then another time for each sample - does that lead to valid inference?

Second, thinking about the data generatively, what we measure is the number of sequences belonging to each species in a sample, out of a total number of sequences obtained. To model this, the original model that I referenced when building this model ( uses a multinomial likelihood, where the probability of encountering each species in a sample is given by:

\begin{align} y_{ijk} &\sim \text{Multinomial}(N_{k}, p_{ijk}) \\ p_{ijk} &= \frac{u_{ijk}r_{ijk}}{\sum_{m=1}^I u_{mjk}r_{mjk}} \end{align}

Where r_{ijk} represents the relative dominance of a species within a sample, conditional on its presence. This allows the model to account for overdispersion in p_{ijk}, as this parameter is likely to vary depending on the presence or absence of other species in the sample (a species that is present at some very low absolute abundance in all samples would have a relative proportion of 1 if it was the only species in the sample!). Unfortunately the above sum results in combinatorial explosion if you try and marginalise it in this kind of data set with 1000s of species. Thus, I marginalised the above multinomial to a binomial for each species, and then modelled the proportions with a beta distribution (resulting in a beta-binomial) to capture overdispersion in counts resulting from these changes in relative abundance.

Assuming I can estimate the above u_{ijk} and z_{ij} for each sample, would it be valid to fix these in a new model (passing each set of samples as data) to estimate r_{ijk}? The models are not exactly equivalent, which gives me some pause - but it would be very valuable to make these joint predictions of all species at a fixed number of sequences, to account for variation in the number of sequences retrieved (ranging from 1,000 to 100,000 depending on the sample) and the resulting dropout of rare species as a result of under-sampling.

Let’s step through this. You have:

  log(1 - psi[i,j]),  // (1) this captures the z = 0 possibility
  log_sum_exp( // (2) this captures the z = 1 possibility
    log(psi[i,j]) + log(theta[i,j,k]) + observation_likelihood, // (2.1) this captures the u = 1 possibility
    log(1 - theta[i,j,k) // (2.2) this captures the u = 0 possibility

First things first, it looks to me like the line I’ve called (2.2) is missing a log(psi[i,j]) term. Alternatively, you could factor this term out in front of the log_sum_exp in line (2), which is probably more efficient and more clear for inference on z. However, for inference on u it’ll be helpful to keep this term distributed across lines 2.1 and 2.2.

The posterior probability for z is given by \frac{2}{1+2} where 1 and 2 refer to the exponentials of lines 1 and 2. The posterior probability for u is \frac{2.1}{1 + 2.1 + 2.2} = \frac{2.1}{1 + 2} . At the end of the day, un-marginalizing a marginalized discrete model is always just about summing up the likelihood that we are in any given state and then normalizing.

I’m not sure I can be much help on the second part, because I don’t have a feel for what approximations can be gotten away with in this flavor of model. How much info is being thrown out by pretending like the total number of reads is fixed in the data generating process? How squiffy is it to marginalize the multinomial to binomial and ignore the sum-to-N constraint? A couple of observations though. First r_{ijk} will be impossible to estimate when u_{ijk} is zero, or at least it’ll require some flavor of hierarchy. Second, you can approximate r_{ijk} with the binomial parameter from the beta-binomial, right? Inasmuch as this is an unsatisfactory approximation, then I think the beta-binomial model itself must be an unsatisfactory approximation to the model you really want to estimate. But I could well be missing something.

It’s possible I’ve mistranscribed something when translating this into log_sum_exp territory - looking at my original probability-scale marginalisation, the missing \psi_{ij} is there:

\begin{cases} \psi_{ij} \theta_{ijk}[y_{ijk} \mid \beta_i, \phi_i] &\text{if } Q_{ijk} = 2 &\\ \psi_{ij} \left( \theta_{ijk}[y_{ijk} \mid \beta_i, \phi_i] + (1 - \theta_{ijk}) \right) &\text{if } Q_{ijk} = 1 &\\ \psi_{ij} \left( \theta_{ijk}[y_{ijk} \mid \beta_i, \phi_i] + (1 - \theta_{ijk}) \right) + (1 - \psi_{ij}) &\text{if } Q_{ijk} = 0 \end{cases}

As there are multiple samples per site, does the posterior for z (where there are no detections) not include somewhere \prod_{k = 0}^{K}[y_{ijk} = 0 \mid u_{ijk}][u_{ijk}] for all K replicates? This is what motivates my question about estimating u_{ijk} for samples at sites with no detections - does a prediction for any one u_{ijk} also depend on the probabilities of all the other u_{ijk} through their influence on z_{ij}?

These are good questions. By marginalising the multinomial, I switch from estimating the probability that a sequence belongs to each species, to the probability that a sequence belongs to focal species X or not focal species X. In either model, the total N for each sample is the same. In that sense, I have a generative model for sequencing each species, but only in terms of X or not X, and can share information among species hierarchically. To me, the question then becomes whether the beta element of the beta-binomial accurately accounts for variation in the resulting proportions enough that the u_{ijk} are estimated well-enough to switch to the other model. If not, I’m mostly interested in making inference at the u and z levels in any case, so in that sense not being able to switch shouldn’t be too much of a problem.

Ah, makes sense! But I think that means that the marginalizations you gave aren’t what you need, for exactly the same reason, right? Or am I still missing something?

1 Like

I think you’re right - and I see what part of the marginalisation I got wrong. I suspect that when I take another look at this tomorrow, the answer to my question will be much more clear!

1 Like

So I fixed my marginalisation. Letting Q_{ij} = 1 if there is at least 1 detection of species i in a sample at site j, and R_{ijk} = 1 if y_{ijk} > 0:

[y_{ijk} \mid \psi_{ij}, \theta_{ijk}] = \begin{cases} \psi_{ij} \left( \prod_{k = 1}^{K} \theta_{ijk}[y_{ijk} \mid p_{ijk}, u_{ijk} = 1] + \\(1 - \theta_{ijk})I(R_{ijk} = 0) \right)&\text{if } Q_{ijk} = 1 &\\ \psi_{ij} \left( \prod_{k = 1}^{K} \theta_{ijk}[y_{ijk} \mid p_{ijk}, u_{ijk} = 1] + \\(1 - \theta_{ijk}) \right) + (1 - \psi_{ij}) &\text{if } Q_{ijk} = 0 \end{cases}

The equation for P(u_{ijm} = 1 \mid y_{ijm} = 0, z_{ij} = 1), for some replicate m, is the same as I posted above. Now, when Q_{ij} = 0, z_{ij} is:

P(z_{ij} = 1 \mid \mathbf{y}_{ij} = 0, \mathbf{u}_{ij}) =\\ \frac{\psi_{ij} \left( \prod_{k = 1}^{K} \theta_{ijk}[y_{ijk} \mid p_{ijk}, u_{ijk} = 1] + (1 - \theta_{ijk}) \right)}{\psi_{ij} \left( \prod_{k = 1}^{K} \theta_{ijk}[y_{ijk} \mid p_{ijk}, u_{ijk} = 1] + (1 - \theta_{ijk}) \right) + (1 - \psi_{ij}) }

To estimate P(u_{ijm} = 1 \mid y_{ijm} = 0) for some replicate m, I also need to condition on all the other \mathbf{u}_{ij} and \mathbf{y}_{ij}, excuding the focal replicate m (I don’t know how to write that in maths notation!): P(u_{ijm} = 1 \mid y_{ijm} = 0, \mathbf{y}_{ijk, k \neq m} = 0, \mathbf{u}_{ijk, k \neq m}, z_{ij}). That’s a 5-variable Bayes’ theorem expansion at a first look and I’m a bit scared to work it out at the moment - I suspect it can be simplified by substituting a variable for \mathbf{y}_{ijk, k \neq m}, \mathbf{u}_{ijk, k \neq m}, but that’s for another time.

In the mean time, I want to try and implement the above likelihood in Stan. I think the easiest (for conceptual clarity!) way to do this will be to pass an array with the format [y, N, R, site_index, species_index], and slice this by site and species to do the calculations.

However, it’s not quite clear to me how log_sum_exp interacts with a product sign like the one above. I appreciate that the product transforms to a summation, but I’m not sure where the log_sum_exp goes - am I doing this inside or outside of the product?

1 Like

So I thought I had it, but I’m having trouble with the model fits, and I’m not sure if I have missed something when implementing the model, or if there’s some fundamental inidentifiability in the data.

functions {
  real sample_detection_lpmf(int Y, int N, real p, real theta){
    return bernoulli_logit_lpmf(1 | theta) +
      binomial_logit_lpmf(Y | N, p);
  real sample_nondetection_lpmf(int Y, int N, real p, real theta) {
    return log_sum_exp(
      sample_detection_lpmf(Y | N, theta, p), bernoulli_logit_lpmf(0 | theta));
  real sample_occupancy_sum_lpmf(array[,] int Y, vector p, vector theta) {
    int K = size(Y);
    real temp_target = 0;
    for(k in 1:K) {
      if(Y[k, 1] > 0) {
        temp_target += sample_detection_lpmf(Y[k, 1] | Y[k, 2], p[k], theta[k]);
      } else {
        temp_target += sample_nondetection_lpmf(Y[k, 1] | Y[k, 2], p[k], theta[k]);
data {
  int N_sites;        // Number of sites
  int N_replicates;   // Number of replicates
  array[N_replicates, 2] int Y;  // Raw data [Replicate][Y, N]
  array[N_sites] int Q;
  // Different number of replicates per site, so the input replicates are passed in site
  // order - then we can access them per site by looping over the group sizes
  array[N_sites] int rep_group_sizes;

  // Site occupancy predictors
  int N_pred_occ;                    // Number of predictors
  matrix[N_sites, N_pred_occ] X_occ; // Predictor matrix for occupancy
  // Replicate occupancy predictors
  int N_pred_rep;
  matrix[N_replicates, N_pred_rep] X_rep;

  // Read abundance predictors
  int N_pred_reads;
  matrix[N_replicates, N_pred_reads] X_reads;
parameters {
  // Site occupancy parameters
  vector[N_pred_occ] b_occ;

  // Sample occupancy parameters
  vector[N_pred_rep] b_rep;

  // Read abundance parameters
  vector[N_pred_reads] b_reads;
model {
  // Priors
  // Site occupancy
  target += normal_lpdf(b_occ | 0, 3);

  // Sample occupancy
  target += normal_lpdf(b_rep | 0, 3);

  // Read abundance 
  target += normal_lpdf(b_reads | 0, 3);

  vector[N_replicates] p = X_reads * b_reads;
  vector[N_replicates] theta = X_rep * b_rep;
  vector[N_sites] psi = X_occ * b_occ;

  int start=1; //replicates are ordered by site - starting at replicate 1
  for(j in 1:N_sites) {
    int end = start + rep_group_sizes[j] - 1; // add number of reps for site j and subtract 1

    if(Q[j] == 0) {
      target += bernoulli_logit_lpmf(1 | psi[j]) + 
          sample_occupancy_sum_lpmf(Y[start:end,] | p[start:end], theta[start:end]);
    if(Q[j] == 1) {
      target += log_sum_exp(bernoulli_logit_lpmf(1 | psi[j]) + 
          sample_occupancy_sum_lpmf(Y[start:end,] | p[start:end], theta[start:end]),
          bernoulli_logit_lpmf(0 | psi[j]));
    start += rep_group_sizes[j]; // update start point for next site
generated quantities {
  vector[N_replicates] p = X_reads * b_reads;
  vector[N_replicates] theta = X_rep * b_rep;
  vector[N_sites] psi = X_occ * b_occ;

Running the model for a single species, it seems to happily predict \psi for some range of predictors, but it struggles to properly identify \theta and p - for the former, it pretty much unilaterally predicts a posterior probability of occupancy close to zero, and for the latter, the probability of a read being classified as belonging to the species as close to one (x axes on the probability scale, y axes = predicted parameter for each site/replicate):

For reference, the species I tested this on is detected at 5 of 30 sites, and within sites with detections, is naively found in about 6 - 20% of samples at each site. Within each sample with detections, it is present at a naive rate of about 0.003 - 0.25% of sequencing reads returned. However, this is a problem present with all species - fitting the same model hierarchically for all species, all species appear to display this pattern.

I’m really not sure what I’m missing here - I have a suspicion that there’s something wrong with my likelihood, either my implementation in Stan or in the maths above, but I’ve been thinking about it all day and can’t quite figure out what it might be. However, the problem seems to be described pretty exactly by @jsocolar here: How to generate posterior predictive distribution for hierarchical occupancy model in Stan? - #4 by jsocolar. Anyone have any idea what I’ve missed here?

I repeated the above analysis using a simulation, to the same effect: I am definitely sure there’s a problem with my implementation, but I can’t see at all what I’ve missed.

sim_occ_int_1spp <- function(psi, theta, p, N_sites, N_samples, library_size){
  data <- tibble(Site = rep(1:N_sites, each = N_samples),
                 Z = rep(rbinom(N_sites, 1, psi), each = N_samples)) |> 
    rowwise() |> 
    mutate(U = rbinom(1, 1, theta * Z),
           Total = rnbinom(1, mu = library_size, size = 3)) |> 
    mutate(Count = rbinom(1, Total, p * U)) |> 
  Y <- data |> 
    select(Count, Total) |> 
  X_occ <- matrix(1, ncol = 1, nrow = N_sites)
  X_rep <- matrix(1, ncol = 1, nrow = N_sites * N_samples)
  X_reads <- matrix(1, ncol = 1, nrow = N_sites * N_samples)
  Q <- Y_df |> 
    summarise(Q = ifelse(any(Count > 1), 1, 0), .by = Site) |> 
  rep_group_sizes <- rep(N_samples, N_sites)
  stan_data <- list(
        N_sites = N_sites,
        N_replicates = N_sites * N_samples,
        Y = Y,
        Q = Q,
        rep_group_sizes = rep_group_sizes,
        N_pred_occ = ncol(X_occ),
        X_occ = X_occ,
        N_pred_rep = ncol(X_rep),
        X_rep = X_rep,
        N_pred_reads = ncol(X_reads),
        X_reads = X_reads

sim_data <- sim_occ_int_1spp(
    psi = 0.6,
    theta = 0.5,
    p = 0.02,
    N_sites = 30,
    N_samples = 5,
    library_size = 40000

mod <- cmdstanr::cmdstan_model("model.stan")
mod$sample(data = sim_data, chains = 4, parallel_chains = 4)

Assuming a single site occupancy \psi of 0.6, a sample occupancy \theta of 0.5, and a read detection rate p of 0.2:


Edit: found the bug… I transposed the conditions for Q in the likelihood, so the likelihood for Q=1 ran the code for Q=0 and vice versa… 🤦‍♂️

1 Like

Just wanted to pick back up on this point in case anyone comes across this in the future. The above likelihood is actually implementable in Stan without combinatorial explosion, using the fact that a multinomial process can be implemented by calculating separate Poisson rates for each species. For j species, the multinomial probability for species i can be estimated as:

p_{i} = \frac{\lambda_i}{\sum_j \lambda_j}

Where each \lambda is a Poisson rate parameter. For the above likelihood, this means that one can estimate each u_{ijk}r_{ijk} using a Poisson likelihood, and recompose the multinomial form using the above equation in the generated quantities to recapitulate the true probabilities with a sum-to-total constraint.

(whether this will fit well to my data is yet to be seen!)