Conditional posterior predictions for three-level occupancy model

prototaxites · August 11, 2023, 4:18pm

So I thought I had it, but I’m having trouble with the model fits, and I’m not sure if I have missed something when implementing the model, or if there’s some fundamental inidentifiability in the data.

functions {
  real sample_detection_lpmf(int Y, int N, real p, real theta){
    return bernoulli_logit_lpmf(1 | theta) +
      binomial_logit_lpmf(Y | N, p);
  }
  
  real sample_nondetection_lpmf(int Y, int N, real p, real theta) {
    return log_sum_exp(
      sample_detection_lpmf(Y | N, theta, p), bernoulli_logit_lpmf(0 | theta));
  }
  
  real sample_occupancy_sum_lpmf(array[,] int Y, vector p, vector theta) {
    int K = size(Y);
    real temp_target = 0;
    
    for(k in 1:K) {
      if(Y[k, 1] > 0) {
        temp_target += sample_detection_lpmf(Y[k, 1] | Y[k, 2], p[k], theta[k]);
      } else {
        temp_target += sample_nondetection_lpmf(Y[k, 1] | Y[k, 2], p[k], theta[k]);
      }
    }
    
    return(temp_target);
  }
}
data {
  int N_sites;        // Number of sites
  int N_replicates;   // Number of replicates
  array[N_replicates, 2] int Y;  // Raw data [Replicate][Y, N]
  array[N_sites] int Q;
  
  // Different number of replicates per site, so the input replicates are passed in site
  // order - then we can access them per site by looping over the group sizes
  array[N_sites] int rep_group_sizes;

  // Site occupancy predictors
  int N_pred_occ;                    // Number of predictors
  matrix[N_sites, N_pred_occ] X_occ; // Predictor matrix for occupancy
  
  // Replicate occupancy predictors
  int N_pred_rep;
  matrix[N_replicates, N_pred_rep] X_rep;

  // Read abundance predictors
  int N_pred_reads;
  matrix[N_replicates, N_pred_reads] X_reads;
}
parameters {
  // Site occupancy parameters
  vector[N_pred_occ] b_occ;

  // Sample occupancy parameters
  vector[N_pred_rep] b_rep;

  // Read abundance parameters
  vector[N_pred_reads] b_reads;
}
model {
  // Priors
  // Site occupancy
  target += normal_lpdf(b_occ | 0, 3);

  // Sample occupancy
  target += normal_lpdf(b_rep | 0, 3);

  // Read abundance 
  target += normal_lpdf(b_reads | 0, 3);

  vector[N_replicates] p = X_reads * b_reads;
  vector[N_replicates] theta = X_rep * b_rep;
  vector[N_sites] psi = X_occ * b_occ;

  int start=1; //replicates are ordered by site - starting at replicate 1
  for(j in 1:N_sites) {
    int end = start + rep_group_sizes[j] - 1; // add number of reps for site j and subtract 1

    if(Q[j] == 0) {
      target += bernoulli_logit_lpmf(1 | psi[j]) + 
          sample_occupancy_sum_lpmf(Y[start:end,] | p[start:end], theta[start:end]);
    }
    if(Q[j] == 1) {
      target += log_sum_exp(bernoulli_logit_lpmf(1 | psi[j]) + 
          sample_occupancy_sum_lpmf(Y[start:end,] | p[start:end], theta[start:end]),
          bernoulli_logit_lpmf(0 | psi[j]));
    }
    start += rep_group_sizes[j]; // update start point for next site
  }
}
generated quantities {
  vector[N_replicates] p = X_reads * b_reads;
  vector[N_replicates] theta = X_rep * b_rep;
  vector[N_sites] psi = X_occ * b_occ;
}

Running the model for a single species, it seems to happily predict \psi for some range of predictors, but it struggles to properly identify \theta and p - for the former, it pretty much unilaterally predicts a posterior probability of occupancy close to zero, and for the latter, the probability of a read being classified as belonging to the species as close to one (x axes on the probability scale, y axes = predicted parameter for each site/replicate):

For reference, the species I tested this on is detected at 5 of 30 sites, and within sites with detections, is naively found in about 6 - 20% of samples at each site. Within each sample with detections, it is present at a naive rate of about 0.003 - 0.25% of sequencing reads returned. However, this is a problem present with all species - fitting the same model hierarchically for all species, all species appear to display this pattern.

I’m really not sure what I’m missing here - I have a suspicion that there’s something wrong with my likelihood, either my implementation in Stan or in the maths above, but I’ve been thinking about it all day and can’t quite figure out what it might be. However, the problem seems to be described pretty exactly by @jsocolar here: How to generate posterior predictive distribution for hierarchical occupancy model in Stan? - #4 by jsocolar. Anyone have any idea what I’ve missed here?

Topic		Replies	Views
Posterior predictive distribution for occupancy models General ecology	9	1686	August 23, 2021
Multimodality in hierarchical population parameters Modeling fitting-issues , specification	16	552	April 30, 2024
Reduce_sum fails to start sampling Modeling performance , paralellization	9	496	August 2, 2023
Blog posts on ecological modeling in Stan Modeling	6	237	February 11, 2025
Multi-scale/Staggered Entry Occupancy model as HMM Modeling ecology , hmm	5	1434	April 16, 2020

Conditional posterior predictions for three-level occupancy model

Related topics