Multi-species occupancy with predictors

Hello everyone,

Context
I’m attempting to build on Bob Carpenter’s implementation of Dorazio et al’s (2006) multi-species occupancy model. I would like to swap the latent predictors influencing detection and occupancy, for some site- and landscape-level predictors.

Background
I have records for 56 bird species at 101 sites, each site was surveyed 3 times in a single season. I have 16 site and landscape predictors, some linear, some interactions, that will likely influence occupancy. These are the same predictors used in a previous multi-species occupancy model that did not include detection history. I also hope to incorporate predictors for detection, principally Julian day of the survey. The only way I can conceive of doing this is by converting the Species x Site matrix used in the MSOM to a vector of presence-absence, with one row per species per survey.

In the model linked to my previous forum post above, @andrjohns gave some invaluable assistance, converting the multiple predictors into a single matrix and using non-centred parameterisation. I hoped to combine the MSOM model and the No-detection-history model, but I can’t get the code to work. I’m not sure how to combine the matrix of predictors with logit_psi; I’m getting a dimension mismatch. The MSOM model defines this as a vector, but I suspect it needs to be a matrix? Or I need to combine all the individual predictors into a single value somehow?

I don’t have enough Stan experience to work out where I’m going wrong. If anyone is able to help me resolve this, or notices any other errors, I would really appreciate some help, thanks.

Code

// MSOM
// Adapted from https://mc-stan.org/users/documentation/case-studies/dorazio-royle-occupancy.html 
// and https://discourse.mc-stan.org/t/multi-species-occupancy-no-detection-history/27274

// Define some helper functions
// Mostly unchanged from Bob Carpenter's MSOM, except 
// 'lp_observed' has 'Occ' swapped for 'x', and 'bernoulli_lpmf' 
// in place of binomial to account for only single trial.
functions {
  matrix cov_matrix_2d(vector sigma, real rho) {
    matrix[2,2] Sigma;
    Sigma[1,1] = square(sigma[1]);
    Sigma[2,2] = square(sigma[2]);
    Sigma[1,2] = sigma[1] * sigma[2] * rho;
    Sigma[2,1] = Sigma[1,2];
    return Sigma;
  }

  real lp_observed(int Occ, int K, real logit_psi, real logit_theta) {
    return log_inv_logit(logit_psi)
      + bernoulli_lpmf(Occ, logit_theta);
  }
  
 // Do I need to alter this as well? K no longer appropriate? Only 1 trial not 3.
  real lp_unobserved(int K, real logit_psi, real logit_theta) {
    return log_sum_exp(lp_observed(0, K, logit_psi, logit_theta),
                       log1m_inv_logit(logit_psi));
  }

 // And change K here as well?
  real lp_never_observed(int J, int K, real logit_psi, real logit_theta, real Omega) {
      real lp_unavailable = bernoulli_lpmf(0 | Omega);
      real lp_available = bernoulli_lpmf(1 | Omega)
        + J * lp_unobserved(K, logit_psi, logit_theta);
      return log_sum_exp(lp_unavailable, lp_available);
    }
}


data {
  int N_sp; // Number of species (56)
  int J; // Number of sites (101)
  int L; // Number of occupancy predictors (16)
  int<lower=1> K;  // Number of surveys per site (3)
  int N;  // Number of rows in data (N_sp x J x K = 16968)
  int sp [N]; // Vector length N of species ID codes
  int<lower=0, upper=1> Occ [N]; // Vector length N of 0 - 1 occupancy
  matrix[L, N] Mat; // Matrix of occupancy predictors
  int<lower=N_sp> S;  // Superpopulation size
  int Jday [N]; // Vector length N of survey date as Julian day
}


parameters {
  // These are unchanged from the MSOM
  real alpha;  //  site-level occupancy
  real beta;   //  site-level detection
  real<lower=0, upper=1> Omega;  // availability of a species
  real<lower=-1,upper=1> rho_uv;  // correlation of (occupancy, detection)
  vector<lower=0>[2] sigma_uv;    // sd of (occupancy, detection)
  vector[2] uv[S];                // species-level (occupancy, detection)
  
  // This is new, based on @andrjohn's assistance.
  // Estimate species coefficients as a single matrix also
  // with non-centered parameterisation.
  matrix<offset=alpha, multiplier=sigma_uv[1]>[L, N_sp] alpha_slopes;
}

transformed parameters {
  // New
  // Combine the intercept and separate slopes into
  // a single object psi
  row_vector[N] psi = columns_dot_product(Mat, alpha_slopes[:,sp]);
   
  // Unchanged  
  // Defines the logit-scaled versions of psi (occupancy) 
  // and theta (detection) as vectors. 
  // ** Should these be matrices? **
  vector[S] logit_psi;    // log odds of occurrence
  vector[S] logit_theta;  // log odds of detection
  
  // Swapped single parameter 'alpha' for multiple 
  // predictors 'psi'
  for (i in 1:S)
    logit_psi[i] = uv[i,1] + psi;
  
  // Added the Julian day slope term
  for (i in 1:S)
    logit_theta[i] = uv[i,2] + beta * JDay;
}


model {
  // Priors. Unchanged.
  alpha ~ cauchy(0, 2.5);
  beta ~ cauchy(0, 2.5);
  sigma_uv ~ cauchy(0, 2.5);
  (rho_uv + 1) / 2 ~ beta(2, 2);
  uv ~ multi_normal(rep_vector(0, 2), cov_matrix_2d(sigma_uv, rho_uv));
  Omega ~ beta(2,2);
  
  // New 
  // Normal prior for the combined slopes.
  // Need to use `to_vector()` on `alpha_slopes` as the 
  // normal likelihood doesn't have a signature for matrices.
  to_vector(alpha_slopes) ~ normal(alpha, sigma_uv);
  

  // Likelihood. Unchanged
  for (i in 1:N_sp) {
    1 ~ bernoulli(Omega); // observed, so available
      if (Occ[i] > 0)
        target += lp_observed(Occ[i], K, logit_psi[i], logit_theta[i]);
      else
        target += lp_unobserved(K, logit_psi[i], logit_theta[i]);
    }
  for (i in (N_sp + 1):S)
    target += lp_never_observed(J, K, logit_psi[i], logit_theta[i], Omega);
}


2 Likes

@jsocolar

1 Like

Hi @TBL,
For learning about how to code these models in Stan, I strongly recommend beginning with a version of the MSOM that does not involve data-augmentation with never-observed species, and thus does not involve the \Omega parameter. Without data-augmentation, the multispecies likelihood is identical to the single-species likelihood; the models differ only in their random effect structure. I wrote up a tutorial on how to code this model up here, and I think it will be clarifying for you:
https://jsocolar.github.io/marginalizedOccupancy/

Once you have this working, you might think about how to re-incorporate the data-augmented structure. However, a word of warning here: data augmentation often behaves weirdly in the presence of covariates on occupancy. One facet of the problem is this: if the random effect distributions for the species-specific slopes have even moderately large standard deviations, then the model can pack in arbitrary numbers of species in the meta-community by estimating covariate relationships such that the species are unlikely to occur on any point in your dataset. This is particularly true if you allow quadratic relationships between occupancy and covariates. (Edit: to expand on this point because it’s important, the data-augmented metacommunity model is trying to estimate the number of species that occur over some population of sites that is larger than just the sites that were actually sampled. Thus, unless you’re willing to treat the site covariates as random, the model lacks sufficient information about the characteristics of sites in the population to make an informed inference.)

Another facet of the problem is that it becomes impossible to incorporate species traits into the occupancy or detection sub-models (because we don’t have trait values for the never-observed species), but traits and trait-by-environment interactions are often highly predictive in MSOMs.

With that said, I did just push code to estimate exactly your model to the more-models branch of R package flocker. GitHub - jsocolar/flocker at more-models

This code is not well tested, but I think it should work to enable estimation of the model you seek using brms formula syntax (again, I don’t necessarily recommend fitting this model with covariates on occupancy; I wrote this code primarily to enable fitting the model with covariates on detection. If you do fit the model with covariates on occupancy, it is possible that the point-specific species richness estimates will be meaningful, but it is certain that the metacommunity-level richness estimate will not be meaningful). The new function flocker_stancode will output the generated Stan code so you can see what’s going on. The documentation of the functions flock and make_flocker_data should get you started on the right track. You may well encounter errors or bugs, and if so I’ll be happy to troubleshoot them and push fixes. This branch is experimental and is not ready for release.

Here’s some example code that should fit the data-augmented model to toy data with site- and visit-specific covariates. Note that the more-models branch requires the use of the cmdstanr backend; the default backend = "rstan" will error. (Edit: unless you install rstan from the experimental branch on GitHub)

library(flocker) # flocker must be installed from the more-models branch

# Generate example data
d <- example_flocker_data()

# Reformat example data for the data-augmented model input format
obs <- d$obs[d$unit_covs$species == "sp_1", ]

for(i in unique(d$unit_covs$species[d$unit_covs$species != "sp_1"])){
  obs <- abind::abind(obs, d$obs[d$unit_covs$species == i, ], along = 3)
}

unit_covs <- d$unit_covs[d$unit_covs$species == "sp_1", c("uc1", "uc2")]
event_covs <- list()
for(i in seq_along(d$event_covs)){
  event_covs[[i]] <- d$event_covs[[i]][d$unit_covs$species == "sp_1", ]
}
names(event_covs) <- names(d$event_covs)

# Run the flocker pipeline
fd <- make_flocker_data(
  obs, unit_covs, event_covs, 
  type = "augmented", n_aug = 100)


mod <- flock(~ uc1, ~ uc1 + ec1, 
             fd, augmented = T,
             backend = "cmdstanr"
             )