Multivariate probit occupancy models

Hi all,

I’m trying to implement the multivariate probit occupancy model in Stan (see Tobler et al. 2019 and Dorazio et al 2025). Occupancy models estimate site occupancy, whether species s \in 1:S occur at sites i \in 1:I, using repeated surveys at each site j \in 1:J_i. If the species is detected at least once, we know for sure the site is occupied by the species (z_{is} = 1), but if we don’t detect the species, we’re actually not sure if the site is occupied, and in Stan we have to marginalise over the two possibilities. In essense, multispecies occupancy models have multivariate counts (how many of each species were detected during each survey y_{ijs}), with zero-inflation modeled at the level of the site.

The classical logistic occupancy models with Poisson observation model would be:

\begin{aligned} z_{is} &\sim \mathrm{Bernoulli} \left( \psi_{is} \right) \\ y_{ijs} &\sim \mathrm{Poisson} \left( z_{is} \cdot \mu_{ijs} \right), \end{aligned}

where \psi are occupancy probabilities modeled at the level of site and species, and y_{ijs} are counts observed during each survey and \mu_{ijs} are the detection rates. When z_{is} = 0, no detections are possible and thus y_{ijs} = 0 \forall j. In Stan (pseudo-ish) code, this model marginalises over the two possible occupancy states:

for (i in 1:I) {
  for (s in 1:S) {
    // certainly occupied
    if (sum(y[i, :, s] > 0) {
      target += log(psi[s, i]) + poisson_lpmf(y[i, :, s] | mu[s, i]);
      // possibly occupied
    } else {
      target += log_sum_exp(log1m(psi[s, i]),
                            log(psi[s, i]) + poisson_lpmf(y[i, :, s] | mu[s, i]));
    }
  }
}

A related occupancy model is the multivariate probit, where some unobserved latent vector \boldsymbol{u}_i = \left(u_{i1}, \dots, u_{iS} \right)' is modeled as:

\boldsymbol{u}_i \sim \mathrm{Normal} \left(\boldsymbol{X}_i \boldsymbol{\beta}, \boldsymbol{R} \right),

where \boldsymbol{X} and \boldsymbol{\beta} are predictors and coefficients, respectively, and \boldsymbol{R} is a correlation matrix. The species-level correlations of site occupancy are of primary interest here. Occupancy state is deterministic as z_{is} = \mathbb{I} \left(u_{is} > 0 \right).

Now, I’m not sure how this would look in Stan. I believe this model is basically a copula model on the latent occupancy state, but I’m not sure how to do this (I can barely get my head around copulas when the data are directly observed). The Stan User’s Guide has a good section on multivariate probit regression; however, I’m not sure how to adapt this to a latent binary outcome (occupancy).

A “simpler” way to do this would be to stick with the first logistic model and add correlated residuals to the logits:

data {
  int I, S, N;  // number of sites, species, and predictors
  matrix[N, I];  // predictors
}
parameters {
  matrix[S, N] beta;  // regression coefficients
  cholesky_factor_corr[S] R;  // species correlations
  matrix[S, I] z;  // standard normal variates
}
transformed parameters {
  matrix[S, I] logit_psi = beta * X + R * z;
}
model {
  beta ~ some_prior();
  to_vector(z) ~ std_normal_lpdf();
}

And then proceed with the logistic occupancy model. This is how I’ve currently been handling correlated residuals, as it’s much faster than the copula approach (but I think there are issues). However, I’d like to learn how to fit the probit model in Stan.

I might ping a few resident occupancy and copula experts: @jsocolar, @spinkney, @Dalton, @andrjohns.

Thanks for reading,

Matt

1 Like

How many species do you have? Maybe I’m being uncreative, but I don’t see a good way to use the Albert-Chib trick when the binary state is latent and unknown–I think this will require evaluating a multivariate normal CDF for each site, where the MVN has dimensionality equal to the number of species. Whether this is computationally feasible will depend crucially on the number of species.

Latent factor models, often referred to in the ecological literature as Joint Models or JSDMS, can be thought of as reduced-rank approximations to the full covariance matrix over species. Despite some problems with identifiability (some factor loadings need to be fixed–at least their signs need to be fixed–in order to fix the orientation of the coordinate system defined by the latent factors), these have been successfully fit in Stan. Separately, I know some people have successfully piggy-backed occupancy model likelihoods onto the linear predictors defined by these models. This might be a useful alternative for modeling between-species covariances when you don’t have the firepower to compute high-dimensional MVN cdfs and/or when you don’t have enough data to estimate the full-rank covariance matrix.

I usually have 10–20 species. Do you think the multivariate probit is just difficult to fit in Stan, or more generally? Because the JAGS/NIMBLE code is quite straightforward.

I think this should be quite workable with 10 species. With 20, it might get out of hand, unless detection rates are moderately high (or higher), or unless there’s a nifty trick to apply that I don’t know about. The last MSOM I fit had 1614 species, so in general I get nervous about full-rank covariance matrices in SDMs!

Instead of using the Albert-Chib trick given in the users guide, note that you can use the multivariate normal CDF to explicitly compute the probability of each of the 2^S joint occupancy states for each site, conditional on the MVN parameters. Then marginalize over these. Further note that for each species detected at a site, the number of states you need to marginalize over drops by half. So really the number of terms in the marginalization is 2^{S_n}, where S_n is the number of undetected species at the site.

Just for illustration, with my 1614 species, there would be roughly 10^{486} terms in this marginalization at a site where no species was detected! When you fit something like this in JAGS or Nimble, the sampler is guaranteed to never visit the overwhelming majority of the possible states. Convergence in such a context is potentially a bit of a gamble, though these tools seem to work ok in practice as far as I can tell.

1 Like

@CerulloE has worked a lot on these models. He has a wip where he has put a lot of effort into fitting multivariate probit models fast with HMC.

He has a wip where he has put a lot of effort into fitting multivariate probit models fast with HMC.

These models are different–they assume the binary outcome is observed (in the latent class MVP mentioned on this repo the classes are latent but the outcome is still observed). In @mhollanders case, the binary outcome is itself latent. Unless I’m missing something important, this requires a different modeling paradigm.

Thanks for your thoughts everyone. I think I’m probably going to leave it for now. I had a go at adapting @spinkney’s copula code but I don’t really know how to do it for a marginalised random variable. Thanks anyway!