Hi all,
I’m trying to implement the multivariate probit occupancy model in Stan (see Tobler et al. 2019 and Dorazio et al 2025). Occupancy models estimate site occupancy, whether species s \in 1:S occur at sites i \in 1:I, using repeated surveys at each site j \in 1:J_i. If the species is detected at least once, we know for sure the site is occupied by the species (z_{is} = 1), but if we don’t detect the species, we’re actually not sure if the site is occupied, and in Stan we have to marginalise over the two possibilities. In essense, multispecies occupancy models have multivariate counts (how many of each species were detected during each survey y_{ijs}), with zero-inflation modeled at the level of the site.
The classical logistic occupancy models with Poisson observation model would be:
where \psi are occupancy probabilities modeled at the level of site and species, and y_{ijs} are counts observed during each survey and \mu_{ijs} are the detection rates. When z_{is} = 0, no detections are possible and thus y_{ijs} = 0 \forall j. In Stan (pseudo-ish) code, this model marginalises over the two possible occupancy states:
for (i in 1:I) {
for (s in 1:S) {
// certainly occupied
if (sum(y[i, :, s] > 0) {
target += log(psi[s, i]) + poisson_lpmf(y[i, :, s] | mu[s, i]);
// possibly occupied
} else {
target += log_sum_exp(log1m(psi[s, i]),
log(psi[s, i]) + poisson_lpmf(y[i, :, s] | mu[s, i]));
}
}
}
A related occupancy model is the multivariate probit, where some unobserved latent vector \boldsymbol{u}_i = \left(u_{i1}, \dots, u_{iS} \right)' is modeled as:
where \boldsymbol{X} and \boldsymbol{\beta} are predictors and coefficients, respectively, and \boldsymbol{R} is a correlation matrix. The species-level correlations of site occupancy are of primary interest here. Occupancy state is deterministic as z_{is} = \mathbb{I} \left(u_{is} > 0 \right).
Now, I’m not sure how this would look in Stan. I believe this model is basically a copula model on the latent occupancy state, but I’m not sure how to do this (I can barely get my head around copulas when the data are directly observed). The Stan User’s Guide has a good section on multivariate probit regression; however, I’m not sure how to adapt this to a latent binary outcome (occupancy).
A “simpler” way to do this would be to stick with the first logistic model and add correlated residuals to the logits:
data {
int I, S, N; // number of sites, species, and predictors
matrix[N, I]; // predictors
}
parameters {
matrix[S, N] beta; // regression coefficients
cholesky_factor_corr[S] R; // species correlations
matrix[S, I] z; // standard normal variates
}
transformed parameters {
matrix[S, I] logit_psi = beta * X + R * z;
}
model {
beta ~ some_prior();
to_vector(z) ~ std_normal_lpdf();
}
And then proceed with the logistic occupancy model. This is how I’ve currently been handling correlated residuals, as it’s much faster than the copula approach (but I think there are issues). However, I’d like to learn how to fit the probit model in Stan.
I might ping a few resident occupancy and copula experts: @jsocolar, @spinkney, @Dalton, @andrjohns.
Thanks for reading,
Matt