Hi all,
I’ve been struggling with how to formulate a capture-recapture model that suits my purposes and hope the Stan community can steer me towards a solution. The data I have were collected from several populations (of fish) that occupy different locations. The nature of my data (and the question I want to answer) is such that I’m willing to treat each of these locations as separate and discrete with no individuals entering or exiting each location during the sampling, i.e. each (sub)population is closed. I would like an estimate of how many individuals occupy each of those locations, which means that my model needs some additional wrinkles beyond the most readily available Stan capture-recapture models. In other words, I want to estimate the individual (sub)population sizes,
N_1, N_2, \ldots, N_k,
rather than
N = \sum_{i=1}^{k} N_i.
The closest formulation I’ve seen occurs in this forum post, Capture-Recapture model with partial or complete pooling. The Stan code linked in that topic treats N as a (potentially) multi-dimensional array and the model statement loops over the different indices. That’s a valid approach, but seems a bit inefficient and I’d rather be able to pass the population variables into the model using the perhaps more familiar (G)LM-type approach with a design matrix and vector(s) of parameters for each column in the design matrix. Eventually, I’d like to incorporate partial pooling, but that’s more of a wish list item at this point and I digress…
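To make the design-matrix idea concrete, here's roughly what I'm picturing for the detection part, ignoring the data-augmentation machinery for the moment (`X`, `beta`, and `n_col` are placeholder names of my own, and this fragment is untested):

```stan
data {
  int<lower = 1> n_ind;                           // individuals
  int<lower = 1> n_occasion;                      // sampling occasions
  int<lower = 1> n_col;                           // columns of the design matrix
  matrix[n_ind, n_col] X;                         // population indicators / covariates
  int<lower = 0, upper = 1> y[n_ind, n_occasion]; // capture histories
}
parameters {
  vector[n_col] beta;                             // one coefficient per column
}
model {
  vector[n_ind] logit_p = X * beta;               // detection on the logit scale
  beta ~ normal(0, 2.5);
  for (i in 1:n_ind)
    y[i] ~ bernoulli_logit(logit_p[i]);
}
```

The population indicators would just be columns of `X`, which is part of why I suspect partial pooling would slot in naturally later.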
I decided to build the model from the ground up, drawing upon a couple of excellent resources, namely Hiroki Ito’s Stan translations of the Bayesian Population Analysis book and @mbjoseph’s translations of Spatial Capture-Recapture, especially the model that treats sex as an individual covariate. To begin, I created a simple simulation and model with 2 populations of different sizes and different detection probabilities.
I took a data augmentation approach to model fitting, which introduces a challenge for estimating the (sub)population sizes: individuals that were never captured but that belong to a (sub)population still need to be assigned to one. With only 2 populations, I was able to borrow the approach used for sex as an individual covariate, substituting (sub)population for sex. The model fits the simulated data reasonably well, with some bias in the (sub)population estimates that is typical for this number of individuals, sampling events, and detection probabilities. So it seems like the model works!
However, I can quickly imagine this approach getting out of hand, since one would need a slew of (nested?) calls to `log_mix` to cover all the (pairs of) populations. My data include 24 locations with (sub)populations, so repeated calls to `log_mix` feel like some combination of obnoxious and intractable, at least given the implementation below.
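What I'm hoping is possible instead is to marginalize each augmented individual's membership over all K populations in one shot with `log_sum_exp`, something like the rough, untested sketch below (`K`, `theta`, and `logit_p` are names I made up, and it assumes the observed individuals occupy the first n_obs rows of `y`):

```stan
data {
  int<lower = 1> K;                               // number of (sub)populations
  int<lower = 1> n_ind;                           // size of augmented data set
  int<lower = 0> n_obs;                           // observed individuals come first
  int<lower = 1, upper = K> pop[n_obs];           // population of observed individuals
  int<lower = 0> n_occasion;
  int<lower = 0, upper = 1> y[n_ind, n_occasion];
}
parameters {
  real logit_in_pop;                              // logit of inclusion probability
  simplex[K] theta;                               // membership probabilities
  vector[K] logit_p;                              // per-population detection (logit scale)
}
model {
  logit_in_pop ~ normal(0, 2.5);
  logit_p ~ normal(0, 2.5);
  for (i in 1:n_ind) {
    if (i <= n_obs) {
      // observed: present, and membership known
      target += log_inv_logit(logit_in_pop)
                + log(theta[pop[i]])
                + bernoulli_logit_lpmf(y[i] | logit_p[pop[i]]);
    } else {
      // augmented: marginalize over membership, then over presence
      vector[K] lp;
      for (k in 1:K)
        lp[k] = log(theta[k]) + bernoulli_logit_lpmf(y[i] | logit_p[k]);
      target += log_sum_exp(log_inv_logit(logit_in_pop) + log_sum_exp(lp),
                            log1m_inv_logit(logit_in_pop));
    }
  }
}
```

The (sub)population sizes N_1, …, N_K would then come from a generated quantities block analogous to the one in the full model below, drawing membership for the augmented individuals from the posterior mixture weights. I'm not at all sure I have the marginalization right, though.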
Any tips, tricks, suggestions and improvements are welcome. I’m definitely in the steep part of the learning curve in building models in Stan and am happy to draw upon the collective wisdom of the crowd here!
Thanks!
The model code is:
```stan
data {
  int<lower = 1> n_ind;                 // Size of augmented data set
  int<lower = 0> n_aug;                 // Number of augmented individuals
  int<lower = 0> n_obs;                 // Size of observed data set
  int<lower = 1, upper = 2> pop[n_obs]; // Population 1 or 2
  int<lower = 0> n_occasion;            // Number of sampling occasions
  int<lower = 0, upper = 1> y[n_ind, n_occasion]; // Capture-history matrix
}
transformed data {
  int<lower = 0, upper = 1> observed[n_ind];
  int<lower = 0, upper = 1> is_pop2[n_obs]; // 1 if observed individual is from pop. 2
  // NB: assumes the observed individuals occupy the first n_obs rows of y
  for (i in 1:n_ind) {
    if (sum(y[i]) > 0) {
      observed[i] = 1;
    } else {
      observed[i] = 0;
    }
  }
  for (i in 1:n_obs) {
    is_pop2[i] = pop[i] - 1;
  }
}
parameters {
  real logit_in_pop;   // logit of inclusion probability
  real b;              // population-level effect on detection probability
  real Intercept;      // intercept for detection probability
  real logit_in_pop2;  // logit of probability of membership in pop. 2
}
transformed parameters {
  real<lower = 0, upper = 1> in_pop = inv_logit(logit_in_pop);
  real<lower = 0, upper = 1> in_pop2 = inv_logit(logit_in_pop2);
  vector[n_ind] lp_if_present;
  vector[n_ind] log_lik;
  for (i in 1:n_ind) {
    if (observed[i]) {
      lp_if_present[i] = bernoulli_lpmf(1 | in_pop)                // z[i] == 1
        + bernoulli_lpmf(is_pop2[i] | in_pop2)                     // prob. individual is in pop. 2
        + bernoulli_logit_lpmf(y[i] | Intercept + b * is_pop2[i]); // prob. detection given present
      log_lik[i] = lp_if_present[i]; // observed, so presence is certain
    } else {
      // augmented individuals, unknown population
      lp_if_present[i] = bernoulli_lpmf(1 | in_pop)
        + log_mix(in_pop2,                                      // P(in pop. 2)
                  bernoulli_logit_lpmf(y[i] | Intercept + b),   // pop. 2
                  bernoulli_logit_lpmf(y[i] | Intercept));      // pop. 1
      log_lik[i] = log_sum_exp(lp_if_present[i], bernoulli_lpmf(0 | in_pop)); // marginalize presence
    }
  }
}
model {
  // Priors
  logit_in_pop ~ normal(0, 2.5);
  Intercept ~ normal(0, 2.5);
  b ~ normal(0, 1);
  logit_in_pop2 ~ normal(0, 2.5);
  // Likelihood
  target += sum(log_lik);
}
generated quantities {
  int N;
  int N1;
  int N2;
  {
    vector[n_ind] lp_present; // [z = 1][y = 0 | z = 1] / [y = 0] on log scale
    int z[n_ind];
    int z1[n_ind];
    int z2[n_ind];
    for (i in 1:n_ind) {
      if (observed[i]) {
        z[i] = 1;
        z1[i] = is_pop2[i] == 0;
        z2[i] = is_pop2[i];
      } else {
        // Posterior membership given present but undetected: weight the
        // prior in_pop2 by each population's all-zero capture-history
        // probability, since undetected individuals are more likely to
        // come from the population with lower detection.
        real lp1 = log1m(in_pop2) + bernoulli_logit_lpmf(y[i] | Intercept);
        real lp2 = log(in_pop2) + bernoulli_logit_lpmf(y[i] | Intercept + b);
        lp_present[i] = lp_if_present[i] - log_lik[i];
        z[i] = bernoulli_rng(exp(lp_present[i]));
        z2[i] = z[i] * bernoulli_rng(exp(lp2 - log_sum_exp(lp1, lp2)));
        z1[i] = z[i] - z2[i];
      }
    }
    N = sum(z);
    N1 = sum(z1);
    N2 = sum(z2);
  }
}
```
The code I used to simulate the data and package things for Stan:
```r
library(tidyverse) # tibble / dplyr / tidyr pipeline below
library(rstan)

N_1 <- 500 # Pop 1 size
N_2 <- 700 # Pop 2 size
N <- N_1 + N_2 # Total pop
mean.p <- 0.67 # Detection Prob. 1
beta.p <- -1 # Logit difference for detection prob. 2
T <- 3 # Number of sampling events
# Create the mark-recapture dataset
# (ids restart within each pop, so an individual is identified by (id, pop))
data_mmulti <- tibble(id = c(rep(1:N_1, each = T), rep(1:N_2, each = T)),
                      pop = rep(c(rep(1, N_1), rep(2, N_2)), each = T),
                      samp = rep(1:T, N),
                      mean.p = mean.p,
                      beta.p = beta.p) %>%
  mutate(p = plogis(log(mean.p / (1 - mean.p)) + beta.p * (pop - 1)),
         pop = factor(pop)) %>%
  rowwise() %>%
  mutate(y = rbinom(1, 1, p)) %>%
  select(id, pop, samp, y) %>%
  pivot_wider(id_cols = c(id, pop),
              names_from = samp,
              names_prefix = 'samp',
              values_from = y)
# Just the detection histories
yfull <- data_mmulti %>%
  select(starts_with('samp'))
# Only individuals that were detected at least 1 time
ever.detected <- yfull %>%
  apply(1, max)
n_obs <- sum(ever.detected)         # individuals observed
yobs <- yfull[ever.detected == 1, ] # Detection histories
# Population covariate for observed individuals
pop_obs <- data_mmulti %>%
  mutate(count = samp1 + samp2 + samp3) %>%
  filter(count != 0) %>%
  pull(pop) %>%
  as.numeric()
# Augment the observed histories with n_aug all-zero capture histories
n_aug <- N_1 * 2 + N_2 * 2
yaug <- rbind(as.matrix(yobs), array(0, dim = c(n_aug, T)))
n_ind <- nrow(yaug)       # Number of individuals after augmentation
n_occasion <- ncol(yaug)  # Number of samples
# Package data for Stan
stan_data <- list(y = yaug, n_ind = n_ind, n_aug = n_aug,
                  n_obs = n_obs, n_occasion = n_occasion, pop = pop_obs)
# parameters to monitor
params <- c('Intercept', 'b', 'logit_in_pop', 'in_pop', 'logit_in_pop2', 'in_pop2', 'N', 'N1', 'N2')
# Initial values for parameters
inits <- function(){
list(Intercept = rnorm(1, 0, 2.5),
b = rnorm(1, 0, 1),
logit_in_pop = rnorm(1, 0, 2.5),
logit_in_pop2 = rnorm(1, 0, 1))}
# HMC settings
ni <- 2000 # iterations
nt <- 1 # thinning
nb <- 1000 # warm-up
nc <- 4 # chains
# Run model
out_mmulti <- stan("Mmulti.stan",
                   data = stan_data, init = inits, pars = params,
                   chains = nc, iter = ni, warmup = nb, thin = nt,
                   seed = 2,
                   open_progress = FALSE)
```
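For reference, I checked the estimates against the simulated truth (N_1 = 500, N_2 = 700) with a quick call to rstan's `print` method:

```r
# Posterior summaries for the monitored parameters
print(out_mmulti, pars = c('N', 'N1', 'N2', 'Intercept', 'b', 'in_pop2'))
```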