Closed population capture-recapture model on multiple populations

Hi all,

I’ve been struggling with how to formulate a capture-recapture model that suits my purposes and hope the Stan community can steer me towards a solution. The data I have were collected from several populations (of fish) that occupy different locations. The nature of my data (and the question I want to answer) is such that I’m willing to treat each of these locations as separate and discrete with no individuals entering or exiting each location during the sampling, i.e. each (sub)population is closed. I would like an estimate of how many individuals occupy each of those locations, which means that my model needs some additional wrinkles beyond the most readily available Stan capture-recapture models. In other words, I want to estimate the individual (sub)population sizes,

N_1, N_2, ... N_k,
rather than

N = \sum_{i =1}^{k}N_i

The closest formulation I’ve seen occurs in this forum post, Capture-Recapture model with partial or complete pooling. The Stan code linked in that topic treats N as a (potentially) multi-dimensional array and the model statement loops over the different indices. That’s a valid approach, but seems a bit inefficient and I’d rather be able to pass the population variables into the model using the perhaps more familiar (G)LM-type approach with a design matrix and vector(s) of parameters for each column in the design matrix. Eventually, I’d like to incorporate partial pooling, but that’s more of a wish list item at this point and I digress…

I decided to build the model from the ground-up drawing upon a couple of excellent resources, namely Hiroki Ito’s Stan translations of the Bayesian Population Analysis book and @mbjoseph translations of Spatial Capture-Recapture, especially the model that treats sex as an individual covariate. To begin, I thought I’d create a simple simulation and model with 2 populations of different size and different detection probabilities.

I took a data augmentation approach to model fitting, which introduces some challenges with respect to estimating the (sub)population sizes, since individuals that are not captured but which occur in a (sub)population need to be assigned to a (sub)population. With only 2 populations, I was able to borrow the approach taken for using sex as an individual covariate, substituting (sub)population for sex. The model fits the simulated data reasonably well, but there’s some bias in the (sub)population estimates, which is typical for this number of individuals, sample events and detection probabilities. So it seems like the model works!

However, I could quickly imagine this approach getting out of hand, since one would need a slew of (nested?) calls to log_mix for all the (pairs of) populations. My data has 24 locations with (sub)populations, so repeated calls to log_mix feels like some combination of obnoxious and intractable, at least given the implementation below.

Any tips, tricks, suggestions and improvements are welcome. I’m definitely in the steep part of the learning curve in building models in Stan and am happy to draw upon the collective wisdom of the crowd here!

Thanks!


The model code is:

data {
  int<lower = 1> n_ind;               // Size of augumented data set
  int<lower = 0> n_aug;         // Number of augmented individuals
  int<lower = 0> n_obs;             // Size of observed data set
  int<lower = 1, upper = 2> pop[n_obs]; // Population 1 or 2
  int<lower=0> n_occasion;               // Number of sampling occasions
  int<lower=0,upper=1> y[n_ind, n_occasion]; // Capture-history matrix
  }

transformed data {
  int<lower = 0, upper = 1> observed[n_ind];
  int<lower = 0, upper = 1> is_pop2[n_obs];
  
 for (i in 1:n_ind) {
    if (sum(y[i]) > 0) {
      observed[i] = 1;
    } else {
      observed[i] = 0;
    }
  }
  
  for (i in 1:n_obs) {
    is_pop2[i] = pop[i] - 1;
  }
}

parameters {
  real logit_in_pop;  // Inclusion probability
  real b;      // population-level effects on detection probability
  real Intercept;    // intercept for detection probability
  real logit_in_pop2;   // logit of being in pop. subset
}

transformed parameters {
  real <lower = 0, upper = 1> in_pop = inv_logit(logit_in_pop);
  real <lower = 0, upper = 1> in_pop2 = inv_logit(logit_in_pop2);
  
  vector[n_ind] lp_if_present;
  vector[n_ind] log_lik;

  for (i in 1:n_ind){
    if (observed[i]){
      lp_if_present[i] = bernoulli_lpmf(1 | in_pop) // z[i] ==1
       + bernoulli_lpmf(is_pop2 | in_pop2) // probability individual is in pop2
       + bernoulli_logit_lpmf(y[i]| Intercept + b * is_pop2[i]); // prob. detection given present 
       log_lik[i] = lp_if_present[i]; // Probability not in population
    } else {
      // augmented individuals, unknown popoulation
      lp_if_present[i] = bernoulli_lpmf(1 | in_pop)
        + log_mix(in_pop2, // P(in pop 2)
            bernoulli_logit_lpmf(y[i] | Intercept + b), // Pop 2
            bernoulli_logit_lpmf(y[i] | Intercept) // Pop 1
          );
      log_lik[i] = log_sum_exp(lp_if_present[i], bernoulli_lpmf(0 | in_pop));
      }
    }

}

model {
  // Priors 
  logit_in_pop ~ normal(0, 2.5);
  Intercept ~ normal(0, 2.5);
  b ~ normal(0, 1);
  logit_in_pop2 ~ normal(0, 2.5);
  // Likelihood
  target += sum(log_lik);
  
  }
  
  
  
generated quantities {
  int N;
  int N1;
  int N2;
  
  {
    vector[n_ind] lp_present; // [z = 1][y = 0 | z = 1] / [y = 0] on log-scale
    int z[n_ind];
    int z1[n_ind];
    int z2[n_ind];
    
    for (i in 1:n_ind) {
      if (observed[i]) {
        z[i] = 1;
        z1[i] = is_pop2[i] == 0;
        z2[i] = is_pop2[i];
      } else {
        lp_present[i] = lp_if_present[i] - log_lik[i];
        z[i] = bernoulli_rng(exp(lp_present[i]));
        z1[i] = z[i] * bernoulli_rng(1-in_pop2);
        z2[i] = z[i] * (1 - z1[i]);
      }
    }
    
    N = sum(z);
    N1 = sum(z1);
    N2 = sum(z2);
  }
  
}


The code I used to simulate the data and package things for Stan:

N_1 <- 500 # Pop 1 size
N_2 <- 700 # Pop 2 size
N <- N_1 + N_2 # Total pop
mean.p <- 0.67 # Detection Prob. 1
beta.p <- -1 # Logit difference for detection prob. 2
T <- 3 # Number of sampling events

# Create the mark-recpature dataset
data_mmulti <- tibble(id = c(rep(1:N_1, each = T), rep(1:N_2, each = T)),
                      pop = rep(c(rep(1, N_1), rep(2, N_2)), each = T),
                      samp = rep(1:T, N), 
                      mean.p = mean.p, 
                      beta.p = beta.p) %>%
  mutate(p = plogis(log(mean.p / (1-mean.p)) + beta.p*(pop-1)),
         pop = factor(pop)) %>%
  rowwise() %>%
  mutate(y = rbinom(1, 1, p)) %>%
  select(id, pop, samp, y) %>%
  pivot_wider(id_cols = c(id, pop), 
                            names_from = samp, 
                            names_prefix = 'samp', 
                            values_from = y)

# Just the detection histories
yfull <- data_mmulti %>%
  select(starts_with('samp'))
# Only individuals that were detected at least 1 time
ever.detected <-  yfull %>%
  apply(1, max)

n_obs <- sum(ever.detected) # individuals observed
yobs <- yfull[ever.detected ==1,] # Detection histories

# Population covariate for observed individuals
pop_obs <- data_mmulti %>%
  mutate(count = samp1 + samp2 + samp3) %>%
  filter(count != 0) %>%
  pull(pop) %>%
  as.numeric()

# Augment population by nz individuals
n_aug <- N_1*2 + N_2*2
yaug <- rbind(simplify2array(yobs), array(0, dim = c(n_aug, T)))

n_ind <- nrow(yaug) # Number of individuals after augmentation
n_occasion <- ncol(yaug) # Number of samples

# Package data for Stan
stan_data = list(y = yaug, n_ind = n_ind, n_aug = n_aug, 
                                n_obs=n_obs, n_occasion = n_occasion, pop = pop_obs)

# parameters to monitor
params <- c('Intercept', 'b', 'logit_in_pop', 'in_pop', 'logit_in_pop2', 'in_pop2', 'N', 'N1', 'N2')

# Initial values for parameters
inits <- function(){
  list(Intercept = rnorm(1, 0, 2.5), 
       b = rnorm(1, 0, 1),
       logit_in_pop = rnorm(1, 0, 2.5),
       logit_in_pop2 = rnorm(1, 0, 1))}

# HMC settings
ni <- 2000 # iterations
nt <- 1 # thinning
nb <- 1000 # warm-up
nc <- 4 # chains
# Run model
out_mmulti <- stan("Mmulti.stan",
               data = stan_data, init = inits, pars = params,
               chains = nc, iter = ni, warmup = nb, thin = nt,
               seed = 2,
               open_progress = FALSE)

So it’s not clear to me why you need to do this using log_mix over the subpopulations. Why not just treat the problem as 24 separate augmented capture history matrices?

That is, set it up something like this:

data {
  int<lower = 1> n_ind;           // Fixed size of augumented data across subpopulations 
  array[24] int<lower = 0> n_aug;    // Number of augmented individuals for each subpopulation
  array[24] int<lower = 0> n_obs;   // Number of observed individuals set for each subpopulation
//Note: you don't really need both of these since n_aug[i] + n_obs[i] = n_ind
  int<lower=0> n_occasion;               // Number of sampling occasions
  array[24] int<lower=0,upper=1> y[n_ind, n_occasion]; // Capture-history matrices
  }
parameters {
  array[24] real logit_in_pop;  // Inclusion probability
  array[24] real<lower = 0, upper = 1> p;
}

You could do partial pooling across the subpopulations on the detection probability side of things. But I wouldn’t do it on inclusion probability because that’s going to be set by the size of the largest number of individuals in any subpopulation + some extra for augmentation. It’s more convenient to write it this way so that every capture history matrix is the same size.

2 Likes

Hey @Dalton ,

Thanks for the exceptionally quick response and please excuse the delay in my reply - juggling a few things at the moment so didn’t have a chance to play with this alternative for several days.

I agree that avoiding log_mix is probably preferable. I probably went there first out of a lack of imagination more than anything else. However, sticking everything into an array doesn’t seem to be an off-the-shelf solution either. First, sticking a mutlidimensional integer (array) into another array doesn’t seem to be allowed (at least not with the version of Stan packaged with Rstan at present), so array[n_pop] int<lower = 0, upper =1> y[n_ind, n_occasion];
produces an error. I got something to run by moving all of the dimensions out front, array[n_pop, n_ind, n_occasion] int<lower = 0, upper = 1> y;, however, sampling is not recovering the simulated detection probability or population size. The probability of augmented individuals being in the population (in_pop) is always estimated to be very high (near the boundary at 1) and detection probability (p) is correspondingly very low. As you’d expect, population sizes are also on the upper boundary of n_ind. Increasing the number of individuals doesn’t seem to make a difference to any of this, so I suspect there’s something structurally wrong with my model specification.

I’ve played with this for a couple of days and all of my attempts lead to the same unhappy end. I’m open to the idea that some of this maybe a product of working with the old release of Stan that comes with Rstan, but I’m not sure enough that the solution lies in CmdStanR to invest in switching.

Open to any ideas about potential resolutions!

Re-posting my Stan code and reprex simulation below…


 data {
  int<lower = 1> n_pop;          // Number of populations
  int<lower = 1> n_ind;               // Size of augumented data set across all populations
  // array[n_pop] int<lower = 0> n_aug;         // Number of augmented individuals
  array[n_pop] int<lower = 0> n_obs;             // Size of observed data set for each population
  int<lower=0> n_occasion;               // Number of sampling occasions
  array[n_pop, n_ind, n_occasion] int<lower=0,upper=1> y; // Capture-history matrix
}

transformed data {
  array[n_pop, n_ind] int<lower = 0, upper = 1> observed;
  array[n_pop] int<lower = 0> n_aug; // Number of augmented individuals in each population
  
  for (i in 1:n_pop){
    n_aug[i] = n_ind - n_obs[i];
  }
  
  for (j in 1:n_pop){
    for (i in 1:n_ind) {
      if (sum(y[j,i,]) > 0) {
        observed[j, i] = 1;
      } else {
        observed[j, i] = 0;
      }
    }
  }

}

parameters {
  array[n_pop] real logit_in_pop;  // Inclusion probability
  array[n_pop] real logit_p;      // detection probability
  
}

transformed parameters {
  array[n_pop] real <lower = 0, upper = 1> in_pop; 
  array[n_pop] real <lower = 0, upper = 1> p; 
  in_pop = inv_logit(logit_in_pop);
  p = inv_logit(logit_p);

}

model {
  // Priors 
  logit_in_pop ~ normal(0, 2.5);
  logit_p ~ normal(0, 2.5);
 
  // Likelihood
  for (j in 1:n_pop){
    for (i in 1:n_ind){
      if (observed[j,i]){
      // z[i] == 1
        target += bernoulli_logit_lpmf(1 | logit_in_pop[j]) + 
        binomial_logit_lpmf(observed[j,i] | n_occasion, logit_p[j]);
      } else { // s[i] == 0
        target += log_sum_exp( bernoulli_logit_lpmf(1 | logit_in_pop[j])  + // z[i] == 1
                              binomial_logit_lpmf(0 | n_occasion, logit_p[j]),
                             bernoulli_logit_lpmf(0 | logit_in_pop[j])); // z[i] == 0
      }
    }  
  }  
}

generated quantities {
  // prob present given never detected
  array[n_pop] real in_pop_nd;
  array[n_pop] int<lower=1, upper=n_ind> N;
  
  for (i in 1:n_pop){
    in_pop_nd[i] = (in_pop[i] * (1 - p[i])^n_occasion) / (in_pop[i] * (1 - p[i])^n_occasion + (1 - in_pop[i]));
    N[i] = n_obs[i] + binomial_rng(n_ind-n_obs[i], in_pop_nd[i]);
  }
  
}

# Data simulation function
# pop = number of populations
# N = number in pop
# p = detection prob
# n_occ = number of samples
data_m0_multi.fn <- function(pop = 1, N = 100, p = 0.5, n_occ = 3){
  yfull <- yobs <- array(NA, dim = c(max(N), n_occ, pop))
  for (i in 1:pop){
    for (j in 1:n_occ){
      yfull[1:N[i],j,i] <- rbinom(n = N[i], size = 1, prob = p[i])
    }
  }
  ever.detected <- apply(yfull, c(1,3), max)
  C <- apply(ever.detected, 2, sum, na.rm = T)
  for (i in 1:pop){
    yobs[1:C[i], , i] <- yfull[ever.detected[,i]==1 & !is.na(ever.detected[,i]), , i]
    cat(C[i], "out of", N[i], "animals present were detected.\n")
  }
  
  return(list(pop = pop, N = N, p = p, C = C, n_occ = n_occ, yfull = yfull, yobs = yobs))
}

data_m0  <- data_m0_multi.fn(pop = 4, 
                             N = round(rnorm(4, 200, 40)), 
                             p = runif(4, 0.1, 0.7), 
                             n_occ = 3)


# Total size of population after augmentation
n_tot <- 750

yaug <- array(0, dim = c(n_tot, data_m0$n_occ, data_m0$pop))

# Add detection histories for observed individuals
for (i in 1:data_m0$pop){
  yaug[1:data_m0$C[i], , i] = data_m0$yobs[1:data_m0$C[i], , i]
}

# Change order of array indices to correspond to Stan model
yaug = aperm(yaug, c(3,1,2))

# Package model data
stan_data = list(n_pop = data_m0$pop, n_ind = n_tot, 
                 n_obs = data_m0$C, n_occasion = data_m0$n_occ, 
                 y = yaug)
  
# parameters to monitor
params <- c('logit_p', 'logit_in_pop', 'p', 'in_pop', 'N')

inits <- function(){
  list(logit_p = rnorm(4, 0, 2.5), logit_in_pop = rnorm(4, 0, 2.5))}

# HMC settings
ni <- 2000 # iterations
nt <- 1 # thinning
nb <- 1000 # warm-up
nc <- 4 # chains

# Run model
out <- stan("Mmulti_v2.stan",
            data = stan_data, init = inits, pars = params,
            chains = nc, iter = ni, warmup = nb, thin = nt,
            seed = 2,
            open_progress = FALSE)

I just took a quick look at your code - but haven’t had a chance to run it yet. I don’t think your issue is rstan versus cmdstanr. (FWIW, I think the switch to cmdstanr is well worth the time investment.) And you were absolutely correct that I incorrectly defined the array for the capture history matrix.

Anyway, I think your problem is this bit here from the likelihood (line 55 by my reckoning):

binomial_logit_lpmf(observed[j,i] | n_occasion, logit_p[j]);

observed[j,i] is always going to be 1 (because of the if statement just above). Based on Hiroki Ito’s translations that you linked to, I think you want that part of the likelihood to be something like:

binomial_logit_lpmf(sum(y[j,i]) | n_occasion, logit_p[j]);

I’ll test it out on my end tomorrow and see if that behaves as expected. But I’m pretty sure that’s your issue now.

1 Like

Yep, you spotted it correctly, @Dalton. Silly mistake on my part…

Hiroki Ito’s translations implement a few different strategies in the transformed data block to summarize capture histories across the different flavors of basic capture-recapture models. I suppose I just stacked the bits and pieces together incorrectly. Glad to have that sorted and be able to continue developing this model to suit my purposes.

Also, I tend to agree with you about cmdstanr and making that move is on my to-do list. I had some install issues that seem to have been resolved with the latest cmdstanr release, so should be able to take the plunge soon.

Thanks again!

1 Like