Marginalizing occupancy models for Stan implementation

Hi all–my first post here. I’ve written up a document for non-experts like myself to better understand how marginalization works in the context of the site-occupancy model for biological species surveys. Ecologists are more familiar with a parameterization that explicitly models a latent binary occupancy state; this state needs to be marginalized out in order to fit the model in Stan.

https://jsocolar.github.io/marginalizedOccupancy/

The Stan Case Studies include a Stan translation of a more complicated occupancy model. That case study relies on the absence of visit-specific covariates to do the marginalization, and I think many researchers in my community would find it difficult to make the jump from the case study to other classes of occupancy model.

Feel free to contact me with corrections or comments. Likewise, if there is anywhere I should post this to make it more accessible to potential Stan users, I’m happy to do so.

Best,
Jacob Socolar


As an ecologist who has mostly avoided Stan for lack of time to learn marginalization and efficient Stan scripting: thanks. One suggestion would be to report effective samples per minute, including for the Stan model; this will be even more impressive than the general speedup alone. It would also be neat to use the new ESS calculations, including for the 95% credible intervals of the slowest-converging parameter. This is likely where marginalization, and especially Stan, will shine.


I also find the latent binary parameter formulation natural statistically. We’re not trying to hide that or get rid of it. We just want to compute it robustly and efficiently.

The only robust way to compute posteriors that are at all complicated is to marginalize out any latent binary parameters and then do inference on them in expectation. I tried to explain how to do this, and why it's so much more efficient, for a very simple model in the very first example of the latent discrete parameters chapter of the user's guide. That's true even if you're working in BUGS or JAGS: they're terrible at sampling discrete parameters because nobody knows how to sample discrete parameters effectively (you can tell by the poor mixing on simulated data; the problem is the lack of gradient information to guide sampling). There's a neat paper explaining this that also considers the marginalized JAGS/BUGS models.
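To make that concrete (my notation for a generic single-season site, not taken from any of the posts): with occupancy probability $\psi$, latent occupancy state $z \in \{0, 1\}$, and detection history $y_1, \ldots, y_J$ with visit-specific detection probabilities $p_j$, summing $z$ out of the joint gives

$$
p(y) = \sum_{z \in \{0,1\}} p(y \mid z)\, p(z)
     = \psi \prod_{j=1}^{J} p_j^{y_j} (1 - p_j)^{1 - y_j}
     + (1 - \psi)\, \mathbf{1}\!\left[ \sum_{j=1}^{J} y_j = 0 \right].
$$

On the log scale this is exactly the log_sum_exp construction that shows up in the Stan code later in this thread.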

Luckily, before MCMC, ecologists used EM to fit maximum marginal likelihood estimates. EM requires exactly the same marginalizations of discrete parameters. So all you need to do is go back to the 1980s literature and you can find all the marginalizations. Alas, this doesn’t make them any easier to understand.

If you want to see this in practice, Hiroki Itô has translated the examples from the Kéry and Schaub Bayesian Population Analysis book.


Hi Jacob,
Thank you very much for your post! I am new to Bayesian modelling, and reading your tutorial was very helpful after several hours of getting errors. Now I can run occupancy models on simulated data, but I haven't been able to find a way to fit these models in Stan using real data, given that the data include NAs in some cells of the detection matrix. Any advice on how to deal with this issue?

Thanks,

Sebastian


Welcome @Sebastian_Botero! Here’s one solution (include a binary indicator for whether there is missing data for a sampling occasion): Ragged data and mixture (occupancy) model

You can additionally provide a fill value like -1 for your observation matrix y, to make sure you get an error if you accidentally pass a padded entry to a bernoulli or binomial pmf.


Welcome indeed @Sebastian_Botero! I pack my data so that the detection/non-detection matrix has one row per site (or per species-site in a multispecies model), and one column for each visit up to the maximum number of visits.

Note that if detection probabilities are constant within rows, you don't need to worry about the ragged data structure at all. Instead, you can simply pass two integer arrays giving the number of detections and the number of visits per row, and use binomial sampling. Exploit this trick whenever possible; it is dramatically faster (both to write and to fit)!
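For what it's worth, here is a minimal sketch of that shortcut, assuming constant occupancy and detection probabilities and hypothetical data names ndet and nvis (detections and visits per site); adapt the names and add covariates and priors as needed:

```stan
data {
  int<lower=0> n_unit;          // number of sites
  int<lower=0> nvis[n_unit];    // number of visits per site
  int<lower=0> ndet[n_unit];    // number of detections per site
}
parameters {
  real<lower=0, upper=1> psi;   // occupancy probability
  real<lower=0, upper=1> p;     // detection probability (constant within rows)
}
model {
  for (i in 1:n_unit) {
    if (ndet[i] > 0) {
      // at least one detection: the site is certainly occupied
      target += log(psi) + binomial_lpmf(ndet[i] | nvis[i], p);
    } else {
      // never detected: occupied-but-missed, or unoccupied
      target += log_sum_exp(log(psi) + binomial_lpmf(0 | nvis[i], p),
                            log1m(psi));
    }
  }
}
```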

If you do have detection probabilities that vary within rows, then @mbjoseph's solution in the thread linked above will work well, and padding out the array with negative numbers is a great way to prevent errors. If efficiency is a big concern, you can also pack the matrix so that all NAs (coded as -1 or similar) are trailing within their rows*, and then pass as data an integer array V giving the number of visits to expect in each row (making sure to pack the visit-specific covariates in a corresponding manner, again with dummy values, e.g. -9999, to pad out the array). Then you'll loop over the rows i and the visits j, and the only modification to how I imagine your existing code looks is that the loop over visits needs to range over for (j in 1:V[i]) rather than for (j in 1:max(V)); see the sketch after the footnote below.

*by “all NAs trailing in their rows” I mean that if you visited a given site during sampling rounds 1, 3, and 4, with no visit in round 2, you can redefine visit 3 to be visit 2, and visit 4 to be visit 3, so that you have observed data in columns 1, 2, and 3 and an NA (coded as -1 or similar) in column 4. If you have 2 NAs, they should go in columns 3 & 4, etc.
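Here is a sketch of what that likelihood loop might look like (my variable names: y for the padded detection matrix, Q for the at-least-one-detection indicator, theta for the visit-level detection probabilities, V for the visit counts); it is not a drop-in replacement for any particular model:

```stan
// Likelihood with trailing-NA packing: only the first V[i] columns of
// row i hold real data, so the visit loop ranges over 1:V[i].
for (i in 1:n_unit) {
  if (Q[i] == 1) {
    // at least one detection: site is certainly occupied
    target += log(psi[i]);
    for (j in 1:V[i])
      target += bernoulli_lpmf(y[i, j] | theta[i, j]);
  } else {
    // never detected: occupied-but-missed, or unoccupied
    real lp_occ = log(psi[i]);
    for (j in 1:V[i])
      lp_occ += log1m(theta[i, j]);
    target += log_sum_exp(lp_occ, log1m(psi[i]));
  }
}
```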


Thank you very much @mbjoseph and @jsocolar for your prompt answers! I do have different detection probabilities for each sampling occasion, but with the way my model is marginalized, I don't loop over j; the statements are vectorized. I'm not quite sure where (or whether it's possible) to place the conditional on sampled. I would really appreciate your advice on how to include the conditional for each sample, or if you could direct me towards a different way to structure the model. Again, thank you!

Here is my code:

```stan
data {
  int<lower=0> n_unit; //number of sites
  int<lower=0> n_rep; //number of sampling events
  int<lower=0, upper=1> det_data[n_unit, n_rep]; //detection history
  int<lower=0, upper=1> Q[n_unit]; // indicator: at least one detection
  real ndviSD[n_unit]; //occ covariate 2
  real RH[n_unit, n_rep]; //detection covariate
  int<lower=0, upper=1> sampled[n_unit, n_rep]; // indicator of whether a site was sampled
}
parameters {
  real a0; 
  real a1; 
  real b0;
  real b1;
 
}
transformed parameters{
  real psi[n_unit];
  real theta[n_unit, n_rep];
  for(i in 1:n_unit){
    psi[i] = inv_logit(a0 + a1 * ndviSD[i] );
    for(j in 1:n_rep){
        theta[i,j] = inv_logit(b0 + b1 * RH[i,j]);
    }
  }
}
model {
  for(i in 1:n_unit){
    if(Q[i] == 1) {
      target += log(psi[i]);
      target += bernoulli_lpmf(det_data[i] | theta[i]); //vectorized over repeat sampling events j
    }
    if(Q[i] == 0) {
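      // note: the three log1m terms hard-code n_rep == 3 visits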
      target += log_sum_exp(log(psi[i]) + log1m(theta[i,1]) + log1m(theta[i,2]) + log1m(theta[i,3]), 
                    log1m(psi[i]));
    }
  }
}
```

EDIT: @maxbiostat edited this post for syntax highlighting.


Where you have target += bernoulli_lpmf(det_data[i] | theta[i]);, you need to pull out the indices of det_data[i] and theta[i] that correspond to visits that actually occurred. If the arrays are packed so that all NAs are trailing within their rows, this could be det_data[i, 1:V[i]] where V is an integer array of the number of visits, passed as data.
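In terms of the code above, that could look something like the following sketch (with V passed as data, and rep_array(0, V[i]) standing in for the all-zeros detection history in the non-detection branch, which replaces the hard-coded log1m terms):

```stan
for (i in 1:n_unit) {
  if (Q[i] == 1) {
    target += log(psi[i]);
    // only the V[i] visits that actually occurred
    target += bernoulli_lpmf(det_data[i, 1:V[i]] | theta[i, 1:V[i]]);
  } else {
    // all-zero history over the observed visits, or unoccupied
    target += log_sum_exp(
      log(psi[i]) + bernoulli_lpmf(rep_array(0, V[i]) | theta[i, 1:V[i]]),
      log1m(psi[i]));
  }
}
```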

Got it. Thank you very much!