Thanks for posting. I took a look through the first one, and overall, I’d suggest adding some more text to connect the dots on the models for readers. I imagine even readers who use these models are not going to be well acquainted with why they are set up the way they are.
I’d suggest reading and citing this paper from @monnahc—it’s aimed at ecologists and it has some really nice evaluations.
I also wrote the occupancy case study for Stan, which is ancient and may not even run any more: Multiple Species-Site Occupancy Model
Rather than mixing the frequentist likelihood notation \mathcal{L} and conditioning |, I would suggest just following Gelman et al. in Bayesian Data Analysis and writing densities as p(y \mid \psi, p).
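For example (my guess at the single-season setup you're using, so double-check against your definitions), the marginal likelihood for site $i$ in that notation would be

$$
p(y_i \mid \psi_i, p_i) = \psi_i \prod_{j=1}^{J_i} \operatorname{Bernoulli}(y_{ij} \mid p_{ij}) + (1 - \psi_i)\, \mathbb{I}\!\left[\textstyle\sum_j y_{ij} = 0\right],
$$

which is the two-component mixture that the Stan code below assembles with `log_sum_exp` on the log scale.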
> I’m using mostly flat or weakly informative priors.
People new to Bayes fret endlessly about priors. I’d stick to weakly informative versions and explain that they’re better both for computation and for statistics. Gelman has a great paper on this, where he analyzes priors on variance: https://sites.stat.columbia.edu/gelman/research/published/taumain.pdf
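As a concrete illustration (just a sketch, not tied to your model), the difference is a single line in the model block:

```stan
parameters {
  real<lower=0> sigma;   // a scale parameter, e.g., a hierarchical standard deviation
}
model {
  sigma ~ normal(0, 1);  // weakly informative half-normal (via the lower bound);
                         // deleting this line leaves an improper flat prior on sigma
}
```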
Before asking people to install and run a package like `pacman`, I’d suggest explaining what it is. The fewer dependencies there are in a case study, the longer lived it will tend to be.
Rather than this:
```stan
vector[I] log_psi = log(psi), log1m_psi = log1m(psi);
for (i in 1:I) {
  if (Q[i]) {
    target += log_psi[i] + bernoulli_lpmf(y[i, 1:J[i]] | p[i, 1:J[i]]);
  } else {
    target += log_sum_exp(log1m_psi[i],
                          log_psi[i] + bernoulli_lpmf(y[i, 1:J[i]] | p[i, 1:J[i]]));
  }
}
```
you can write this:
```stan
for (n in 1:I) {
  int len = J[n];
  array[len] int ys = y[n, 1:len];
  row_vector[len] ps = p[n, 1:len];
  if (Q[n]) {
    target += log(psi[n]) + bernoulli_lpmf(ys | ps);
  } else {
    target += log_mix(psi[n], bernoulli_lpmf(ys | ps), 0);
  }
}
```
To make this even more efficient, you could sort out the indexes for which `Q[n] != 0` in the transformed data block and cut out the conditional. It’s right on the edge of being worthwhile, but I’d be tempted to write this function:
```stan
array[] int get_vals(array[,] int y, array[] int lengths, int n) {
  int len = lengths[n];
  return y[n, 1:len];
}
```
and similarly for real values, then replace all of the indexing in the code with calls to `get_vals`.
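The real-valued version would be something along these lines (assuming `p` is a matrix as in your code; the name `get_vals_r` is just a placeholder):

```stan
row_vector get_vals_r(matrix p, array[] int lengths, int n) {
  // pull out the first lengths[n] entries of row n
  return p[n, 1:lengths[n]];
}
```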
It might be clearer to write `beta_lupdf(psi_a | 1, 1)` as `uniform_lupdf(psi_a | 0, 1)` if your audience isn’t familiar with beta distributions. This is implicitly a standard logistic prior on the log odds, and you can frame it that way, too, especially if you want to do hierarchical modeling, which is much easier on the unconstrained scale.
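If it helps readers, the equivalent framings could be shown side by side (a sketch; `logit_psi_a` is a name I’m making up for the unconstrained parameter):

```stan
// three equivalent ways to state this prior (pick one):
psi_a ~ beta(1, 1);       // flat on the probability scale
psi_a ~ uniform(0, 1);    // identical density
// or declare the parameter on the unconstrained scale,
//   real logit_psi_a;  ...  real psi_a = inv_logit(logit_psi_a);
// and give the log odds a standard logistic prior:
logit_psi_a ~ logistic(0, 1);
```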
Not your fault—it’s R, or more specifically the tidyverse, but I really dislike the term `tibble` and that it wastes a line reminding you of it in every output.
Usually one would use `loo` to compare models, though estimating log predictive density also works. Are the draws really iid, though? I think they’re all different sizes.
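If you do go the `loo` route, the only Stan-side addition is a pointwise log-likelihood. A rough sketch mirroring the model block above (same assumed data structures; `psi` and `p` would need to be visible here, e.g., defined in transformed parameters):

```stan
generated quantities {
  vector[I] log_lik;  // one entry per site, for loo/PSIS
  for (n in 1:I) {
    int len = J[n];
    if (Q[n]) {
      log_lik[n] = log(psi[n]) + bernoulli_lpmf(y[n, 1:len] | p[n, 1:len]);
    } else {
      log_lik[n] = log_mix(psi[n], bernoulli_lpmf(y[n, 1:len] | p[n, 1:len]), 0);
    }
  }
}
```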
I don’t see the connection between the multinomial and the occupancy model. The occupancy model was a mixture and I don’t see that in the multinomial model. If this is still something like a mixture, shouldn’t this be regular addition rather than log-sum-exp?
```stan
lp[1] = log_sum_exp(log1m(psi_a), lp[1]);
```
I think it might make more sense to talk about mixtures first, since you use one in the first example.
`log_mix(lambda, lps)` can be used for multi-component mixtures, where `lambda` is a simplex and `lps` is a vector (or maybe array) of log densities.
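Something like this for a K-component mixture (a generic sketch, not your model; `theta`, `mu`, `sigma`, and `x_n` are made-up names):

```stan
// theta: simplex[K] of mixture weights; lps[k]: log density of x_n under component k
vector[K] lps;
for (k in 1:K) {
  lps[k] = normal_lpdf(x_n | mu[k], sigma[k]);
}
target += log_mix(theta, lps);
```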
I like the illustration of setting `K`.
This is doing a lot of extra work. The way to make Stan code more efficient is to remove redundant calculations. For instance,
```stan
vector[I] psi = rep_vector(psi_a, I);
vector[I] lambda = rep_vector(lambda_a, I);
matrix[I, J_max] p = rep_matrix(p_a, I, J_max);
// log-transform occupancy probabilities
vector[I] log_psi = log(psi), log1m_psi = log1m(psi);
```
can be
```stan
vector[I] log_psi = rep_vector(log(psi_a), I);
vector[I] log1m_psi = rep_vector(log1m(psi_a), I);
```
Even better, avoid putting these into matrix form and rely on broadcasting.
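Concretely, something like this (a sketch; it assumes `p_a` is the shared detection probability, as in the declarations above):

```stan
real log_psi_a = log(psi_a);
real log1m_psi_a = log1m(psi_a);
for (n in 1:I) {
  int len = J[n];
  if (Q[n]) {
    target += log_psi_a + bernoulli_lpmf(y[n, 1:len] | p_a);  // scalar p_a broadcasts over y
  } else {
    target += log_mix(psi_a, bernoulli_lpmf(y[n, 1:len] | p_a), 0);
  }
}
```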
I would start the last section by mentioning that it’s going to take the form of a survival model with right censoring.
I found the \min notation in the definition of y confusing, and I don’t know what `pmin` does in R. I think what you’re saying is that z_{m, n} \sim \text{exponential}(...) and y_{m, n} = \min(z_{m, n}, \Delta_{m, n}). I didn’t understand the reasoning behind setting \Delta = 1. Nor am I clear on why there is even a minimum value here. It looks like the units of y are number of detections, but \Delta is time. I could use some help with the units. In Stan, the \min operation is going to introduce a discontinuity in the model, which can be difficult for optimization and HMC.
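For what it’s worth, the usual way to code right censoring in Stan avoids the \min entirely: keep the censoring threshold in the data and use the complementary CDF for the censored cases. A sketch (the names `t_obs`, `t_cens`, and `lambda` are mine, not from the case study):

```stan
// observed event times contribute the full exponential density
target += exponential_lpdf(t_obs | lambda);
// right-censored times only tell us z > Delta, so use the log survival function (CCDF)
target += exponential_lccdf(t_cens | lambda);
```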