Multi-species occupancy - no detection history

Hello everyone,

I’m a Stan novice, and trying to fit my first model. I was hoping folk here could give me:
(a) some reassurance that the model is doing what I think it is, and
(b) pointers for improving model efficiency.

Background
I have occupancy data for 324 plant species at 128 sites, but only as presence/absence, without a detection history. The sites are small and the surveys were exhaustive, so the assumption that detection = 1 is hopefully fairly reasonable.
Ultimately, I’m looking to determine the influence of site context on species occupancy.

Model (see code below)
I want to model plant species’ occupancy as influenced by 13 potential predictors from various site and landscape factors (I’m hoping to drop some of these during model refinement).
I anticipate that different species will respond to different predictors, so I’m seeking species-specific intercept and slope terms.
I was planning to have some interactions as well, but was hoping to get the model working with just additive terms initially.

Questions

  • Does this model do what I want it to? I extracted species-specific intercept and slope estimates from preliminary runs, although most have very wide CIs spanning 0. I have a bit of experience fitting MSOMs in JAGS with species’ detection histories, with the data in 3D arrays. I assumed that in this case I can simplify the data to a single long vector because I only have a single presence/absence record for each species. However, will the model still recognise that multiple species records are from the same site? Do I need to include some sort of site identifier in the model? Would this help to narrow the CIs? (I did try this with a 2D site x species matrix, but was struggling to get the code to work, and I’m not sure the output would have been any different from the version below anyway.)

  • Can I make this model structure more efficient? A preliminary run with 1k iterations worked (run time ~4 hours), but had poor Rhat and ESS values. I increased to 10k iterations, but it was slow (the 4 chains took 50, 55, 74 and 107 hours respectively), and I think RStudio has been caught in some sort of fatal memory issue(?) The chains finished a week ago, but RStudio is still working on something. I’ve read that vectorising the code can improve efficiency, but when I’ve tried this I get an “expression ill formed” error, and I’m not certain why. Is it possible to vectorise this structure whilst retaining species-specific terms?

Any suggestions and advice would be very welcome. Many thanks.

Code

// Model
data {
int N;  // Number of rows in data
int N_sp; // Number of species (324)
int sp [N]; // Vector length N of species ID codes
int<lower=0, upper=1> Occ [N]; // Vector length N of 0/1 occupancy

// Predictors, each a vector of length N
vector [N] p1;
vector [N] p2;
vector [N] p3;
vector [N] p4;
vector [N] p5;
vector [N] p6;
vector [N] p7;
vector [N] p8;
vector [N] p9;
vector [N] p10;
vector [N] p11;
vector [N] p12;
vector [N] p13;

}

parameters {
// One intercept for each species
real intercept[N_sp];
// Slope terms for each species
real slope_1[N_sp];
real slope_2[N_sp];
real slope_3[N_sp];
real slope_4[N_sp];
real slope_5[N_sp];
real slope_6[N_sp];
real slope_7[N_sp];
real slope_8[N_sp];
real slope_9[N_sp];
real slope_10[N_sp];
real slope_11[N_sp];
real slope_12[N_sp];
real slope_13[N_sp];
// Mean of intercepts
real mean_intercept;
// Mean of slopes
real mean_slope;
// Scale of intercept and slope terms (common to all species)
real<lower=0> mean_scale;
}

model {
// Likelihood, with species-specific intercept and slope
for (i in 1:N) {
Occ[i] ~ bernoulli_logit(intercept[sp[i]] + 
                          slope_1[sp[i]] * p1[i] + 
                          slope_2[sp[i]] * p2[i] + 
                          slope_3[sp[i]] * p3[i] + 
                          slope_4[sp[i]] * p4[i] +  
                          slope_5[sp[i]] * p5[i] + 
                          slope_6[sp[i]] * p6[i] + 
                          slope_7[sp[i]] * p7[i] + 
                          slope_8[sp[i]] * p8[i] + 
                          slope_9[sp[i]] * p9[i] + 
                          slope_10[sp[i]] * p10[i] + 
                          slope_11[sp[i]] * p11[i] + 
                          slope_12[sp[i]] * p12[i] + 
                          slope_13[sp[i]] * p13[i]); 
}


// Priors
// Species-specific intercept and slope priors, with common 
// location and scale. 
for (j in 1:N_sp) {
  intercept[j] ~ cauchy(mean_intercept, mean_scale);
  slope_1[j] ~ cauchy(mean_slope, mean_scale);
  slope_2[j] ~ cauchy(mean_slope, mean_scale);
  slope_3[j] ~ cauchy(mean_slope, mean_scale);
  slope_4[j] ~ cauchy(mean_slope, mean_scale);
  slope_5[j] ~ cauchy(mean_slope, mean_scale);
  slope_6[j] ~ cauchy(mean_slope, mean_scale);
  slope_7[j] ~ cauchy(mean_slope, mean_scale);
  slope_8[j] ~ cauchy(mean_slope, mean_scale);
  slope_9[j] ~ cauchy(mean_slope, mean_scale);
  slope_10[j] ~ cauchy(mean_slope, mean_scale);
  slope_11[j] ~ cauchy(mean_slope, mean_scale);
  slope_12[j] ~ cauchy(mean_slope, mean_scale);
  slope_13[j] ~ cauchy(mean_slope, mean_scale);
}

// Priors for common location and scale. 
// Using Cauchy(0, 2.5) distribution
// following Bob Carpenter's MSOM Case Study.
// https://mc-stan.org/users/documentation/case-studies/dorazio-royle-occupancy.html
mean_intercept ~ cauchy(0, 2.5);
mean_slope ~ cauchy(0, 2.5);
mean_scale ~ cauchy(0, 2.5); 

}

There are a few places here to improve efficiency as well as reduce divergences:

  • Work with a matrix of predictors and a matrix of slopes, rather than many separate vectors and real[N_sp] arrays
  • Hierarchical models like this (generally) benefit from a non-centered parameterisation, which you could implement very simply by changing the cauchy priors on the intercept and slope parameters to normal and using the offset and multiplier syntax in the declaration
  • The likelihood can then be computed in a vectorised way using columns_dot_product

Putting this all together, your model could look like

data {
  int N;  // Number of rows in data
  int N_sp; // Number of species (324)
  int sp [N]; // Vector length N of species ID codes
  int<lower=0, upper=1> Occ [N]; // Vector length N of 0/1 occupancy

  // Predictors, each a vector of length N
  vector [N] p1;
  vector [N] p2;
  vector [N] p3;
  vector [N] p4;
  vector [N] p5;
  vector [N] p6;
  vector [N] p7;
  vector [N] p8;
  vector [N] p9;
  vector [N] p10;
  vector [N] p11;
  vector [N] p12;
  vector [N] p13;
}

transformed data {
  // Pack the input vectors into a single matrix with
  // a column for each individual
  //  - more efficient to extract columns than rows
  matrix[13,N] x = [p1', p2', p3', p4', p5',
                     p6', p7', p8', p9', p10',
                     p11', p12', p13'];
}

parameters {
  // Mean of intercepts
  real mean_intercept;
  // Mean of slopes
  real mean_slope;
  // Scale of intercept and slope terms (common to all species)
  real<lower=0> mean_scale;

  // One intercept for each species
  // Use non-centered parameterisation for intercept parameter
  row_vector<offset=mean_intercept, multiplier=mean_scale>[N_sp] intercept;

  // Estimate species coefficients as a single matrix also
  // with non-centered parameterisation
  matrix<offset=mean_slope, multiplier=mean_scale>[13, N_sp] beta;
}

transformed parameters {
  // Two components of the columns_dot_product call to understand:
  //  - beta[:,sp] - This creates a 13xN matrix by extracting the column
  //                   for each observation as specified by the sp array
  //  - columns_dot_product - By computing the dot_product of the respective
  //                   columns of x and beta[:,sp] this is computing the
  //                   multiplication of that individual's covariate values
  //                   with their species-specific slopes, and then aggregating
  row_vector[N] theta = intercept[sp] + columns_dot_product(x, beta[:,sp]);
}

model {
  // Priors for common location and scale. 
  // Using Cauchy(0, 2.5) distribution
  // following Bob Carpenter's MSOM Case Study.
  // https://mc-stan.org/users/documentation/case-studies/dorazio-royle-occupancy.html
  mean_intercept ~ cauchy(0, 2.5);
  mean_slope ~ cauchy(0, 2.5);
  mean_scale ~ cauchy(0, 2.5); 

  intercept ~ normal(mean_intercept, mean_scale);
  // Need to use `to_vector()` on `beta` as the normal
  // likelihood doesn't have a signature for matrices
  to_vector(beta) ~ normal(mean_slope, mean_scale);

  // Likelihood, with species-specific intercept and slope
  Occ ~ bernoulli_logit(theta);
}
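For intuition about what columns_dot_product(x, beta[:,sp]) is doing, here is a small numpy sketch (my own illustration with toy dimensions, not part of the Stan model) showing that it matches the per-observation loop in the original model:

```python
import numpy as np

# Toy dimensions: 3 predictors, 4 observations, 2 species
K, N, N_sp = 3, 4, 2
rng = np.random.default_rng(1)
x = rng.normal(size=(K, N))        # predictors, one column per observation
beta = rng.normal(size=(K, N_sp))  # slopes, one column per species
sp = np.array([0, 1, 1, 0])        # species index per observation (0-based here)

# beta[:, sp] gathers the slope column for each observation's species;
# the column-wise dot product then multiplies each observation's covariates
# by its own species' slopes and sums them.
theta = np.sum(x * beta[:, sp], axis=0)

# Equivalent loop, matching the original model's likelihood term
theta_loop = np.array([x[:, i] @ beta[:, sp[i]] for i in range(N)])
assert np.allclose(theta, theta_loop)
```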


@jsocolar

@TBL welcome!

Given that this occupancy model assumes the probability of detection is 1, you could treat this as a logistic regression problem. The easiest way to implement logistic regression in Stan may be via brms. Using brms would give you more flexibility to experiment with different model forms, and provide some optimizations to the underlying Stan program for free.
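As a sketch of what that could look like (assuming the data are in a long data frame, here called plant_data, with a species column and the predictors as columns; only a few predictors shown for brevity):

```r
library(brms)

# Species-specific intercepts and slopes as varying effects
fit <- brm(Occ ~ p1 + p2 + p3 + (1 + p1 + p2 + p3 | species),
           data = plant_data, family = bernoulli())
```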


Even if this were an occupancy model with multi-visit detection histories, unless you add some additional fancy structure the model does not pay attention to whether or not histories involving different species are co-localized at the same site. There are classes of model that do account for this (see So Many Variables: Joint Modeling in Community Ecology - ScienceDirect), but they are pretty challenging to grapple with. Such models can be paired with either an occupancy-model likelihood (involving a detection history) or a logistic regression likelihood (as in the linked paper). These implementations have been wrapped into the spOccupancy package (Single-Species, Multi-Species, and Integrated Spatial Occupancy Models • spOccupancy).

AFAIK the authors of that package aren’t active on this forum; @prototaxites might be our best resident expert with these models.

Edit: to be clear, I don’t actually recommend that you pursue these joint models unless you have a specific reason to believe that they are exactly what you need.


If you have multiple records of a species at a site across multiple rows, these ‘duplicates’ will provide additional information towards the parameter estimates for each environmental predictor, regardless of whether site identity is included in the model or not.

Regarding site identity, I think the most important consideration (before even considering fancy latent variable structures over, say, a simple species-by-site intercept) is the task you’re planning to put the model to. If you’re only interested in the sites that you already have data for, then including additional parameters that can capture site-specific variation could potentially improve the precision of your model. However, the trade-off is generalisability - it’s harder to extend the predictions beyond the dataset as it requires marginalising out the site-specific variables, which is not necessarily easy, especially for non-Gaussian likelihoods. Thus, if your aim is to predict occupancy across, say, a landscape, you’re probably better off without site-specific terms, but the additional records will still inform your model.
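If you did want to try a simple site intercept, a minimal sketch against @andrjohns’ model above could look like the following (the site data, site_sigma and site_effect names are my additions):

```stan
data {
  int N_site;       // Number of sites (128)
  int site [N];     // Vector length N of site ID codes
  // ... existing data ...
}
parameters {
  real<lower=0> site_sigma;
  // Non-centered site intercepts via the multiplier syntax
  vector<multiplier=site_sigma>[N_site] site_effect;
  // ... existing parameters ...
}
model {
  site_sigma ~ cauchy(0, 2.5);
  site_effect ~ normal(0, site_sigma);
  // Add the site term to the linear predictor
  // (transposed so it matches the row_vector intercept)
  Occ ~ bernoulli_logit(intercept[sp] + site_effect[site]'
                        + columns_dot_product(x, beta[:, sp]));
}
```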


Thanks to everyone for their ideas and suggestions, it’s really appreciated.

@andrjohns this code is great, thank you. I wanted to test it and make sure I could recover the species-specific estimates for the separate beta parameters before I replied. Model run time with 1k iterations is down from 4 hours to 10 minutes, and with no divergence issues etc.

I hope you don’t mind a few follow-up questions? The first is directly about your code, the other two are more general.

  1. How can I incorporate interaction terms in the matrix? By directly multiplying the predictors? e.g., if I wanted an interaction between p1 and p2:
transformed data {
  matrix[14,N] x = [p1', p2', p3', p4', p5',
                     p6', p7', p8', p9', p10',
                     p11', p12', p13', (p1 .* p2)'];
}

Or could I now drop the initial p1' and p2', akin to interactions in lme4 syntax?

  2. I didn’t know about non-centred parameterisation / offsets / multipliers and hierarchical models; can you recommend any good sources for reading about this? I’ve read the section in the Stan User Guide, but I’d like to understand a bit more.

  3. What is a reasonable number of iterations? I’ve read that HMC’s efficiency allows lower numbers than JAGS etc., but what’s considered a reasonable amount? Will anyone (reviewers!) query using Stan’s 2k default? Do I need to go higher if the model is already performing well?


@mbjoseph Thanks for the brms tip. This model is part of a larger project that will involve other MSOMs with detection histories that I planned to implement in Stan, so it seems neater to stick with a single approach. I was starting with this model as I thought it would be an easier way to get into using Stan…


@jsocolar and @prototaxites for my circumstances it sounds like I’m better sticking with the current structure and avoiding site-specific terms. Thanks for your insights, I’m still getting my head round these models.


Great to hear that it’s improved so much!

How can I incorporate interaction terms in the matrix?

You would need to create a new column that is the product of the two predictors. The easiest approach here is to use the model.matrix function in R to create this for you, using standard R formula syntax.

As a toy example, say you had a model with one outcome and two predictors, as well as the interaction between them:

y ~ x1 * x2

You can create the matrix of predictors directly using:

# Generate some fake data
input_data <- data.frame(
  y = rnorm(50),
  x1 = rnorm(50),
  x2 = rnorm(50)
)

# Request no intercept column, since it's modelled separately
p <- model.matrix(y ~ 0 + x1 * x2, data = input_data)

> head(p)
           x1          x2        x1:x2
1 -0.08761458 -0.29990177  0.026275767
2 -0.34420175 -0.37392924  0.128707100
3  0.89466650  1.45600468  1.302638606
4 -0.11134544  0.04162515 -0.004634771
5 -1.49181654 -0.28055218  0.418532379
6  0.43617859 -1.17201359 -0.511207230

To pass this to Stan, you first transpose the result (because we want a column for each observation), and replace the vector[N] p* inputs with the single matrix:

stan_input <- t(p)

stan_data <- list(
  N = ncol(stan_input),
  J = nrow(stan_input),
  x = stan_input
)

Where the data section of the Stan model would include:

data {
  int N;
  int J;
  matrix[J, N] x;
  ...
}

I didn’t know about non-centred parameterisation / offsets/ multipliers and hierarchical models, can you recommend any good sources for reading about this? I’ve read the section in the Stan User Guide, but I’d like to understand a bit more.

I’m not too familiar with other literature introducing the concepts, so I don’t have any good recommendations here sorry.

What is a reasonable number of iterations? I’ve read that HMC’s efficiency allows lower numbers than JAGS etc., but what’s considered a reasonable amount? Will anyone (reviewers!) query using Stan’s 2k default? Do I need to go higher if the model is already performing well?

In most cases 2000 iterations will be sufficient, as long as the sampling diagnostics are satisfied. You want enough iterations to have sufficiently explored the posterior, as well as stable sampling diagnostics. The R-hat statistic across multiple chains gives you an indication of whether the posterior has been explored (but only an indication). This is supplemented by the effective sample size statistics, which (to put it very roughly) gives you an estimate of the number of iterations that provided ‘new’ information about the posterior. So it’s more about these diagnostics rather than a rule-of-thumb number.
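As a quick sketch of checking these diagnostics (assuming a cmdstanr fit object named fit and the posterior package):

```r
library(posterior)

# The default summary includes rhat, ess_bulk and ess_tail
# for every parameter
summarise_draws(fit$draws())
```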

Depending on the complexity/runtime of the model, it can also be a good idea to try running a model with an increased number of iterations (e.g., double) to see if the estimates or diagnostics change due to a previously-unexplored area of the posterior.


Thanks so much for your help, @andrjohns !

Chiming in to offer another reference on the non-centered parameterisation. The case study by @betanalpha is a pretty good starting point for why/how the non-centered parameterisation works in the context of HMC. Keep in mind that the case study pre-dates the multiplier and offset syntax employed in @andrjohns’ code above, but the idea is pretty much the same.

I’m sure there are many (many) other references scattered about the Stan documentation, case studies and tutorials. The NCP seems to address a number of model pathologies.
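For reference, the offset/multiplier declaration is shorthand for the explicit non-centered form, which is what the older material describes. A generic sketch (names are mine):

```stan
parameters {
  real mu;
  real<lower=0> sigma;
  vector[N_sp] theta_raw;  // standardised effects
}
transformed parameters {
  // Shift and scale: theta is distributed normal(mu, sigma),
  // but the sampler works on the standardised theta_raw
  vector[N_sp] theta = mu + sigma * theta_raw;
}
model {
  theta_raw ~ std_normal();
}
```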


Thanks, @jgoldberg.

Just a warning: each individual context within a hierarchical model is parameterized separately, and in general they may have to be parameterized differently to achieve optimal performance. I discuss these details in the case study linked above.

Interactions are a different beast altogether, as they arise only when considering multiple factors with overlapping levels. The standard methods automated in tools like brms are only approximations, and as such they have to be handled carefully. In other words, there’s a lot of hidden complexity, which is harder to avoid when you start building models on your own. For a thorough discussion of these issues with lots of example code see Impact Factor. It’s long but builds all of the concepts up piece by piece.


Many thanks for the links @jgoldberg and @betanalpha, you’ve given me lots to read and digest! I will see how I get on.