Hello everyone,
I’m a Stan novice trying to fit my first model, and I was hoping folk here could give me:
(a) some reassurance that the model is doing what I think it is, and
(b) pointers for improving model efficiency.
Background
I have occupancy data for 324 plant species at 128 sites, but only as presence/absence, without a detection history. The sites are small and the surveys were exhaustive, so the assumption that detection = 1 is hopefully fairly reasonable.
Ultimately, I’m looking to determine the influence of site context on species occupancy.
Model (see code below)
I want to model plant species’ occupancy as influenced by 13 potential predictors from various site and landscape factors (I’m hoping to drop some of these during model refinement).
I anticipate that different species will respond to different predictors, so I’m seeking species-specific intercept and slope terms.
I was planning to have some interactions as well, but was hoping to get the model working with just additive terms initially.
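In equation form, the linear predictor I’m aiming for (writing sp[i] for the species of row i) is:

logit(Pr(Occ[i] = 1)) = intercept[sp[i]] + slope_1[sp[i]] * p1[i] + … + slope_13[sp[i]] * p13[i]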
Questions
- Does this model do what I want it to? I extracted species-specific intercept and slope estimates from preliminary runs, although most have very wide CIs spanning 0. I have a bit of experience fitting MSOMs in JAGS with species’ detection histories, with the data in 3D arrays. I assumed that in this case I could simplify the data to a single long vector, because I only have a single presence/absence record for each species at each site. However, will the model still recognise that multiple species records come from the same site? Do I need to include some sort of site identifier in the model? Would this help to narrow the CIs? (I did try this with a 2D site x species matrix, but I struggled to get the code to work, and I’m not sure the output would have differed from the version below anyway. A sketch of the site-identifier idea follows these questions.)
- Can I make this model structure more efficient? A preliminary run with 1k iterations worked (run time ~4 hours), but had poor Rhat and ESS values. I increased to 10k iterations, but it was slow (the 4 chains took 50, 55, 74 and 107 hours respectively), and I think RStudio has been caught in some sort of fatal memory issue: the chains finished a week ago, but RStudio is still working on something. I’ve read that vectorising the code can improve efficiency, but when I’ve tried this I get an “expression ill formed” error, and I’m not certain why. Is it possible to vectorise this structure whilst retaining species-specific terms? (The sort of vectorised structure I’ve been attempting is sketched after the full model code below.)
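To illustrate the site-identifier question, this is roughly what I had in mind: a stripped-down, untested sketch with the slope terms omitted for brevity, assuming I also pass N_site = 128 and a vector of site ID codes in the data.

// Site-effect sketch (slopes omitted for brevity)
data {
  int<lower=0> N;                       // Number of rows in data
  int<lower=1> N_sp;                    // Number of species (324)
  int<lower=1> N_site;                  // Number of sites (128)
  int<lower=1, upper=N_sp> sp[N];       // Species ID code for each row
  int<lower=1, upper=N_site> site[N];   // Site ID code for each row
  int<lower=0, upper=1> Occ[N];         // 0/1 occupancy
}
parameters {
  real intercept[N_sp];                 // Species-specific intercepts
  real site_effect[N_site];             // Site-level adjustments, shared across species
  real mean_intercept;
  real<lower=0> intercept_scale;
  real<lower=0> site_scale;
}
model {
  for (i in 1:N)
    Occ[i] ~ bernoulli_logit(intercept[sp[i]] + site_effect[site[i]]);
  intercept ~ cauchy(mean_intercept, intercept_scale);
  site_effect ~ normal(0, site_scale);  // Sites partially pooled toward 0
  mean_intercept ~ cauchy(0, 2.5);
  intercept_scale ~ cauchy(0, 2.5);
  site_scale ~ cauchy(0, 2.5);
}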
Any suggestions and advice would be very welcome. Many thanks.
Code
// Model
data {
  int<lower=0> N;                  // Number of rows in data
  int<lower=1> N_sp;               // Number of species (324)
  int<lower=1, upper=N_sp> sp[N];  // Vector length N of species ID codes
  int<lower=0, upper=1> Occ[N];    // Vector length N of 0/1 occupancy
  // Predictors, each a vector of length N
  vector[N] p1;
  vector[N] p2;
  vector[N] p3;
  vector[N] p4;
  vector[N] p5;
  vector[N] p6;
  vector[N] p7;
  vector[N] p8;
  vector[N] p9;
  vector[N] p10;
  vector[N] p11;
  vector[N] p12;
  vector[N] p13;
}
parameters {
  // One intercept for each species
  real intercept[N_sp];
  // Slope terms for each species
  real slope_1[N_sp];
  real slope_2[N_sp];
  real slope_3[N_sp];
  real slope_4[N_sp];
  real slope_5[N_sp];
  real slope_6[N_sp];
  real slope_7[N_sp];
  real slope_8[N_sp];
  real slope_9[N_sp];
  real slope_10[N_sp];
  real slope_11[N_sp];
  real slope_12[N_sp];
  real slope_13[N_sp];
  // Mean of intercepts
  real mean_intercept;
  // Mean of slopes
  real mean_slope;
  // Scale of intercept and slope terms (common to all species)
  real<lower=0> mean_scale;
}
model {
  // Likelihood, with species-specific intercept and slope
  for (i in 1:N) {
    Occ[i] ~ bernoulli_logit(intercept[sp[i]] +
                             slope_1[sp[i]] * p1[i] +
                             slope_2[sp[i]] * p2[i] +
                             slope_3[sp[i]] * p3[i] +
                             slope_4[sp[i]] * p4[i] +
                             slope_5[sp[i]] * p5[i] +
                             slope_6[sp[i]] * p6[i] +
                             slope_7[sp[i]] * p7[i] +
                             slope_8[sp[i]] * p8[i] +
                             slope_9[sp[i]] * p9[i] +
                             slope_10[sp[i]] * p10[i] +
                             slope_11[sp[i]] * p11[i] +
                             slope_12[sp[i]] * p12[i] +
                             slope_13[sp[i]] * p13[i]);
  }
  // Priors
  // Species-specific intercept and slope priors, with common
  // location and scale.
  for (j in 1:N_sp) {
    intercept[j] ~ cauchy(mean_intercept, mean_scale);
    slope_1[j] ~ cauchy(mean_slope, mean_scale);
    slope_2[j] ~ cauchy(mean_slope, mean_scale);
    slope_3[j] ~ cauchy(mean_slope, mean_scale);
    slope_4[j] ~ cauchy(mean_slope, mean_scale);
    slope_5[j] ~ cauchy(mean_slope, mean_scale);
    slope_6[j] ~ cauchy(mean_slope, mean_scale);
    slope_7[j] ~ cauchy(mean_slope, mean_scale);
    slope_8[j] ~ cauchy(mean_slope, mean_scale);
    slope_9[j] ~ cauchy(mean_slope, mean_scale);
    slope_10[j] ~ cauchy(mean_slope, mean_scale);
    slope_11[j] ~ cauchy(mean_slope, mean_scale);
    slope_12[j] ~ cauchy(mean_slope, mean_scale);
    slope_13[j] ~ cauchy(mean_slope, mean_scale);
  }
  // Priors for common location and scale.
  // Using Cauchy(0, 2.5) distribution
  // following Bob Carpenter's MSOM Case Study.
  // https://mc-stan.org/users/documentation/case-studies/dorazio-royle-occupancy.html
  mean_intercept ~ cauchy(0, 2.5);
  mean_slope ~ cauchy(0, 2.5);
  mean_scale ~ cauchy(0, 2.5);
}
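For reference, this is the sort of vectorised structure I’ve been attempting, assuming the 13 predictors are packed into an N x 13 matrix X on the R side. It’s an untested sketch rather than a working replacement, so I’d welcome corrections.

// Vectorised sketch, with predictors as a matrix
data {
  int<lower=0> N;                  // Number of rows in data
  int<lower=1> N_sp;               // Number of species (324)
  int<lower=1> K;                  // Number of predictors (13)
  int<lower=1, upper=N_sp> sp[N];  // Species ID code for each row
  int<lower=0, upper=1> Occ[N];    // 0/1 occupancy
  matrix[N, K] X;                  // All predictors in one matrix
}
parameters {
  vector[N_sp] intercept;          // One intercept per species
  matrix[N_sp, K] slope;           // Row j holds the slopes for species j
  real mean_intercept;
  real mean_slope;
  real<lower=0> mean_scale;
}
model {
  // slope[sp] stacks the slope row for each data row's species, so
  // rows_dot_product gives the whole linear predictor in one call
  Occ ~ bernoulli_logit(intercept[sp] + rows_dot_product(slope[sp], X));
  intercept ~ cauchy(mean_intercept, mean_scale);
  to_vector(slope) ~ cauchy(mean_slope, mean_scale);
  mean_intercept ~ cauchy(0, 2.5);
  mean_slope ~ cauchy(0, 2.5);
  mean_scale ~ cauchy(0, 2.5);
}

The idea is that slope[sp] builds an N x K matrix whose row i holds the slopes for the species of row i, so the species-specific terms are retained while the per-row loop disappears.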