Simulating fake data for multi-species occupancy model (Dorazio et al's butterflies)

Hi everyone,

I’m in the process of learning to use STAN to build a multi-species, multi-site occupancy model but have a few questions with simulating fake data for my model.

My real data is similar in form to the example data collected and analyzed by Dorazio et al 2006 and reapproached with a STAN model written by Bob Carpenter here: Multiple Species-Site Occupancy Model. Eventually, I’d like to implement a model that also estimates a site-level covariate (i.e. the effect of habitat restoration on occupancy) and species-level covariate (i.e. the effect of species resource specialization on occupancy). However, since I’m still learning, I thought a good place to start would be to simulate fake data similar to the butterfly data (sites exchangeable and no species covariate) and aim to recover the parameters used to generate the fake data with the model written by BC. After that I’d hope to build on this to add site and species level covariates into the fake data simulation process and the model structure, and tailor the model to any other needs of my system, but I want to make sure I understand the basics first.

The main issue is that although the parameter values used to generate my fake data are more or less captured by the model estimates, there seems to be some bias in the output. Particularly, my input site-level detection effect (beta) is -1.75 (logit-scaled), but this true value is consistently at the low-edge of my model estimated 95% interval. Similarly my true rho value (correlation between occupancy and detection effects of species) is consistently at the high-edge of my model estimated 95% interval, and the estimated omega results in a consistent low estimate for my community size (E_N or E_N_2) of the model ouptut - where I know from my simulated data the true number of species in my simulated community.

I want to identify if there are any issues in my data simulation process that might be causing these slight mismatches. Has anyone else tried simulating data for this popular butterflies example? I will post my full data simulation code block below but a few places I might be going wrong:

  • I’m not sure if there’s a ‘best way’ to generate covariance in fake data, specifically in this example for covariance of species-level occupancy and detection. I generated occupancy effects for each species, calculated the z-score for each species, and then added that to the detection effect, as a result I get a correlation coefficient between the two of about 0.6. Seen in the line (where v is the pop mean species level effect on detection, sigma_v is the std deviation of species detection, uv_filtered[i, 2] is detection value for species i, and uv_filtered[i, 5] is the z-score for the same species occupancy value):
  uv_filtered[i, 2] <- rnorm(1, v, sigma_v) + (uv_filtered[i, 5]) # draw detection value from norm distr
  • after simulating my vector of species level occupancy and detection effects, the standard deviation for each of those vectors is a bit higher than the sigma parameter used to generate the vectors (and closer to the middle of my model’s estimate range; in the 50% interval rather than at the edges of the 97.5%). In general, should I hope to see my models capture the parameters used to generate the data (what I see as a population mean and standard deviation of species level effects) or the mean and variation of the sample of species level effects that was generated from that population mean and standard deviation? In practice here they are not that different and probably would be even less so with larger community sizes. Here I simulated species level occupancy effects for 65 species with a std dev of 1.25 (at the edge of the model estimate range for st dev of species level occ effect) but the standard deviation of the simulated vector is 1.39 (solidly in the 50% range of my model’s posterior probability distribution).

Thank you in advance for any help or advice! This is my first post on discourse so any suggestions for formatting or creating a clear question are appreciated.

My fake data simulation code (the STAN model is exactly the one written by Bob Carpenter for Dorazio’s butterflies Multiple Species-Site Occupancy Model):

### Multi-species occupancy model
## Jens Ulrich
## Project for FRST 509 course
## Constructing an occupancy model for multiple species


## drawing from example here, written by Bob Carpenter: 

### simulate test data

### Define study dimensions
N <- 120 # total number of species in the supercommunity
n_sites <- 20 # number of sites
n_surveys <- 15 # number of visits

## specify site-level occupancy and detection params
# here the sites are considered exchangeable and considered to share a single parameter
# for site-level occupancy and site-level detection
alpha <- 0.75 #  site-level occupancy (logit scaled)
beta <- -1.75 #  site-level detection (logit scaled)

# inv logit transformation function to view transformed values
ilogit <- function(u) { return(1 / (1 + exp(-u))); }

# just to view what the values translate to in log-odds
transformed_alpha <- ilogit(alpha) # site-level occupancy 
transformed_beta <- ilogit(beta) # site-level detection 

# There is a single parameter for the probability of a species being in the region
# Omega[element](0,1): availability of species
Omega <- 0.5
# define number of possible species occupying the sites
supercommunity_size <- N

# determine the number of species available
species_available <- vector()
for(i in 1:supercommunity_size){ # loop across each species in the supercommunity
  species_available[i] <- rbinom(1, 1, Omega)

# with the set seed should be 65 species
community_size <- sum(species_available > 0) # 0 == species not available

## simulate species-level effects
# for each species i, there are a pair of parameters for species-specific 
# occupancy and detection
# now simulate some species level occupancy (u) and detection (v) values
uv <- matrix(NA, nrow = community_size, ncol = 2)
# means for sigma u and sigma v will be centered at 0; these are not 
# estimated by the model written by BC, but in theory could be by tweaking
# the STAN model code if this ended up to be a variable of interest?
u <- 0 # species-level occupancy (logit scaled)
v <- 0 # species-level detection (logit scaled)
sigma_u <- 1.25 # std deviation of species occupancy
sigma_v <- 1.5 # std deviation of species detection

# species level occupancy
for(i in 1:community_size){ # for each species in the community, 
  uv[i, 1] <- rnorm(1, u, sigma_u) # draw occupancy value from norm distr

# Before simulating species level detection, implement 
# a process to generate correlation between u and v of each species

# use z scores to get a correlation between species detection and species occupancy
# first rank species occupancies by z-score
uv_filtered <- %>%
  mutate(sd = sd(V1),
         mean = mean(V1),
         z_score = (V1 - mean)/sd)

uv_filtered <- as.matrix(uv_filtered)

# add the z-score from corresponding species to the random detection to 
# draw detection value towards the occupancy value
for(i in 1:community_size){  # for each species in the community, 
  uv_filtered[i, 2] <- rnorm(1, v, sigma_v) + (uv_filtered[i, 5]) # draw detection value from norm distr

sd(uv_filtered[,1]) # should be 1.25
sd(uv_filtered[,2]) # was 1.5 but may be inflated from z score addition?
# if inflated from z-score, should this be the value I want my model to capture
# in the sigma_v estimate?

hist(uv_filtered[,1], xlim = c(-4,4)) # most species at about 50% of sites, some at 10 or 90
hist(uv_filtered[,2], xlim = c(-5,5))
plot(uv_filtered[,2] ~ uv_filtered[,1])
lm2 <- lm(uv_filtered[,2] ~ uv_filtered[,1])

# the output correlation between occupancy and detection is ~.613
rho_uv <- cor(uv_filtered[,1], uv_filtered[,2]) # correlation of species occupancy and detection

# and then each survey will need to be simulated for each 
# species across each site

# First simulation step is to simulate if the sites are occupied,
Z <- matrix(NA, nrow = community_size, ncol = n_sites)
for(i in 1:community_size){ # loop across species
  for(j in 1:n_sites){ # loop across sites
    Z[i, j] <- rbinom(1, 1, ilogit(uv_filtered[i,1] + alpha)) 
    # Z[i, j] <- rbinom(1, 1, ilogit(uv_filtered[i,1] + (ab[j,1]))) 
    # simulate occupancy states
    # for species i at site j, draw occupancy from a binomial dist with prob 
    # a[i] + u[i]

# and then simulation of the surveys at each site (given simulated occupancy)
det_data_3D <- array(NA, c(community_size, n_sites, n_surveys))
for(i in 1:community_size){ # loop across available species
  for(j in 1:n_sites){ # loop across sites
    for(k in 1:n_surveys){ 
      det_data_3D[i, j, k] <- Z[i, j] * rbinom(1, 1, ilogit(uv_filtered[i,2] + beta)) 
      # det_data_3D[i, j, k] <- Z[i, j] * rbinom(1, 1, ilogit(uv_filtered[i,2] + ab[j,2])) 
    } # multpiply by the occupancy state so that you can never detect species
    # if it is not present at the site

# Now add the values in the 3D array to get a single 2D matrix of summed detections
det_data_2D <- matrix(NA, nrow = community_size, ncol = n_sites)
for(i in 1:community_size){ # loop across available species
  for(j in 1:n_sites){ # loop across sites
    det_data_2D[i, j] <- sum(det_data_3D[i, j, 1:n_surveys])

# now visualize the simulated detection data that is dependent on 
# simulated occupancy states
# the first thing the authors do is reshape and plot the data itself
df_x2 <- melt(det_data_2D)
colnames(df_x2)[1] <- "species"
colnames(df_x2)[2] <- "site"
colnames(df_x2)[3] <- "detections"


# plot the data as in BC's example
# I chose more sites and more species to have a higher sample size
# notice 4 rows without any detections
# so only 61 species are detected from my simulated surveys
# but I would like my model to estimate 61 + 4 = 65 species as the total diversity
# that is, E_N & E_N_2 should ~65 and not 61.
# so my model should estimate (with these param values and seed values)
detections_heatmap_plot2 <-
  ggplot(df_x2, aes(as.factor(site), as.factor(species))) +
  geom_tile(aes(fill = detections), colour = "white") +
  scale_fill_gradient(low = "white", high = "black") +
  labs(x = "site number", y = "species number") +
  theme_bw() +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.line = element_line(colour = "black")) +
  ggtitle("Detections of Species at Sites over Visits")

print(row_sums <- rowSums(det_data_2D)) # total detections each species

# need to remove rows with no detections from simulated data in det_data_2D 
# for model to run, (those 4 species that are not recovered)
det_data_2D_df <- 
det_data_2D_df <- det_data_2D_df[apply(det_data_2D_df[,-1], 1, function(x) !all(x==0)),]

det_data_2D_matrix <- as.matrix(det_data_2D_df)

# data to feed to the model
x <- det_data_2D_matrix # detection data
n <- nrow(x) # 61 species detected (but there are 65 available)
J <- n_sites # number of sites - set at 20
K <- n_surveys # number of surveys - set at 15
S <- supercommunity_size # set to be 120

my_dat <- c("x", "n", "J", "K", "S") 

param_names <- c("alpha", "beta", "Omega", "var sp occ", "var sp det", "rho_uv", "E_N_2 / community size")
parameters <- c(alpha, beta, Omega, sigma_u, sigma_v, rho_uv, community_size)
targets <-, parameters))

# compile model
stan_model <- stan_model("./multispecies_occupancy_with_covariate.stan")

# run the model (seems to need at least ~8000 iter to get sufficient n_eff)
sim_fit <- sampling(stan_model, data = my_dat, 
                    iter = 10000, chains = 4)

Apologies if the following is old news to you and you simplified your simulation code for illustrative purposes:

In the code you provided it looks like you simulate a single dataset from a fixed seed. No matter how many times you fit the model to this fixed dataset, you should recover very similar posteriors (identical in the limit of lots of posterior iterations). Thus, it wouldn’t be surprising if the true values were consistently towards the high or low end of their posterior intervals. To check whether the model is well calibrated, you should instead re-simulate the data many times, and fit the model to these repeated, independently simulated datasets. This process is known as simulation based calibration, and Stan ecosystem in R includes the SBC package to simplify the process and visualize the output. See Simulation Based Calibration for rstan/cmdstanr models • SBC

A reliable and widely used multivariate normal RNG in R is provided by MASS::mvrnorm. You way is also a valid way to simulate from a multivariate normal, but it requires a bit more thinking to ensure that you are simulating from the population-level variance-covariance matrix that you intend.


Thanks for the tips and the quick response!

I’m fairly knew to the workflow of including simulation to calibrate my models and overlooked the consequences of the fixed seed. As soon as I can clear up server space I’ll try re-simulating many unfixed, independent datasets and fitting the model to each. I imagine this should solve the problem of what I was thinking of as a biased estimate.

I was not aware of the multivariate normal RNG. This is a much better route I think since it looks like it gives some control over the covariance, rather than my approach which relies on a coincidental level of correlation that can be quantified - but not specified in advance.