For an analysis I need to compare biodiversity of macroinvertebrate communities between different zones of a river. The metric we’re using to measure biodiversity is the number of observed species. Assume each zone has fifteen survey locations, at which it is possible to observe up to 80 different species. The point estimate of this metric is calculated by counting the number of species observed at each location in a zone, then taking the arithmetic mean.
During testing with synthetic data, I have found that my estimates of the structural zero probability \theta are positively biased if \theta_\textrm{true} is near zero, and negatively biased if \theta_\textrm{true} is near one. The average bias is small for each \theta, but that bias compounds into appreciable error when calculating the posterior distribution of the expected number of observed species.
The Stan code (see end of post) is largely derived from the ZIP example given in the Stan User’s Guide. A siloed ZIP model is fit to each unique combination of zone and species.
Example fit to synthetic data
The true values of theta are shown below. Zones 1 through 3 increase in biodiversity.
The individual \lambda and \theta estimates seem to be okay, but some averaging shows that median posterior samples of \theta have consistent bias.
Estimation Error Summary Statistics:
==================================================
THETA ERRORS:
Zone 1: Mean=-0.0169, Std=0.0877
Zone 2: Mean=-0.0000, Std=0.1255
Zone 3: Mean=0.0173, Std=0.1001
LAMBDA ERRORS:
Zone 1: Mean=0.5983, Std=8.7731
Zone 2: Mean=-0.4314, Std=6.1236
Zone 3: Mean=-0.0293, Std=1.9281
It can be shown that the expected number of observed species R is given by:
where T=80 is the number of species (taxa) that we can possibly observe. The true value of R is shown in the plots below as a green line, along with the posterior distribution. No warning messages are returned by Stan, and the .diagnose() method doesn’t indicate any issues, but the coverage of HDI estimates for R in Zone 1 (high \theta) and Zone 3 (low \theta) are poor.
What I’ve tried
I’ve tried changing the prior for \theta from beta(1, 1) to beta(0.7, 0.7) to correct the bias, but that didn’t entirely fix the issue (and it seems like there is probably a better way to improve the estimation). Any suggestions would be welcome.
Stan code
data {
int<lower=0> N;
int<lower=1> n_species;
int<lower=1> n_zones;
array[N] int<lower=0> y;
array[N] int<lower=1, upper=n_species> species_id;
array[N] int<lower=1, upper=n_zones> zone_id;
}
parameters {
array[n_species, n_zones] real<lower=0, upper=1> theta;
array[n_species, n_zones] real log_lambda;
}
transformed parameters {
array[n_species, n_zones] real<lower=0> lambda;
lambda = exp(log_lambda);
}
model {
// priors
for (s in 1:n_species) {
for (z in 1:n_zones) {
theta[s, z] ~ beta(1, 1);
log_lambda[s, z] ~ normal(log(30), 1);
}
}
// likelihood
for (n in 1:N) {
int s = species_id[n];
int z = zone_id[n];
if (y[n] == 0) {
target += log_sum_exp(log(theta[s, z]),
log1m(theta[s, z])
+ poisson_lpmf(y[n] | lambda[s, z]));
} else {
target += log1m(theta[s, z])
+ poisson_lpmf(y[n] | lambda[s, z]);
}
}
}




