Biased estimation of structural zero probability in zero-inflated Poisson

jackelsey · November 11, 2025, 10:01pm

For an analysis I need to compare biodiversity of macroinvertebrate communities between different zones of a river. The metric we’re using to measure biodiversity is the number of observed species. Assume each zone has fifteen survey locations, at which it is possible to observe up to 80 different species. The point estimate of this metric is calculated by counting the number of species observed at each location in a zone, then taking the arithmetic mean.

During testing with synthetic data, I have found that my estimates of the structural zero probability \theta are positively biased if \theta_\textrm{true} is near zero, and negatively biased if \theta_\textrm{true} is near one. The average bias is small for each \theta, but that bias compounds into appreciable error when calculating the posterior distribution of the expected number of observed species.

The Stan code (see end of post) is largely derived from the ZIP example given in the Stan User’s Guide. A siloed ZIP model is fit to each unique combination of zone and species.

Example fit to synthetic data

The true values of theta are shown below. Zones 1 through 3 increase in biodiversity.

The individual \lambda and \theta estimates seem to be okay, but some averaging shows that median posterior samples of \theta have consistent bias.

Estimation Error Summary Statistics:
==================================================
THETA ERRORS:
Zone 1: Mean=-0.0169, Std=0.0877
Zone 2: Mean=-0.0000, Std=0.1255
Zone 3: Mean=0.0173, Std=0.1001

LAMBDA ERRORS:
Zone 1: Mean=0.5983, Std=8.7731
Zone 2: Mean=-0.4314, Std=6.1236
Zone 3: Mean=-0.0293, Std=1.9281

It can be shown that the expected number of observed species R is given by:

\mathbb{E}[R] = \sum_{i=1}^{T} (1 - \theta_i)\left(1 - e^{-\lambda_i}\right)

where T=80 is the number of species (taxa) that we can possibly observe. The true value of R is shown in the plots below as a green line, along with the posterior distribution. No warning messages are returned by Stan, and the .diagnose() method doesn’t indicate any issues, but the coverage of HDI estimates for R in Zone 1 (high \theta) and Zone 3 (low \theta) are poor.

What I’ve tried

I’ve tried changing the prior for \theta from beta(1, 1) to beta(0.7, 0.7) to correct the bias, but that didn’t entirely fix the issue (and it seems like there is probably a better way to improve the estimation). Any suggestions would be welcome.

Stan code

data {
  int<lower=0> N;
  int<lower=1> n_species;
  int<lower=1> n_zones;
  array[N] int<lower=0> y;
  array[N] int<lower=1, upper=n_species> species_id;
  array[N] int<lower=1, upper=n_zones> zone_id;
}
parameters {
  array[n_species, n_zones] real<lower=0, upper=1> theta;
  array[n_species, n_zones] real log_lambda;
}
transformed parameters {
  array[n_species, n_zones] real<lower=0> lambda;
  lambda = exp(log_lambda);
}
model {
  // priors
  for (s in 1:n_species) {
    for (z in 1:n_zones) {
      theta[s, z] ~ beta(1, 1);
      log_lambda[s, z] ~ normal(log(30), 1);
    }
  }

  // likelihood
  for (n in 1:N) {
    int s = species_id[n];
    int z = zone_id[n];
    if (y[n] == 0) {
      target += log_sum_exp(log(theta[s, z]),
                            log1m(theta[s, z])
                              + poisson_lpmf(y[n] | lambda[s, z]));
    } else {
      target += log1m(theta[s, z])
                  + poisson_lpmf(y[n] | lambda[s, z]);
    }
  }
}

jsocolar · November 12, 2025, 3:01am

Sorry if I’m being dense, but what do the data y represent? Can you show the data simulation you’re using?

jackelsey · November 12, 2025, 1:56pm

Here’s an example I found online of the type of dataset I’m working with. These surveys are done to assess the ecological health of aquatic systems by evaluating how sediment contamination affects bottom-dwelling organisms. They either scoop up sediment at several locations, or deploy samplers like this. The samples are sent to a lab where someone counts all the different macroinvertebrate taxa that are present.

Here is the Python I’m using to generate synthetic data, if that’s helpful.

def generate_variable_richness_data(n_species=80, n_obs_per_combo=15):
    """
    Generate synthetic data for zero-inflated Poisson model with 3 zones of varying richness.
    
    Parameters:
    -----------
    n_species : int, number of species
    n_obs_per_combo : int, observations per species-zone combination
    
    Returns:
    --------
    dict : data dictionary ready for Stan model
    """
    
    n_zones = 3
    
    # Fixed seed for consistent parameters across function calls
    fixed_rng = np.random.default_rng(42)
    
    # Initialize parameter arrays
    theta_true = np.zeros((n_species, n_zones))
    lambda_true = np.zeros((n_species, n_zones))
    
    # Generate parameters for each zone with fixed seed
    for z in range(n_zones):
        if z == 0:  # Low richness zone
            theta_zone = fixed_rng.beta(3, 1, size=n_species)  # High theta (more zeros)
            lambda_zone = fixed_rng.lognormal(np.log(15), 0.8, size=n_species)
        elif z == 1:  # Medium richness zone
            theta_zone = fixed_rng.beta(2, 2, size=n_species)  # Moderate theta
            lambda_zone = fixed_rng.lognormal(np.log(25), 0.6, size=n_species)
        else:  # High richness zone
            theta_zone = fixed_rng.beta(1, 3, size=n_species)  # Low theta (fewer zeros)
            lambda_zone = fixed_rng.lognormal(np.log(35), 0.5, size=n_species)
        
        theta_true[:, z] = theta_zone
        lambda_true[:, z] = lambda_zone
    
    # Generate observations using original rng for different data realizations
    N = n_species * n_zones * n_obs_per_combo
    y = np.zeros(N, dtype=int)
    species_id = np.zeros(N, dtype=int)
    zone_id = np.zeros(N, dtype=int)
    loc_id = np.zeros(N, dtype=int)
    
    idx = 0
    for s in range(n_species):
        for z in range(n_zones):
            for l in range(n_obs_per_combo):
                species_id[idx] = s + 1  # Stan uses 1-indexed
                zone_id[idx] = z + 1     # Stan uses 1-indexed
                loc_id[idx] = l + 1      # Stan uses 1-indexed
                
                # Generate zero-inflated Poisson observation
                if rng.random() < theta_true[s, z]:
                    y[idx] = 0  # structural zero
                else:
                    y[idx] = rng.poisson(lambda_true[s, z])  # Poisson observation
                
                idx += 1
    
    # Prepare data for Stan
    data = {
        'N': N,
        'n_species': n_species,
        'n_zones': n_zones,
        'y': y.tolist(),
        'species_id': species_id.tolist(),
        'zone_id': zone_id.tolist(),
        'loc_id': loc_id.tolist()
    }
    
    # Store true parameters for comparison
    data['theta_true'] = theta_true
    data['lambda_true'] = lambda_true
    
    return data

jsocolar · November 12, 2025, 6:17pm

In the simulation, there is a hierarchical structure to the thetas and lambdas–thetas and lambdas from different zones are drawn from different distributions. In the modeling context, you don’t know what these distributions are, but you can reduce or eliminate bias by implementing some hierarchical structure to allow the model to estimate the shapes of these distributions.

That is, for the low richness zone, you’ve experimented with priors of beta(1,1) and beta(0.7,0.7), but the actual true values arise from beta(3,1). Allow the model to pool information over the low-richness-zone data to estimate a better hierarchical prior for this zone.

Said one final way–in your simulation, the top-level parameters aren’t the values of theta and lambda; they’re the parameters of the beta and lognormal distibutions. In your model, you aren’t giving your model a chance to estimate those top-level parameters at all; you’re fixing them to [1,1] for the beta distributions and [log(30), 1] for the lognormals. This mismatch is the source of the bias you’re seeing.

jackelsey · November 14, 2025, 4:48pm

Jacob, you are correct. Thank you for helping me, this was very instructive.

Adding hierarchical priors for the lambdas and thetas within each zone appears to have removed the estimation bias; HDI intervals for the expected number of observed taxa now have excellent coverage.

Topic		Replies	Views
Learning how to write a Zero Inflated Poisson model with stan? Modeling	10	3897	November 30, 2018
How to speed up estimation - zero inflated poisson? General	6	601	May 12, 2021
Zero one inflated beta regression in STAN Modeling	18	6690	January 14, 2020
Inconsistency in zero-inflated Poisson models? brms	3	1178	June 23, 2019
Centered parametrization throws errors in Zero inflated poisson regression Modeling	6	681	January 27, 2022

Biased estimation of structural zero probability in zero-inflated Poisson

Example fit to synthetic data

What I’ve tried

Stan code

Related topics