Zero-inflated poisson model for count data

I have prepared 3 simulated count data sets with low, mid and high levels of zero counts. Fitting models with these data sets in brms with the zero-inflated Poisson family gives good fits with the high and mid levels of zero counts, but not with the low levels of zero counts. My data consists of counts of birds of multiple species at multiple sites and on multiple days. Most of the sets of species counts have high to extreme levels of zero counts. A few have low levels of zero counts, even with mean equal to variance, as expected from a true Poisson distribution. I want to fit a hierarchical model, grouped by species and site, and with added covariates.

My question is: do I have to worry about fitting the few count sets that are not zero-inflated together with those that are zero-inflated?

fit1 <-
  brm(data = my_data,
      family = zifamily,
      formula = formula3,
      prior = zip_prior,
      iter = 2000, warmup = 1000, thin = 1, chains = 4, cores = 4,
      seed = 9
  	)

If possible, add also code to simulate data or attach a (subset of) the dataset you work with.

Operating System: Linux Ubuntu 22.04
Interface Version:
Compiler/Toolkit:

Iā€™m not sure what your exact question is, but Iā€™m hoping @jsocolar can help.

Hi @pitangus (and nice handle btw!).

Let me try to unpack the question a bit, because there are a few possible questions wrapped together here.

If I understand this part correctly, you are suggesting that you cannot recover known true parameter values when fitting simulated data (when the true rate of zero inflation is low). If thatā€™s an accurate summary, it must mean one of 5 things:

  1. Thereā€™s a bug in your simulation code
  2. Thereā€™s a flaw in the way that you are concluding that parameter recovery is poor
  3. You are using a prior that is inconsistent with the true value.
  4. There is something tricky about this posterior that makes it hard for Stan to sample. In this case, you would likely see some warnings from brms about divergences, poor ESS, bad r-hat, or something along those lines.
  5. Thereā€™s a bug in the brms/Stan code being used to fit the model.

If you arenā€™t getting warned about a problematic fit, itā€™s very likely that 1, 2, or 3 is the problem. If you can show a reproducible example that includes the data simulation and model fitting together in a single script that is runnable as is in a fresh R session, then we can run it and troubleshoot it.

If you have sets of counts that you expect to differ in their rates of zero-inflation, then you should model that explicitlyā€“i.e. regress the zero-inflation probability itself on what set of counts a particular count belongs to.

1 Like

Hi Jacob,
Thanks for your very response. I have been following your work.
I built the simulations in Python on the basis of Yamaura et.al. 2011 for MSAMā€™s and saved them as csvā€™s for loading in R. The resulting simulations seem to provide a good range of possible replicate counts. I used the ratio of count variance to count mean as a check of how close the sets are to a true Poisson distribution. So I donā€™t think that the simulations are the problem.
I got r hat = 1.00 in all cases, and with reasonable ESSā€™s (I think). Here is the summary the problematic fit:

Family: zero_inflated_poisson 
##   Links: mu = log; zi = logit 
## Formula: count ~ 1 + sday 
##          zi ~ 1
##    Data: my_data (Number of observations: 100) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept        0.33      0.23    -0.14     0.76 1.00     1585     2198
## zi_Intercept    -1.13      1.41    -2.69    -0.30 1.00     1015      462
## sday             0.01      0.04    -0.05     0.09 1.00     1697     1723

I used ppc_check to see the fit. Here is the graph:


The posterior has too many counts of ā€˜1ā€™, that are not present in the data. This suggests to me that the problem is in Poisson part, and not in the zi, as it looks like the fit has gone for a more pure Poisson distribution. I hope this point is clear. This is born out by the variance/mean ratio of 1.15, compared to 1.45 of the original data. But than why is the zi intercept such a large negative value?
Priors for these models (I have tried several) are a problem for me. I looked for suggestions online or in the literature, but couldnā€™t find anything relevant. Iā€™d appreciate it if you have any suggestions.
Iā€™m a bit worried about separating the counts for particular species as it suggests a type of manipulation of the data. If I donā€™t have a choice, I may have to, as long as I report the procedure. I believe that I can combine models for the purpose of inference later. Am I correct?
I appreciate your input. Iā€™m not sure how to share code. Iā€™d be glad to continue by email, if thatā€™s possible.
All the best, Peter Geffen

A correction: the variance/mean ratio of the predicted counts is 1.48, not 1.15, compared to 1.15, not 1.45, for the original data

If you can show your python simulation code, the answer will become clearer. You can share code by pasting it straight into discourse enclosed in triple backtics, as youā€™ve done for the R code and output above.

It appears that whatā€™s happening here is that the best-fitting zero-inflated poisson has a relatively low rate of zero inflation (corresponding to roughly the inverse logit of -1.13, or 25%), and then the remainder of the data are not perfectly fit by a Poisson, because they have too few ones compared to the Poisson expectation. Heuristically, the zero-inflated poisson model is doing something like ā€œdecide how many zeros to model as zero-inflated such that the remainder of the data can be fit well by some Poisson, and then fit that Poissonā€ (in reality it doesnā€™t do these steps sequentially, but rather jointly in one step). What youā€™re seeing is that thereā€™s no proportion of zeros to remove that leaves the remainder of the data looking perfectly Poisson.

Unfortunately, thereā€™s no general-purpose prior that I can recommend on the Poisson partā€“my only advise is to remember that the model here is on the log scale, so make sure you are thinking in those terms about the priors (when you exponentiate wide/vague priors, things can get a bit crazy!). Also, be aware that by default, your prior on the intercept in brms will be a prior on the intercept when all predictors are held at their means, NOT when all predictors are set to zero! If you want to put a prior on the intercept when all predictors are set to zero, add 0 + Intercept to your model formulas. If you donā€™t intend to model covariates on the zi piece, a logistic(0,1) prior can be a good default (this is uniform on the probability scale).

Fitting the models separately and combining later means that any shared covariates (e.g. in the Poisson part) get fitted separately and do not mutually inform across the separate models. So itā€™s like fitting an interaction of species with all of the effects in the model. Thatā€™s not necessarily a bad thing, but it is certainly different than fitting all species jointly. For what itā€™s worth, it does not feel at all like data manipulation to me to notice that zero-inflation probabilities might vary by species, and if so to model that. In your case, I would consider using a random effect like zi ~ (1 | species). However, itā€™s hard for me to imagine a scenario where I would expect the zi part to vary strongly by species without also expecting the Poisson part to vary by species. In that case, I would try something like:
count ~ 1 + sday + (1 + sday |g1| species), zi ~ 1 + (1 |g1| species)

Here, the |g1| is an example of the |<ID>| syntax from brms and is being used to model the random effects of species as correlated between the count and the zi formulas.

Hope some of this helps!

Your response certainly gave me plenty to work with. I definitely expect to add covariate to the zi part as it can represent the occupancy part of the model.

I have attached the Jupyter Notebook code file for the simulations to this email. I hope this works for you.

(Attachment f32 - simulation single-species occupancy model - yamaura 2011.ipynb is missing)

Sending a notebook file is not authorised. here is the python script:
f32 - simulation single-species occupancy model - yamaura 2011.py (4.5 KB)

With the default priors, the zi_Intercept = -0.86 and the inverse logit is 0.297, which is much lower than the occupancy probability of the simulation of 0.6. However, the detected count was calculated by multiplying the Poisson abundance by a detection probability, p = 0.2. Interestingly, multiplying the occupancy probability of the simulation by the detection probability the result is 0.3, close to the zi_Intercept probability.

The mean, variance and ratio calculated of predicted draws and of the data. Again the mean of the predicted and the data are the same.

mean variance ratio
predicted 1.09 1.68 1.55
data 1.09 1.25 1.15

The problem is that the Poisson part does not fit