I have prepared three simulated count data sets with low, mid, and high levels of zero counts. Fitting these data sets in brms with the zero-inflated Poisson family gives good fits for the high and mid levels of zero counts, but not for the low level. My data consist of counts of birds of multiple species at multiple sites on multiple days. Most of the sets of species counts have high to extreme levels of zero counts. A few have low levels of zero counts, some even with the mean approximately equal to the variance, as expected from a true Poisson distribution. I want to fit a hierarchical model, grouped by species and site, with added covariates.
My question is: do I have to worry about fitting the few count sets that are not zero-inflated together with those that are zero-inflated?
Let me try to unpack the question a bit, because there are a few possible questions wrapped together here.
If I understand this part correctly, you are suggesting that you cannot recover known true parameter values when fitting simulated data (when the true rate of zero inflation is low). If that's an accurate summary, it must mean one of 5 things:
1. There's a bug in your simulation code.
2. There's a flaw in the way that you are concluding that parameter recovery is poor.
3. You are using a prior that is inconsistent with the true value.
4. There is something tricky about this posterior that makes it hard for Stan to sample. In this case, you would likely see some warnings from brms about divergences, poor ESS, bad R-hat, or something along those lines.
5. There's a bug in the brms/Stan code being used to fit the model.
If you aren't getting warned about a problematic fit, it's very likely that 1, 2, or 3 is the problem. If you can show a reproducible example that includes the data simulation and model fitting together in a single script that is runnable as is in a fresh R session, then we can run it and troubleshoot it.
If you have sets of counts that you expect to differ in their rates of zero-inflation, then you should model that explicitly, i.e. regress the zero-inflation probability itself on what set of counts a particular count belongs to.
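For example, a minimal brms sketch of that idea might look like the following, assuming a hypothetical factor `set` that labels which group of counts each observation belongs to:

```r
library(brms)

# A minimal sketch (the `set` factor is hypothetical): let the zero-inflation
# probability depend on which set of counts an observation belongs to.
fit <- brm(
  bf(count ~ 1 + sday,   # Poisson part (log link)
     zi ~ 1 + set),      # zero-inflation part (logit link) varies by set
  data   = my_data,
  family = zero_inflated_poisson()
)
```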
Hi Jacob,
Thanks for your very helpful response. I have been following your work.
I built the simulations in Python on the basis of Yamaura et al. 2011 for MSAMs and saved them as CSVs for loading into R. The resulting simulations seem to provide a good range of possible replicate counts. I used the ratio of count variance to count mean as a check of how close the sets are to a true Poisson distribution, so I don't think that the simulations are the problem.
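For reference, the equivalent check in R would look roughly like this (a sketch; the grouping column name is a placeholder):

```r
# Variance/mean ratio per set of counts: a ratio near 1 is consistent with a
# Poisson, values well above 1 indicate overdispersion and/or zero inflation.
library(dplyr)

my_data %>%
  group_by(species) %>%
  summarise(
    mean_count = mean(count),
    var_count  = var(count),
    ratio      = var_count / mean_count,
    prop_zero  = mean(count == 0)
  )
```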
I got R-hat = 1.00 in all cases, with reasonable ESSs (I think). Here is the summary of the problematic fit:
```
##  Family: zero_inflated_poisson
##   Links: mu = log; zi = logit
## Formula: count ~ 1 + sday
##          zi ~ 1
##    Data: my_data (Number of observations: 100)
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
##
## Population-Level Effects:
##              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept        0.33      0.23    -0.14     0.76 1.00     1585     2198
## zi_Intercept    -1.13      1.41    -2.69    -0.30 1.00     1015      462
## sday             0.01      0.04    -0.05     0.09 1.00     1697     1723
```
I used pp_check() to see the fit. Here is the graph:
The posterior predictive distribution has too many counts of 1 that are not present in the data. This suggests to me that the problem is in the Poisson part rather than in the zi part, as it looks like the fit has gone for a more purely Poisson distribution. I hope this point is clear. This is borne out by the variance/mean ratio of 1.15 for the predictive draws, compared to 1.45 for the original data. But then why is the zi intercept such a large negative value?
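For completeness, the checks I am describing are roughly the following (a sketch; `fit` is the brms model object):

```r
library(brms)

# Predicted vs observed count frequencies (bar plot is natural for counts)
pp_check(fit, type = "bars", ndraws = 100)

# Dispersion (variance/mean ratio) of replicated data vs the observed data
pp_check(fit, type = "stat", stat = function(y) var(y) / mean(y))
```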
Priors for these models (I have tried several) are a problem for me. I looked for suggestions online or in the literature, but couldn't find anything relevant. I'd appreciate it if you have any suggestions.
I'm a bit worried about separating the counts for particular species, as it suggests a kind of data manipulation. If I don't have a choice, I may have to, as long as I report the procedure. I believe that I can combine the models for the purpose of inference later. Am I correct?
I appreciate your input. I'm not sure how to share code. I'd be glad to continue by email, if that's possible.
All the best, Peter Geffen
If you can show your Python simulation code, the answer will become clearer. You can share code by pasting it straight into Discourse enclosed in triple backticks, as you've done for the R code and output above.
It appears that what's happening here is that the best-fitting zero-inflated Poisson has a relatively low rate of zero inflation (corresponding to roughly the inverse logit of -1.13, or 25%), and the remainder of the data are not perfectly fit by a Poisson, because they have too few ones compared to the Poisson expectation. Heuristically, the zero-inflated Poisson model is doing something like "decide how many zeros to model as zero-inflated such that the remainder of the data can be fit well by some Poisson, and then fit that Poisson" (in reality it doesn't do these steps sequentially, but rather jointly in one step). What you're seeing is that there's no proportion of zeros to remove that leaves the remainder of the data looking perfectly Poisson.
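To make that heuristic concrete, here is a small sketch of the zero-inflated Poisson pmf, evaluated at the rough point estimates from your summary (ignoring the small sday effect): zero inflation only adds mass at zero, so it cannot change the relative frequencies of the non-zero counts.

```r
# Zero-inflated Poisson pmf as a two-part mixture, where zi is the
# probability of a structural (extra) zero:
#   P(Y = 0) = zi + (1 - zi) * dpois(0, lambda)
#   P(Y = k) = (1 - zi) * dpois(k, lambda)   for k > 0
dzip <- function(y, lambda, zi) {
  ifelse(y == 0,
         zi + (1 - zi) * dpois(0, lambda),
         (1 - zi) * dpois(y, lambda))
}

# Rough point estimates from the summary above: lambda = exp(0.33), zi = plogis(-1.13)
round(dzip(0:5, lambda = exp(0.33), zi = plogis(-1.13)), 3)
```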
Unfortunately, there's no general-purpose prior that I can recommend on the Poisson part. My only advice is to remember that the model here is on the log scale, so make sure you are thinking in those terms about the priors (when you exponentiate wide/vague priors, things can get a bit crazy!). Also, be aware that by default, your prior on the intercept in brms will be a prior on the intercept when all predictors are held at their means, NOT when all predictors are set to zero. If you want to put a prior on the intercept when all predictors are set to zero, add 0 + Intercept to your model formulas. If you don't intend to model covariates on the zi part, a logistic(0, 1) prior can be a good default (this is uniform on the probability scale).
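Putting those pieces together, a sketch of what the prior specification could look like (the normal scales below are illustrative assumptions, not recommendations):

```r
library(brms)

priors <- c(
  # with 0 + Intercept, the intercept is a regular "b" coefficient,
  # so this prior applies at predictors = 0 (on the log scale)
  prior(normal(0, 1),   class = "b", coef = "Intercept"),
  prior(normal(0, 0.5), class = "b", coef = "sday"),
  # uniform on the probability scale for the zero-inflation intercept
  prior(logistic(0, 1), class = "Intercept", dpar = "zi")
)

fit <- brm(
  bf(count ~ 0 + Intercept + sday,
     zi ~ 1),
  data   = my_data,
  family = zero_inflated_poisson(),
  prior  = priors
)
```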
Fitting the models separately and combining later means that any shared covariates (e.g. in the Poisson part) get fitted separately and do not mutually inform across the separate models. So it's like fitting an interaction of species with all of the effects in the model. That's not necessarily a bad thing, but it is certainly different from fitting all species jointly. For what it's worth, it does not feel at all like data manipulation to me to notice that zero-inflation probabilities might vary by species and, if so, to model that. In your case, I would consider using a random effect like zi ~ (1 | species). However, it's hard for me to imagine a scenario where I would expect the zi part to vary strongly by species without also expecting the Poisson part to vary by species. In that case, I would try something like: count ~ 1 + sday + (1 + sday |g1| species), zi ~ 1 + (1 |g1| species)
Here, the |g1| is an example of the |<ID>| syntax from brms and is being used to model the random effects of species as correlated between the count and the zi formulas.
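As a concrete sketch of that full specification (same variable names as above):

```r
library(brms)

# |g1| ties the species-level effects together so they are estimated as
# correlated across the count and zi formulas.
fit <- brm(
  bf(count ~ 1 + sday + (1 + sday |g1| species),
     zi ~ 1 + (1 |g1| species)),
  data   = my_data,
  family = zero_inflated_poisson()
)
```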
Your response certainly gave me plenty to work with. I definitely expect to add covariates to the zi part, as it can represent the occupancy part of the model.
I have attached the Jupyter Notebook code file for the simulations to this email. I hope this works for you.
(Attachment f32 - simulation single-species occupancy model - yamaura 2011.ipynb is missing)
With the default priors, the zi_Intercept = -0.86, whose inverse logit is 0.297, which is much lower than the simulation's occupancy probability of 0.6. However, the detected count was calculated by multiplying the Poisson abundance by a detection probability, p = 0.2. Interestingly, when I multiply the occupancy probability of the simulation by the detection probability, the result is 0.3, close to the zi_Intercept probability.
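For reference, the back-transformation can be checked directly in R:

```r
plogis(-0.86)  # inverse logit of the zi intercept, approximately 0.297
```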
I calculated the mean, variance, and variance/mean ratio for the posterior predictive draws and for the data. Again, the mean of the predicted draws and the mean of the data are the same.