Brms -- Zero inflated binomial -- trials -- Presence absence data

Hi there,

Looking for some advice.

I have a dataset of presence-absence data: species survey, one visit only, is the species detected? yes / no.

Willing to use a zero_inflated_binomial distribution.

So something like:

m <- brm(presence ~ depth + slope_log + ice_sd + ssh_mean, data = dat, family = zero_inflated_binomial())

I get an error saying I need to use trials.
Given that we did only one visit to each site to assess the presence of the species… is it correct to use the following?

m <- brm(presence | trials(1) ~ depth + slope_log + ice_sd + ssh_mean, data = dat, family = zero_inflated_binomial())

Hi @Charley_Gros,

Welcome to the Stan discourse :)

I haven’t actually used this but afaik this is how you do it. You can use pp_check(m) after drawing samples from the model to see whether it reproduces the observed data well (as it should.)

1 Like

Yes, that syntax looks correct.

1 Like

Someone can correct me if I am off here, but this seems like a very strange model. What you are proposing is a mixture of Bernoulli models. This is the relevant Stan function for the zero_inflated_binomial in brms:

real zero_inflated_binomial_logit_lpmf(int y, int trials,
                                         real theta, real zi) {
    if (y == 0) {
      return log_sum_exp(bernoulli_logit_lpmf(1 | zi),
                         bernoulli_logit_lpmf(0 | zi) +
                         binomial_lpmf(0 | trials, theta));
    } else {
      return bernoulli_logit_lpmf(0 | zi) +
             binomial_lpmf(y | trials, theta);
    }
  }

When trials is always 1, then the binomial_lpmf part could be replaced by bernoulli, and now you simply have a mixture of bernoullis, which seems weird to me…
Try a simple simulation:

y <- rbinom(1000,1,0.1)
d <- cbind.data.frame(y)
m1 <- brm(y | trials(1) ~ 1, data=d, family=zero_inflated_binomial, cores=4)
m2 <- brm(y | trials(1) ~ 1, data=d, family=binomial, cores=4)
m1
m2
s <- as_draws_df(m1)
s2 <- as_draws_df(m2)

#proportion of 1 in data
mean(d$y)

#proportion estimated from the zero_inflated_binomial
mean(plogis(s$b_Intercept) * (1-s$zi))

#proportion estimated from the binomial
mean(plogis(s2$b_Intercept))

image
image
image

Note that the sampling is worse for the zi model (low ESS) and the SE is large for both the Intercept and zi parameters. This is because it is a mixture of two Bernoulli’s. You can get almost the same mean estimate for the proportion of 1’s in the data, but the interpretation is weird, because the Intercept is now the log-odds of 1 for the data excluding the zero inflated data.

Perhaps if you explicitly modeled the zi component, then it would be ok… I’m really not sure…but it does strike me as a bit weird. I’d be surprised the standard errors wouldn’t always be high as which model is any given zero assigned to?

1 Like