Zero inflated mediator as a latent variable in brms

As part of a review of a journal article, I’m trying to puzzle through thinking about zero inflated mediators. Given work I’ve been doing with SEM, it seems like we should be able to think about them as latent variables, no? Where the observed variable is a single indicator with a path fixed to 1, no intercept, and we apply a zero inflated distribution.

Consider the following example. Let’s say sea surface temperature influences sea mink abundance which influences snail abundance (hungry hungry sea minks! sadly now exctinct).

BUT - our observations of sea minks is zero inflated.

library(dplyr)
library(brms)

set.seed(31415)

dat <- tibble(
  sst = rnorm(100, 15, 4),
  seaminks = rpois(100, exp(sst/10-2)),
  seaminks_obs = ifelse(runif(100)>0.2, seaminks, 0),
  snails = rpois(100, exp(-1*seaminks+3)),
  true_seaminks = as.numeric(NA),
  seaminks_nonzero = as.numeric(seaminks_obs>0)
)

The model I would conceive of is

sst → seaminks → snails, all relationships poisson
seaminks → 1*seaminks_obs, zero inflated poisson

Using Full Luxury Bayesian Structural Equation Modeling with brms as a guide for latent variables, I’d put together a model more or less like this:

###
# ZI mediation in brms with LV
###
seamink_mod <- 
  bf(true_seaminks | mi() ~ sst, family = poisson(link = "log")) +
  bf(seaminks_obs ~ 0 + mi(true_seaminks), family = zero_inflated_poisson()) +
  bf(snails ~ mi(true_seaminks), family = poisson(link = "log")) +
  set_rescor(FALSE)


seamink_prior <- 
  prior(constant(1), 
        coef = "mitrue_seaminks",
        resp = "seaminksobs", 
        class = "b") +
  prior(gamma(11,11), 
        class = "sigma",
        resp = "trueseaminks")

This doesn’t work, however, as we can’t have mi() working with a poisson distribution.

So instead

###
# ZI mediation in brms with LV
###
seamink_mod <- 
  bf(true_seaminks | mi() ~ sst, family = gaussian(link = "log")) +
  bf(seaminks_obs ~ 0 + mi(true_seaminks), family = zero_inflated_poisson()) +
  bf(snails ~ mi(true_seaminks), family = poisson(link = "log")) +
  set_rescor(FALSE)


seamink_prior <- 
  prior(constant(1), 
        coef = "mitrue_seaminks",
        resp = "seaminksobs", 
        class = "b") +
  prior(gamma(11,11), 
        class = "sigma",
        resp = "trueseaminks")

seamink_brm <- brm(seamink_mod,
                   prior = seamink_prior,
                   data = dat,
                   chains = 2)

This samples, although not super well.

I’m curious about two things

  1. Does this seem like an interesting way forward, and how can we make it sample better (some of the coefficients here have posteriors that are still too narrow - but I think that’s due to poor sampling of the posterior).

  2. Is this the right way to deal with an endogenous count latent variable, or is there a better approach?


Note - for completeness, there is another suggestion in the literature about zero inflated endogenous variables in SEM. I’ve implemented it below, and it works pretty well. One should be able to have seaminks_nonzero also predicted by seaminks_obs and sst, but, I kept getting wild coefficients there, which I suspect means we need good priors for that. The idea is the following model which just is currently missing
seaminks_nonzero (1,0) ~ sst + seaminks_obs which would be… bernouli?

###
# ZI mediation in brms with 
# Jian et al approach
# https://pmc.ncbi.nlm.nih.gov/articles/PMC10686235/
###


seamink_jiang_mod <- 
  bf(seaminks_obs ~ sst, family = zero_inflated_poisson()) +
  bf(snails ~ seaminks_nonzero + seaminks_obs, family = poisson(link = "log")) +
  set_rescor(FALSE)


seamink_jiang_brm <- brm(seamink_jiang_mod,
                         data = dat,
                         chains = 2)

seamink_jiang_brm

I’ve got a some quick comments aimed at making sure that any ensuing discussion really answers the right question for you.

The data simulation has seaminks_obs = ifelse(runif(100)>0.2, seaminks, 0). This says that the true seamink population is either observed exactly or you don’t observe a single individual, which seems implausible.

Later you write

This seems to suggest that the number of observed seaminks is ZIP distributed around the true number of seaminks (but maybe I’m misunderstanding the notation). This shouldn’t be the case, as the observed number presumably should never exceed the true population size. Something like a zero-inflated binomial observation model might be more appropriate. Or else maybe you mean that you either observe all of them or none of them, without additional Poisson variation?

Your log-link gaussian model assumes that there is some true density from which we observe a zero-inflated Poisson sample. This can be thought of as approximating the Poisson with a gaussian, or it can also be thought of as supposing that the true population density of seaminks is not a simple count, but rather is some average density over time, from which the Poisson observation (assuming we are not in the zero-inflated state) arises as the combination of some sampling process that gives us the number of individuals actually present as well as some observation process that determines the number of individual observed.

Lastly, the title of the post suggests that the latent true number of seaminks is itself intended to be zero-inflated (i.e. the zero-inflation arises at the level of the process and affects snails, rather than arising only at the level of the observation without affecting snails). This is different again from any of the models above.

The difficulty of using mi with poisson and/or zero-inflated latents is that Stan cannot handle discrete parameters; these need to be marginalized out. mi parameterizes the latents explicitly (i.e. without marginalizing), and therefore cannot handle discrete latents.

If we can pin down which model is intended here, we can almost certainly find a way to fit it, whether using brms (restricted to the case of continuous latents), or in raw Stan, where we can construct the necessary marginalizations ourselves.

Thanks for this! Let’s see

This says that the true seamink population is either observed exactly or you don’t observe a single individual, which seems implausible.

Yes, I could have added more sources of error. But, I was trying to keep the example simple. The not at all is a detection probability thing - I’m pondering this in the context of wildlife biology. Maybe you could observe the seaminks. Maybe it was a stormy day and they were all in their burrows so you could not. Not implausible at all! And classic zero inflation.

I am not implying the true number is zero inflated - merely the observed number due to detection probability issues.

To sum up and provide clarity, the true DAG is

sst → seaminks → snails

Seamink observations are ZIP. with either observing them or not observing any.

I do think the alternate scenario, where each seamink has an observation probability (they’re slippery little guys!) is also of interest, but not what I’m trying to handle here. Although would likely have a similar solution, I think. The problem again is wanting to use the correct predictor for the seamink->snail relationship, and a zero inflated variable would lead to a biased answer.

This seems to suggest that the number of observed seaminks is ZIP distributed around the true number of seaminks (but maybe I’m misunderstanding the notation). This shouldn’t be the case, as the observed number presumably should never exceed the true population size. Something like a zero-inflated binomial observation model might be more appropriate. Or else maybe you mean that you either observe all of them or none of them, without additional Poisson variation?

Huh - yeah, that’s interesting. I’m trying to think of the right way to incorporate the observed number as observed and the true number as latent here. I am going with the all or nothing approach here. Which, granted, is simplistic. I’m sure there are more nuanced ways to model.

I like the ZIB approach, though, with trials = number of seamink. In many ways, while it wouldn’t match my simulation, it would be a more interesting description of the problem - each seamink has an individual observation probability. Which is different than just the probability of no observations. The former wouldn’t necessarily create a zero inflated distribution. But would be worth modeling for sure!

HA - we are in deep detection probability territory. Let’s stick with the zero inflation for the moment, though, as it has applications other than just the detection probability problem.

Are you suggesting that the latent true seamink should isntead work as follows

true_seaminks → seaminks_nonzero, bernouli (as it’s just one coin flip)

Hrm. I feel like we can still use seaminks_obs as a second indicator…

This is fascinating.

Oh, and, I am trying to do this in brms if possible (trying to inform a review for a paper using brms) rather than going to raw stan.

If we are constraining this to the setting where we can use brms, then we require the true underlying seamink density (the thing that affects snails) to be continuous. As long as that assumption is palatable, then something like your log-link gaussian model should be what you want. Note that if you believe the true population size is close to Poisson-distributed, then you would want to model the standard deviation of the gaussian in such a way that it can scale appropriately with the predicted value. Then the observation arises as a zero-inflated Poisson from the true underlying density at the site.

Yup - that is correct. I should include modeling the variance as well… But, still curious what you think about the solution to seaminks as a latent in order to predict snails.