Brms: does the lognormal part of the hurdle_lognormal() regression include zeros into analysis?

I have a zero-inflated Y-variable, 50% of its values are zeros.

I thought that the lognormal part of the hurdle_lognormal model analyses only positive Y values; however, today I gave a simple try with two models:

m1 = brm(y ~ x, family = lognormal(), data = filter(data, y > 0))

m2 = brm(y ~ x, family = hurdle_lognormal(), data = data)

After checking conditional_effects() plots, the m1 model y-axis estimates are about 2 times higher. Is it so that the hurdle_lognormal() lognormal part includes zeros in analysis and therefore the estimates are lower?

Thus, if I would like to analyse zeros and positive values of Y separately, I have to use two different models? Or in other words, hurdle_lognormal() is not suitable for this.

It would be much easier to see what is going on if you provided a reproducible example of the issue.

A hurdle-lognormal model is similar (at its most basic implementation) to running separate Bernoulli and lognormal regressions, which can be shown by this example:

set.seed(2020)

### create data according to the hurdle-lognormal distribution
# probability of a zero
pi <- 0.3
# lognormal mean
mu_log <- 2
sigma_log <- 0.2

# generate data
N <- 1000
y <- (1 - rbinom(N, 1, prob = pi)) * rlnorm(N, mu_log, sigma_log)

### fit models
# m1 = hurdle_lognormal
d <- data.frame(y=y)

m1 <- brms::brm(
  y ~ 1, data = d, 
  family="hurdle_lognormal",
  cores = 4
)

summary(m1)


# output
Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept     2.00      0.01     1.99     2.02       4157 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     0.20      0.01     0.19     0.21       3496 1.00
hu        0.29      0.01     0.26     0.32       3666 1.00

# m2 = lognormal
d_g_0 <- data.frame(y=y[y>0])

m2 <- brms::brm(
  y ~ 1, data = d_g_0, 
  family="lognormal",
  cores = 4
)

summary(m2)

#output
Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept     2.00      0.01     1.99     2.02       3701 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     0.20      0.01     0.19     0.21       2621 1.00

As you can see, the estimates for the lognormal parts of both models are exactly the same.

1 Like

Thank you very much for providing a reproducible example! The model summaries match, however, conditional effects plots differ. How can it be so?

I edited your example by adding a categorical X variable for plotting.

set.seed(2020)

### create data according to the hurdle-lognormal distribution
# probability of a zero

pi <- 0.3
# lognormal mean
mu_log <- 2
sigma_log <- 0.2

# generate data
N <- 1000
y <- (1 - rbinom(N, 1, prob = pi)) * rlnorm(N, mu_log, sigma_log)
x <- rep(letters[1:2], each = 1, times = 500)

### fit models

# m1 = hurdle_lognormal
d <- data.frame(y=y, x=x)

m1 <- brms::brm(
  y ~ x, data = d, 
  family="hurdle_lognormal",
  cores = 4
)

summary(m1)

# m2 = lognormal
d_g_0 <- d %>% filter(y > 0)

m2 <- brms::brm(
  y ~ x, data = d_g_0, 
  family="lognormal",
  cores = 4
)

summary(m2)

#let's plot
    conditional_effects(m1)
    conditional_effects(m2)

m1 - hurdle lognormal - smaller y-axis values

m2 - lognormal - higher y-axis values

I played with the arguments of conditional_effects() function, however, plots always differed.

Then I tried add_fitted_draws() from model posterior, which game similar result.

newdata = expand.grid(x = c("a", "b"))
posterior = add_fitted_draws(m1, newdata = newdata, re_formula = NULL, scale = "response", dpar = NULL, value = "y")
median_qi(posterior)

fitted_draws

Then I tried to manually calculate y-variable estimates and these match the plot of lognormal() model

posterior2 = posterior_samples(m1) %>% mutate(a = exp((b_Intercept) + 1/2 * (sigma)**2), b = exp((b_Intercept + b_xb) + 1/2 * (sigma)**2)) %>% dplyr::select(a, b)
posterior_summary(posterior2) %>% as.data.frame()

calculated values

Conclusions

  • Though I am not sure about my manual calculations, their results match lognormal model conditional effects plot.
  • In contrast, add_fitted_draws() and conditional_effects() plots give weird results for hurdle_lognormal model (m1).
  • I also tried add_fitted_draws() and manual calculations on lognormal model (m2) and both of these matched the conditional effects plot of m2!
  • I run the whole example with another continuous predictor and there was also the same discrepancy.

Why hurdle_lognormal model gives smaller y values? Or what am I doing wrong here?

@di4oi4 That does seem a bit strange. Unfortunately, I do not have the latest version of brms and cannot use the conditional_effects() function to explore more. Maybe @paul.buerkner has time to help?

1 Like

Condtional_effects by default computes the mean of the posterior predictive distribution which is a combination of the lognormal and the hurdle part in this case. Hence I am not surprised that the plots have lower values than the pure log normal part where you excluded the zeros.

2 Likes

Thank you very much! Thus, dpar = NULL gives means including all Y variable values (zeros and non-zeros), and dpar = “hu” gives values only for zero-probability? Am I correct that just reporting dpar = NULL results may be enough for publishing in some cases (it can be considered as a full analysis, including all Y variable values)?

What is enough for publishing is not for me to judge but other you are right with the usage of the dpar.

1 Like

Thanks a lot! Finally, could you recommend some materials about brms families that cover such things like I asked. I also use zero inflated negative binomial modelling and I have the same about this (zeros included or excluded in dpar = NULL?

Always included if something affects the mean.