Predict.brmsfit: new level in a hurdle model

I’ve been using brms to fit a longitudinal hurdle model where the random effects from the two parts of the model are correlated. It seems to work really well. However, I’m struggling to get correct predictions for a new level with the predict method. I’m happy writing my own code for this but wanted to raise my findings here.

I’ll try to explain with 2 examples. Firstly, a gaussian mixed model where each patient (n = 100) has a (correlated) random intercept and slope. Each patient has 4 independent observations at time 0, and 4 independent observations at time 1.

library(brms)
set.seed(180)
u <- rnorm(100, sd = 2)
v <- rnorm(100, mean = -0.95 * u, sd = sqrt((1 - 0.95^2) * 4)) 
y0 <- rnorm(400, mean = u, sd = 1)
y1 <- rnorm(400, mean = u + v, sd = 1)
id <- 1:100
dat <- data.frame(id = rep(id, 8), 
                  time = rep(0:1, each = 400),
                  y = c(y0, y1))

ggplot2::ggplot(data = dat,
                mapping = aes(x = time, y = y, group = id)) +
  geom_line() +
  geom_point()

fit <- brm(y ~  1 + (1  + time | id),
           data = dat,
           family = gaussian())

When I try to predict the response at time 1 for a new patient (101) using the default method I get a prediction interval that is clearly too wide

predict(fit, 
        newdata = data.frame(id = 101, time = 1),
        allow_new_levels = TRUE)

whereas the sample_new_levels = “gaussian” option looks much better

predict(fit, 
        newdata = data.frame(id = 101, time = 1), 
        allow_new_levels = TRUE,
        sample_new_levels = "gaussian")

Now, I would like to use the sample_new_levels = “gaussian” for a hurdle model:

set.seed(180)
u <- rnorm(100, sd = 2)
v <- rnorm(100, mean = -0.95 * u, sd = sqrt((1 - 0.95^2) * 4)) # p(zero) is negatively correlated with Y*
ystar <- exp(rnorm(800, mean = u, sd = 1))
z <- rbinom(800, size = 1, prob = 1 - plogis(-1 + v)) # p(cross hurdle) = 1 - p(zero)
id <- 1:100

dat <- data.frame(id = rep(id, 8), 
                  y = z * ystar)


fit_hurdle <- brm(bf(y ~  1 + (1 | q | id),
                     hu ~ 1 + (1 | q | id)),
                  data = dat,
                  iter = 4000,
                  family = hurdle_lognormal())


predict(fit_hurdle, 
        newdata = data.frame(id = 101), 
        allow_new_levels = TRUE,
        sample_new_levels = "gaussian",
        robust = TRUE)

But this gives me values that are too small. It’s difficult to see from a plot of the data but I can simulate directly from the model

u <- rnorm(1e6, sd = 2)
v <- rnorm(1e6, mean = -0.95 * u, sd = sqrt((1 - 0.95^2) * 4)) # p(zero) is negatively correlated with Y*
ystar <- exp(rnorm(1e6, mean = u, sd = 1))
z <- rbinom(1e6, size = 1, prob = 1 - plogis(-1 + v)) # p(cross hurdle) = 1 - p(zero)
y <- z * ystar

quantile(y, probs = c(0.025, 0.5, 0.975))

It looks like the predict method is not taking into account correlation in the hurdle model, whereas it was for the gaussian model.

  • Operating System: macOS 10.14.6
  • brms Version: 2.10.0
1 Like

Sorry, can’t answer now, but maybe @jonah can help?

Hmm, I don’t have much time too look into this right now unfortunately, but I did try something quickly:

if I take the code above for simulating a new y and overlay quantiles of log(y) on top of the posterior predictive distribution of a new data point at new level 101 (also on the log-scale, so it’s easier to see), then I get something that looks pretty reasonable to me:

preds <- posterior_predict(fit_hurdle, 
                           newdata = data.frame(id = 101), 
                           allow_new_levels = TRUE
                           sample_new_levels = "gaussian")
log_preds <- log(preds)
colnames(log_preds) <- "log-prediction for id=101"

library(bayesplot)
log_y_quants <- quantile(log(y + 0.001), probs = c(0.025, 0.5, 0.975))
mcmc_hist(log_preds) + vline_at(log_y_quants, size = 1)

Many thanks for taking a look. I agree, I probably chose a poor example, or perhaps it’s not a big problem generally. Strictly speaking, though, I think there is an issue.

Below I have produced the posterior predictions manually from the posterior samples, where I ignore the correlation between the random effects. This gives me the same as the predict method…

### predict manually
ps <- posterior_samples(fit_hurdle)

sigma_error <- ps[,"sigma"]
sigma_u <- ps[,"sd_id__hu_Intercept"]
sigma_v <- ps[,"sd_id__Intercept"]
rho <- ps[,"cor_id__Intercept__hu_Intercept"]

n_mcmc <- length(rho)

x <- data.frame(id = NA, time = "2")

### simulate u_i' and v_i' (ignore rho)
u <- rnorm(n_mcmc, sd = sigma_u)
v <- rnorm(n_mcmc, sd = sigma_v)

### extract draws from xi = x*gamma
xi <- qlogis(fitted(fit_hurdle, 
                    newdata = x, 
                    re_formula = NA, 
                    dpar = "hu",
                    summary = FALSE))

### extract draws from eta = x*beta
eta <- fitted(fit_hurdle, 
              newdata = x, 
              re_formula = NA, 
              dpar = "mu",
              summary = FALSE)

ystar <- exp(rnorm(n_mcmc, eta + v, 1))
z <- rbinom(n_mcmc, 1, 1 - plogis(xi + u))
y <- z * ystar
quantile(y, probs = c(0.025, 0.5, 0.975))

whereas when I include the correlation (as is correct)…

#####################
## include correlation
v <- rnorm(n_mcmc, 
           mean = u * sigma_v / sigma_u * rho,
           sd = sqrt((1 - rho^2) * sigma_v ^ 2))
ystar <- exp(rnorm(n_mcmc, eta + v, 1))
y <- z * ystar
quantile(y, probs = c(0.025, 0.5, 0.975))
2 Likes

Thanks for following up, I think I see what you mean. @paul.buerkner is this the intended behavior or a bug?

I will take a look asap!

You were absolutely right both about that there is indeed a problem and its cause. Thank you for providing such a clear reproducible example!

In one of the recent brms version, I refactored the functions to generate draws for new levels. There, I appeared to have made the mistake of sampling indices of (old) levels, from which to extract draws, independently for each varying effect which effectively resulted in ignoring their correlations.

The bug should now be fixed in the github version of brms and I would appreciate you double checking it.

Thank you so much for bringing this to our attention and helping us to fix it!

5 Likes

Many thanks @paul.buerkner – happy to help!

I can see the change you’ve made to extract_draws.R when sample_new_levels = “uncertainty”. This looks better to me now, and gives me a sensible answer to the Gaussian model example above. However, the other part of my question was regarding a hurdle model, where the sample_new_levels = “gaussian” option also seems to be ignoring correlation, which I included in the model using the (1 | q | id) syntax. I don’t really understand the code, but it looks like this could be a more difficult issue to fix?

1 Like

You are right. Indeed, there is another problem, which is more difficult to fix as it requires refactoring some of the code in extract_draws. I opened an issue on github: https://github.com/paul-buerkner/brms/issues/779

1 Like

I have now made a fix on github so that varying effects are now post-processed together before they are split up to be stored in their respective model parts. This implies that correlations should now correctly be modeled even for new levels. Can you please double check whether the new results are more sensible to you? I don’t know if they match your expectation, but the correlation is at least considered correctly now, I hope.

1 Like

Yes, matches perfectly. Many thanks for all your excellent work on brms – I will be using it a lot!

Many thanks for providing such clear examples of the problem and testing my fixes!