Predict.brmsfit: new level in a hurdle model

Dominic_Magirr · October 22, 2019, 3:54pm

I’ve been using brms to fit a longitudinal hurdle model where the random effects from the two parts of the model are correlated. It seems to work really well. However, I’m struggling to get correct predictions for a new level with the predict method. I’m happy writing my own code for this but wanted to raise my findings here.

I’ll try to explain with 2 examples. Firstly, a gaussian mixed model where each patient (n = 100) has a (correlated) random intercept and slope. Each patient has 4 independent observations at time 0, and 4 independent observations at time 1.

library(brms)
set.seed(180)
u <- rnorm(100, sd = 2)
v <- rnorm(100, mean = -0.95 * u, sd = sqrt((1 - 0.95^2) * 4)) 
y0 <- rnorm(400, mean = u, sd = 1)
y1 <- rnorm(400, mean = u + v, sd = 1)
id <- 1:100
dat <- data.frame(id = rep(id, 8), 
                  time = rep(0:1, each = 400),
                  y = c(y0, y1))

ggplot2::ggplot(data = dat,
                mapping = aes(x = time, y = y, group = id)) +
  geom_line() +
  geom_point()

fit <- brm(y ~  1 + (1  + time | id),
           data = dat,
           family = gaussian())

When I try to predict the response at time 1 for a new patient (101) using the default method I get a prediction interval that is clearly too wide

predict(fit, 
        newdata = data.frame(id = 101, time = 1),
        allow_new_levels = TRUE)

whereas the sample_new_levels = “gaussian” option looks much better

predict(fit, 
        newdata = data.frame(id = 101, time = 1), 
        allow_new_levels = TRUE,
        sample_new_levels = "gaussian")

Now, I would like to use the sample_new_levels = “gaussian” for a hurdle model:

set.seed(180)
u <- rnorm(100, sd = 2)
v <- rnorm(100, mean = -0.95 * u, sd = sqrt((1 - 0.95^2) * 4)) # p(zero) is negatively correlated with Y*
ystar <- exp(rnorm(800, mean = u, sd = 1))
z <- rbinom(800, size = 1, prob = 1 - plogis(-1 + v)) # p(cross hurdle) = 1 - p(zero)
id <- 1:100

dat <- data.frame(id = rep(id, 8), 
                  y = z * ystar)


fit_hurdle <- brm(bf(y ~  1 + (1 | q | id),
                     hu ~ 1 + (1 | q | id)),
                  data = dat,
                  iter = 4000,
                  family = hurdle_lognormal())


predict(fit_hurdle, 
        newdata = data.frame(id = 101), 
        allow_new_levels = TRUE,
        sample_new_levels = "gaussian",
        robust = TRUE)

But this gives me values that are too small. It’s difficult to see from a plot of the data but I can simulate directly from the model

u <- rnorm(1e6, sd = 2)
v <- rnorm(1e6, mean = -0.95 * u, sd = sqrt((1 - 0.95^2) * 4)) # p(zero) is negatively correlated with Y*
ystar <- exp(rnorm(1e6, mean = u, sd = 1))
z <- rbinom(1e6, size = 1, prob = 1 - plogis(-1 + v)) # p(cross hurdle) = 1 - p(zero)
y <- z * ystar

quantile(y, probs = c(0.025, 0.5, 0.975))

It looks like the predict method is not taking into account correlation in the hurdle model, whereas it was for the gaussian model.

Operating System: macOS 10.14.6
brms Version: 2.10.0

martinmodrak · October 24, 2019, 11:08am

Sorry, can’t answer now, but maybe @jonah can help?

jonah · October 24, 2019, 5:40pm

Dominic_Magirr:

u <- rnorm(1e6, sd = 2) 
v <- rnorm(1e6, mean = -0.95 * u, sd = sqrt((1 - 0.95^2) * 4)) # p(zero) is negatively correlated with Y* 
ystar <- exp(rnorm(1e6, mean = u, sd = 1)) 
z <- rbinom(1e6, size = 1, prob = 1 - plogis(-1 + v)) # p(cross hurdle) = 1 - p(zero) y <- z * ystar

Hmm, I don’t have much time too look into this right now unfortunately, but I did try something quickly:

if I take the code above for simulating a new y and overlay quantiles of log(y) on top of the posterior predictive distribution of a new data point at new level 101 (also on the log-scale, so it’s easier to see), then I get something that looks pretty reasonable to me:

preds <- posterior_predict(fit_hurdle, 
                           newdata = data.frame(id = 101), 
                           allow_new_levels = TRUE
                           sample_new_levels = "gaussian")
log_preds <- log(preds)
colnames(log_preds) <- "log-prediction for id=101"

library(bayesplot)
log_y_quants <- quantile(log(y + 0.001), probs = c(0.025, 0.5, 0.975))
mcmc_hist(log_preds) + vline_at(log_y_quants, size = 1)

Dominic_Magirr · October 25, 2019, 9:30am

Many thanks for taking a look. I agree, I probably chose a poor example, or perhaps it’s not a big problem generally. Strictly speaking, though, I think there is an issue.

Below I have produced the posterior predictions manually from the posterior samples, where I ignore the correlation between the random effects. This gives me the same as the predict method…

### predict manually
ps <- posterior_samples(fit_hurdle)

sigma_error <- ps[,"sigma"]
sigma_u <- ps[,"sd_id__hu_Intercept"]
sigma_v <- ps[,"sd_id__Intercept"]
rho <- ps[,"cor_id__Intercept__hu_Intercept"]

n_mcmc <- length(rho)

x <- data.frame(id = NA, time = "2")

### simulate u_i' and v_i' (ignore rho)
u <- rnorm(n_mcmc, sd = sigma_u)
v <- rnorm(n_mcmc, sd = sigma_v)

### extract draws from xi = x*gamma
xi <- qlogis(fitted(fit_hurdle, 
                    newdata = x, 
                    re_formula = NA, 
                    dpar = "hu",
                    summary = FALSE))

### extract draws from eta = x*beta
eta <- fitted(fit_hurdle, 
              newdata = x, 
              re_formula = NA, 
              dpar = "mu",
              summary = FALSE)

ystar <- exp(rnorm(n_mcmc, eta + v, 1))
z <- rbinom(n_mcmc, 1, 1 - plogis(xi + u))
y <- z * ystar
quantile(y, probs = c(0.025, 0.5, 0.975))

whereas when I include the correlation (as is correct)…

#####################
## include correlation
v <- rnorm(n_mcmc, 
           mean = u * sigma_v / sigma_u * rho,
           sd = sqrt((1 - rho^2) * sigma_v ^ 2))
ystar <- exp(rnorm(n_mcmc, eta + v, 1))
y <- z * ystar
quantile(y, probs = c(0.025, 0.5, 0.975))

jonah · October 25, 2019, 1:51pm

Thanks for following up, I think I see what you mean. @paul.buerkner is this the intended behavior or a bug?

paul.buerkner · October 25, 2019, 3:47pm

I will take a look asap!

paul.buerkner · October 26, 2019, 3:38am

You were absolutely right both about that there is indeed a problem and its cause. Thank you for providing such a clear reproducible example!

In one of the recent brms version, I refactored the functions to generate draws for new levels. There, I appeared to have made the mistake of sampling indices of (old) levels, from which to extract draws, independently for each varying effect which effectively resulted in ignoring their correlations.

The bug should now be fixed in the github version of brms and I would appreciate you double checking it.

Thank you so much for bringing this to our attention and helping us to fix it!

Dominic_Magirr · October 28, 2019, 10:55am

Many thanks @paul.buerkner – happy to help!

I can see the change you’ve made to extract_draws.R when sample_new_levels = “uncertainty”. This looks better to me now, and gives me a sensible answer to the Gaussian model example above. However, the other part of my question was regarding a hurdle model, where the sample_new_levels = “gaussian” option also seems to be ignoring correlation, which I included in the model using the (1 | q | id) syntax. I don’t really understand the code, but it looks like this could be a more difficult issue to fix?

paul.buerkner · October 28, 2019, 12:13pm

You are right. Indeed, there is another problem, which is more difficult to fix as it requires refactoring some of the code in extract_draws. I opened an issue on github: https://github.com/paul-buerkner/brms/issues/779

paul.buerkner · October 28, 2019, 7:07pm

I have now made a fix on github so that varying effects are now post-processed together before they are split up to be stored in their respective model parts. This implies that correlations should now correctly be modeled even for new levels. Can you please double check whether the new results are more sensible to you? I don’t know if they match your expectation, but the correlation is at least considered correctly now, I hope.

Dominic_Magirr · October 29, 2019, 9:07am

Yes, matches perfectly. Many thanks for all your excellent work on brms – I will be using it a lot!

paul.buerkner · October 29, 2019, 9:12am

Many thanks for providing such clear examples of the problem and testing my fixes!

Topic		Replies	Views
Overall change in hurdle model with random effects brms	2	1530	August 20, 2018
Need help understanding hurdle (hurdle_gamma) models using brms brms	2	1453	July 19, 2020
Estimates in hurdle gamma model brms fitting-issues	1	3602	November 11, 2020
Priors for hurdle models in brms Modeling priors , brms	2	738	August 21, 2023
How to easily plot the conditional effects of the hurdle part of brms model? Modeling	2	925	June 29, 2020

Predict.brmsfit: new level in a hurdle model

Related topics