Probability that the response is below a given value, conditional on x in a hurdle-lognormal model with spline

Take the following simple hurdle lognormal model with splines:

library(brms)

set.seed(1)

N <- 1000
x <- runif(N)

pi <- 0.18 + 0.1*x
mu_log <- 1.8 + 0.3 * x + 12.1 * x^2
sigma_log <- 0.37

y <- (1 - rbinom(N, 1, prob = pi)) * rlnorm(N, mu_log, sigma_log)

d <- data.frame(y = y, x = x)

fit <- brm(bf(y ~ s(x), hu ~ s(x)), data = d, family = hurdle_lognormal())

How can I estimate the probability that the response is below a given value, conditional on x taking a given value? (E.g., what is the probability that y<100 given x=0.5.)

I.e., I would answer it using simulation with

mean(posterior_predict(fit, newdata = data.frame(x = 0.5),
                       nsamples = nsamples(fit))<100)

But the question is: how to calculate it analytically? (Optimally with confidence interval.) If I am not mistaken, it shouldn’t be difficult, as it is “just” an integration of the posterior (conditional on x), complicated by the fact the we have a spline, and a hurdle model.

1 Like

Hi,
actually using prediction is the recommended and easiest way to do this sort of analysis - in fact you are implicitly integrating over the posterior - the samples represent the posterior and any expectation value over the posterior can be approximated by an expectation over the samples.

However it seems that the problem is primarily identifying the expectation value you are interested in. The code you shared computes the expectation of a binary (0/1 valued) function that is one whenever the observed sample is < 100. There is no credible interval here - it is a single integral, giving a single number.

You may instead use prepare_predictions to get samples of the linear predictor for both y and hu and for sigma. You could then compute for each sample the probability that the hurdle is not passed and that given the hurdle is passed what is the probability that the response is < 100 (via plnorm). Summing those gives you a continuous value between 0 and 1 for each sample. You can then compute expectations with respect to this value (mean, bounds of 95% confidence intervals, …), which are not the probabilities that a single observation future will be < 100 but rather what proportion of samples would be < 100 if you got infinite new observations. I hope it is clear why for a single sample, you just get a single probability without any variability while the proportion has variability baked in. If not, feel free to ask for clarifications.

Best of luck with your model!

1 Like

Thank you very much!

I appreciate if you’d confirm it, but I think this is what you described:

fit2 <- brm(bf(y ~ x, hu ~ x), data = d, family = hurdle_lognormal())

mean(posterior_predict(fit2, newdata = data.frame(x = 0.5),
                       nsamples = nsamples(fit2))<100)

dparspred <- prepare_predictions(fit2, newdata = data.frame(x=0.5, y = 0))$dpars
predmus <- dparspred$mu$fe$b %*% t(dparspred$mu$fe$X)
predhus <- dparspred$hu$fe$b %*% t(dparspred$hu$fe$X)

res <- plogis(predhus) + (1-plogis(predhus))*plnorm(100, predmus, dparspred$sigma)
mean(res)
t.test(res)$conf.int

However, I am quite stuck with the original version, where we have splines. Is the format of the value of prepare_predictions documented anywhere…? (Sorry for the uninformed question.) I see that we have fe just as without splines, but we also have an sm, within that we again have an fe, but an re as well; I’m lost about their meaning…

Sorry, I was probably overthinking this - it should be easier to just use posterior_linpred to get predmus predhus. Then you wouldn’t need to worry about the structure… Other than that the calculation looks OK.

Additionally, I don’t think it makes sense to run a t.test on the results. If you want a posterior credible interval, you can just run quantile(res, probs = c(0.025,0.975))

Hope that makes sense!

1 Like

Thanks again!

Additionally, I don’t think it makes sense to run a t.test on the results. If you want a posterior credible interval, you can just run quantile(res, probs = c(0.025,0.975))

Of course, obviously, sorry…

Sorry, I was probably overthinking this - it should be easier to just use posterior_linpred to get predmus predhus . Then you wouldn’t need to worry about the structure… Other than that the calculation looks OK.

Great, for the record, here is the final code:

predmus <- posterior_linpred(fit, newdata = data.frame(x=0.5), dpar = "mu")
predhus <- posterior_linpred(fit, newdata = data.frame(x=0.5), dpar = "hu")
res <- plogis(predhus) +
  (1-plogis(predhus))*plnorm(100, predmus,
                             summary(fit)$spec_pars["sigma", "Estimate"])
quantile(res)

Thank you again.

1 Like