Probability that the response is below a given value, conditional on x in a hurdle-lognormal model with spline

tamas.ferenci · April 24, 2021, 8:40pm

Take the following simple hurdle lognormal model with splines:

library(brms)

set.seed(1)

N <- 1000
x <- runif(N)

pi <- 0.18 + 0.1*x
mu_log <- 1.8 + 0.3 * x + 12.1 * x^2
sigma_log <- 0.37

y <- (1 - rbinom(N, 1, prob = pi)) * rlnorm(N, mu_log, sigma_log)

d <- data.frame(y = y, x = x)

fit <- brm(bf(y ~ s(x), hu ~ s(x)), data = d, family = hurdle_lognormal())

How can I estimate the probability that the response is below a given value, conditional on x taking a given value? (E.g., what is the probability that y<100 given x=0.5.)

I.e., I would answer it using simulation with

mean(posterior_predict(fit, newdata = data.frame(x = 0.5),
                       nsamples = nsamples(fit))<100)

But the question is: how to calculate it analytically? (Optimally with confidence interval.) If I am not mistaken, it shouldn’t be difficult, as it is “just” an integration of the posterior (conditional on x), complicated by the fact the we have a spline, and a hurdle model.

martinmodrak · April 30, 2021, 8:58am

Hi,
actually using prediction is the recommended and easiest way to do this sort of analysis - in fact you are implicitly integrating over the posterior - the samples represent the posterior and any expectation value over the posterior can be approximated by an expectation over the samples.

However it seems that the problem is primarily identifying the expectation value you are interested in. The code you shared computes the expectation of a binary (0/1 valued) function that is one whenever the observed sample is < 100. There is no credible interval here - it is a single integral, giving a single number.

You may instead use prepare_predictions to get samples of the linear predictor for both y and hu and for sigma. You could then compute for each sample the probability that the hurdle is not passed and that given the hurdle is passed what is the probability that the response is < 100 (via plnorm). Summing those gives you a continuous value between 0 and 1 for each sample. You can then compute expectations with respect to this value (mean, bounds of 95% confidence intervals, …), which are not the probabilities that a single observation future will be < 100 but rather what proportion of samples would be < 100 if you got infinite new observations. I hope it is clear why for a single sample, you just get a single probability without any variability while the proportion has variability baked in. If not, feel free to ask for clarifications.

Best of luck with your model!

tamas.ferenci · May 3, 2021, 5:38pm

Thank you very much!

I appreciate if you’d confirm it, but I think this is what you described:

fit2 <- brm(bf(y ~ x, hu ~ x), data = d, family = hurdle_lognormal())

mean(posterior_predict(fit2, newdata = data.frame(x = 0.5),
                       nsamples = nsamples(fit2))<100)

dparspred <- prepare_predictions(fit2, newdata = data.frame(x=0.5, y = 0))$dpars
predmus <- dparspred$mu$fe$b %*% t(dparspred$mu$fe$X)
predhus <- dparspred$hu$fe$b %*% t(dparspred$hu$fe$X)

res <- plogis(predhus) + (1-plogis(predhus))*plnorm(100, predmus, dparspred$sigma)
mean(res)
t.test(res)$conf.int

However, I am quite stuck with the original version, where we have splines. Is the format of the value of prepare_predictions documented anywhere…? (Sorry for the uninformed question.) I see that we have fe just as without splines, but we also have an sm, within that we again have an fe, but an re as well; I’m lost about their meaning…

martinmodrak · May 3, 2021, 5:57pm

Sorry, I was probably overthinking this - it should be easier to just use posterior_linpred to get predmus predhus. Then you wouldn’t need to worry about the structure… Other than that the calculation looks OK.

Additionally, I don’t think it makes sense to run a t.test on the results. If you want a posterior credible interval, you can just run quantile(res, probs = c(0.025,0.975))

Hope that makes sense!

tamas.ferenci · May 3, 2021, 6:30pm

Thanks again!

Additionally, I don’t think it makes sense to run a t.test on the results. If you want a posterior credible interval, you can just run quantile(res, probs = c(0.025,0.975))

Of course, obviously, sorry…

Sorry, I was probably overthinking this - it should be easier to just use posterior_linpred to get predmus predhus . Then you wouldn’t need to worry about the structure… Other than that the calculation looks OK.

Great, for the record, here is the final code:

predmus <- posterior_linpred(fit, newdata = data.frame(x=0.5), dpar = "mu")
predhus <- posterior_linpred(fit, newdata = data.frame(x=0.5), dpar = "hu")
res <- plogis(predhus) +
  (1-plogis(predhus))*plnorm(100, predmus,
                             summary(fit)$spec_pars["sigma", "Estimate"])
quantile(res)

Thank you again.

Topic		Replies	Views
Hurdle Prior or Piecewise Function? Modeling specification , brms	1	38	September 8, 2024
Brms: does the lognormal part of the hurdle_lognormal() regression include zeros into analysis? Modeling	8	2223	September 27, 2020
How to convert hurdle lognormal model posterior values to actual values of y variable? Modeling	1	382	August 21, 2020
Priors for hurdle models in brms Modeling priors , brms	2	738	August 21, 2023
Brms: discrepancy between conditional effects plots and calculations based on posterior Modeling	1	835	September 25, 2020

Probability that the response is below a given value, conditional on x in a hurdle-lognormal model with spline

Related topics