# Probability that the response is below a given value, conditional on x in a hurdle-lognormal model with spline

Take the following simple hurdle lognormal model with splines:

``````library(brms)

set.seed(1)

N <- 1000
x <- runif(N)

pi <- 0.18 + 0.1*x
mu_log <- 1.8 + 0.3 * x + 12.1 * x^2
sigma_log <- 0.37

y <- (1 - rbinom(N, 1, prob = pi)) * rlnorm(N, mu_log, sigma_log)

d <- data.frame(y = y, x = x)

fit <- brm(bf(y ~ s(x), hu ~ s(x)), data = d, family = hurdle_lognormal())
``````

How can I estimate the probability that the response is below a given value, conditional on x taking a given value? (E.g., what is the probability that y<100 given x=0.5.)

I.e., I would answer it using simulation with

``````mean(posterior_predict(fit, newdata = data.frame(x = 0.5),
nsamples = nsamples(fit))<100)
``````

But the question is: how to calculate it analytically? (Optimally with confidence interval.) If I am not mistaken, it shouldnâ€™t be difficult, as it is â€śjustâ€ť an integration of the posterior (conditional on x), complicated by the fact the we have a spline, and a hurdle model.

1 Like

Hi,
actually using prediction is the recommended and easiest way to do this sort of analysis - in fact you are implicitly integrating over the posterior - the samples represent the posterior and any expectation value over the posterior can be approximated by an expectation over the samples.

However it seems that the problem is primarily identifying the expectation value you are interested in. The code you shared computes the expectation of a binary (0/1 valued) function that is one whenever the observed sample is < 100. There is no credible interval here - it is a single integral, giving a single number.

You may instead use `prepare_predictions` to get samples of the linear predictor for both `y` and `hu` and for `sigma`. You could then compute for each sample the probability that the hurdle is not passed and that given the hurdle is passed what is the probability that the response is < 100 (via `plnorm`). Summing those gives you a continuous value between 0 and 1 for each sample. You can then compute expectations with respect to this value (mean, bounds of 95% confidence intervals, â€¦), which are not the probabilities that a single observation future will be < 100 but rather what proportion of samples would be < 100 if you got infinite new observations. I hope it is clear why for a single sample, you just get a single probability without any variability while the proportion has variability baked in. If not, feel free to ask for clarifications.

Best of luck with your model!

1 Like

Thank you very much!

I appreciate if youâ€™d confirm it, but I think this is what you described:

``````fit2 <- brm(bf(y ~ x, hu ~ x), data = d, family = hurdle_lognormal())

mean(posterior_predict(fit2, newdata = data.frame(x = 0.5),
nsamples = nsamples(fit2))<100)

dparspred <- prepare_predictions(fit2, newdata = data.frame(x=0.5, y = 0))\$dpars
predmus <- dparspred\$mu\$fe\$b %*% t(dparspred\$mu\$fe\$X)
predhus <- dparspred\$hu\$fe\$b %*% t(dparspred\$hu\$fe\$X)

res <- plogis(predhus) + (1-plogis(predhus))*plnorm(100, predmus, dparspred\$sigma)
mean(res)
t.test(res)\$conf.int
``````

However, I am quite stuck with the original version, where we have splines. Is the format of the value of `prepare_predictions` documented anywhereâ€¦? (Sorry for the uninformed question.) I see that we have `fe` just as without splines, but we also have an `sm`, within that we again have an `fe`, but an `re` as well; Iâ€™m lost about their meaningâ€¦

Sorry, I was probably overthinking this - it should be easier to just use `posterior_linpred` to get `predmus` `predhus`. Then you wouldnâ€™t need to worry about the structureâ€¦ Other than that the calculation looks OK.

Additionally, I donâ€™t think it makes sense to run a `t.test` on the results. If you want a posterior credible interval, you can just run `quantile(res, probs = c(0.025,0.975))`

Hope that makes sense!

1 Like

Thanks again!

Additionally, I donâ€™t think it makes sense to run a `t.test` on the results. If you want a posterior credible interval, you can just run `quantile(res, probs = c(0.025,0.975))`

Of course, obviously, sorryâ€¦

Sorry, I was probably overthinking this - it should be easier to just use `posterior_linpred` to get `predmus` `predhus` . Then you wouldnâ€™t need to worry about the structureâ€¦ Other than that the calculation looks OK.

Great, for the record, here is the final code:

``````predmus <- posterior_linpred(fit, newdata = data.frame(x=0.5), dpar = "mu")
predhus <- posterior_linpred(fit, newdata = data.frame(x=0.5), dpar = "hu")
res <- plogis(predhus) +
(1-plogis(predhus))*plnorm(100, predmus,
summary(fit)\$spec_pars["sigma", "Estimate"])
quantile(res)
``````

Thank you again.

1 Like