Cumulative probit model: predicting the probability an observation is in a category or higher (i.e. above or below a single threshold)

I have data for which the outcome is ordered categories. I fit a cumulative probit model. Among other things, I am also interested in predicting for each observation the probability that it is in a specific category or higher (i.e. above a threshold). For example, let’s say I have 7 categories, and the observation is “S - Susceptible” if it is in categories 1-4 and is “R - Resistant” if it is in categories 5-7.

In brms, the handy fitted() function will give me the probability that an observation is in each of the 7 categories. However, to find the probability that it is in category 5 or higher, my thought was that I could use fitted(model, scale="linear", summary=FALSE) to obtain the posterior samples of the mean on the scale of the linear predictor for each observation. Then I could pass the 4th threshold and those fitted means to pnorm() to find the probability that the observation is below or above the 4th threshold (in categories 1-4 or 5-7).
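To make the logic concrete, here is a minimal sketch with made-up values for the 4th threshold and one observation's linear predictor (not draws from a fitted model):

```r
#made-up values: tau4 = 4th threshold draw, eta = linear-predictor draw
tau4 <- 0.3
eta  <- -0.5

#under a cumulative probit model, P(Y <= 4) = Phi(tau_4 - eta);
#pnorm(tau4, mean = eta) computes exactly Phi(tau4 - eta)
p_S <- pnorm(tau4, mean = eta)   #P(categories 1-4), "Susceptible"
p_R <- 1 - p_S                   #P(categories 5-7), "Resistant"
```

With real output, tau4 would be a posterior draw of `b_Intercept[4]` and eta the corresponding draw from fitted(..., scale="linear", summary=FALSE).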

Does that make sense? Am I understanding the cumulative probit model correctly?

Example code to do this is below:


library(brms)

s <- 1:7
y <- sample(s, size=50, replace=TRUE)
trt <- rbinom(n=50, size=1, prob=0.5)
d1 <- data.frame(y, trt)
d1$trt <- factor(d1$trt)

#cumulative probit model
fit1 <- brm(y ~ trt, data=d1, family=cumulative(probit), cores=4)

#Find the probability for each observation being R or S
#R = resistant, when the obs is in category 5 or higher (above threshold 4)
#S = susceptible, when the obs is in category 4 or lower (below threshold 4)

#predictions on the linear predictor scale, all posterior samples
ffit1L <- fitted(fit1, scale="linear", summary=F)

#extract draws
s1 <- as_draws_df(fit1)

#compute the probabilities that the observation is in S or R, i.e. the
#probability that it is below or above threshold 4
#pnorm(tau, mean = eta) gives Phi(tau - eta); the vector of threshold
#draws is recycled across the columns of the draws-by-observations
#matrix, so posterior draws stay aligned
p_y_S <- pnorm(s1$`b_Intercept[4]`, mean=ffit1L)
p_y_R <- 1 - p_y_S

#posterior mean probability per observation
mean_probs_y_S <- colMeans(p_y_S)

mean_probs_y_R <- colMeans(p_y_R)

I think (but please check!) the Post.Prob column returned by hypothesis(fit1, "trt1 > Intercept[4]") gives you the answer you require here (the prob of being 5, 6, or 7 when trt = 1). And, for when trt is at its reference, hypothesis(fit1, "0 > Intercept[4]").

Thanks. I tried it but it doesn’t give a similar answer (and it doesn’t appear to be correct)… Also, if I had a lot of predictors, then I’m not sure how I would use it.

Regardless of the model structure, in a problem this simple you can just add the probabilities of being in category 5, in category 6, and in category 7 to get the probability of being in any one of them.
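As a sketch of the summing approach: for a cumulative model, fitted(fit1, summary=FALSE) returns a draws × observations × categories array of probabilities. A small mock array stands in for it here so the arithmetic is self-contained; with a real fit you would replace it with the fitted() output.

```r
#mock stand-in for fitted(fit1, summary=FALSE): an
#ndraws x nobs x ncategories array of probabilities
ndraws <- 4; nobs <- 3; ncat <- 7
set.seed(1)
pp <- array(rexp(ndraws * nobs * ncat), dim = c(ndraws, nobs, ncat))
pp <- pp / c(apply(pp, c(1, 2), sum))  #normalise each draw/obs slice to 1

#P(R) = P(cat 5) + P(cat 6) + P(cat 7), per draw and observation
p_R_draws <- apply(pp[, , 5:7], c(1, 2), sum)

#posterior mean probability of R per observation
mean_p_R <- colMeans(p_R_draws)
```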

If you want to dive into the probit structure, here’s the doc:

Given that the probability of a run of adjacent categories is just a sum of cdf differences, the sum telescopes and you can compute the whole thing with a single cdf call.
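A quick check of the telescoping, with made-up thresholds and linear predictor: each category probability is a difference of adjacent cdf values, so summing categories 5-7 collapses to one pnorm() call.

```r
#cumulative probit: P(Y = k) = Phi(tau_k - eta) - Phi(tau_{k-1} - eta),
#with tau_0 = -Inf and tau_7 = Inf (made-up values below)
tau <- c(-1.5, -0.8, -0.2, 0.3, 0.9, 1.6)  #6 thresholds, 7 categories
eta <- 0.4

#all 7 category probabilities as cdf differences
cat_probs <- diff(pnorm(c(-Inf, tau, Inf), mean = eta))

#summing categories 5-7 telescopes to a single cdf call
sum_5_to_7 <- sum(cat_probs[5:7])
one_call   <- 1 - pnorm(tau[4], mean = eta)  #1 - Phi(tau_4 - eta)
```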


Cool, thanks. Adding up the fitted() samples for the category probabilities gives me the same answers as my code above. So it seems there are a variety of ways to go about it.
Thanks for the check!