I have a question that is primarily about brms (prediction/post-processing of outputs from an ordinal model), which I hope somebody in the community can help with.
The problem at hand is secondarily about how best to model a skewed count outcome with zero-inflation and upper truncation, and I would be happy to receive modelling suggestions for that as well.
Context and a minimal reproducible example are provided below.
Intended setting: critical care randomised clinical trial comparing two (or more) treatments.
Intended outcome: “days alive without life support at day 90”. This outcome (and similarly defined outcomes) is frequently used in the critical care setting, see e.g. Effect of Hydrocortisone on Mortality and Organ Support in Patients With Severe COVID-19 (that trial only counted days up to day 28, but the overall issue is the same). The outcome in this example is recorded as a whole number from 0 to 90 for each patient (observation). Often (but not always), patients who die are considered to have the worst possible outcome (0 days, or, as in the example in the above link, sometimes -1 days in an ordinal model) regardless of how many days they actually had. In the example below, patients who died are considered to have had an outcome of 0.
The distribution of this outcome is usually highly skewed with both a substantial excess of 0’s and truncation above 90. Here’s R code for producing an example dataset that looks roughly like the usual distribution of this outcome. The only additional variable is treatment assignment, and patients in the treatment group have better outcomes:
set.seed(20210119)
# Simulate days alive without life support: a = control group, b = treatment group
# (a skewed bulk close to 90, a uniform component, and an excess of zeros)
a <- c(90 - rnbinom(1050, mu = 8, size = 15), sample(0:90, 150, replace = TRUE), rep(0, 300))
b <- c(90 - rnbinom(1100, mu = 5, size = 15), sample(0:90, 150, replace = TRUE), rep(0, 250))
dta <- data.frame(treatment = 1:3000 > 1500,
                  res = c(a, b),
                  res_fct = factor(c(a, b), ordered = TRUE))

library(ggplot2)
ggplot(data = dta) +
  geom_bar(aes(x = res_fct, colour = treatment), position = "dodge") +
  scale_x_discrete(labels = ifelse(0:90 %% 5 == 0, 0:90, rep("", 91)))
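For reference, a quick numeric summary of the simulated outcome (the share of zeros and the overall skew in each group) can be obtained along these lines:
# Proportion of patients with 0 days and a distribution summary, by treatment group
prop.table(table(dta$treatment, dta$res == 0), margin = 1)
tapply(dta$res, dta$treatment, summary)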
I’m interested in modelling this outcome (in the actual problem I also want to adjust for additional variables, but that can be ignored for now) and subsequently estimating the expected mean number of days alive without life support for a reference patient in each of the two treatment groups (with all additional covariates set to sensible values, e.g. the most common category for categorical covariates - not included here). I then plan to use this to estimate additional quantities, e.g. the difference in means.
I’ve tried multiple models, including a hurdle negative binomial model, which does not provide a perfect fit to the overall distribution based on posterior predictive checks, but captures the expected mean in each group (which is what I’m primarily interested in) very well.
Example:
# Hurdle-negative binomial model
library(brms)
fit_hnb <- brm(bf(res ~ treatment, hu ~ treatment), family = hurdle_negbinomial(), data = dta,
               cores = 4, backend = "cmdstanr", refresh = 0)
fit_hnb
pp_check(fit_hnb, type = "dens_overlay_grouped", group = "treatment") # Imperfect overall fit
pp_check(fit_hnb, type = "stat_grouped", stat = "mean", group = "treatment") # Predicted means are captured very well and matches input data acceptably
# Use posterior_epred with two "reference patients" to calculate further quantities (e.g. differences in means with 95% CrIs)
pep_hnb <- posterior_epred(fit_hnb, newdata = data.frame(treatment = c(FALSE, TRUE)))
quantile(pep_hnb[, 1] - pep_hnb[, 2], probs = c(0.025, 0.5, 0.975))
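For completeness, I presume the absolute expected means per reference group can be summarised directly from the same posterior_epred() output (columns match the rows of newdata, i.e. column 1 = control, column 2 = treatment):
# Posterior median and 95% CrI of the expected mean days in each reference group
t(apply(pep_hnb, 2, quantile, probs = c(0.025, 0.5, 0.975)))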
This works and may be acceptable for the problem at hand, as it adequately captures the quantity of interest (the mean in each group; this also holds for more complicated example datasets). However, as the first posterior predictive check plot reveals, the overall fit is not great. If I instead model the outcome as ordinal using family = cumulative(), I get a near-perfect posterior predictive check plot for the overall distribution:
# Ordinal model
fit_ord <- brm(res_fct ~ treatment, family = cumulative(), data = dta,
               cores = 4, backend = "cmdstanr", refresh = 0)
fit_ord
pp_check(fit_ord, type = "dens_overlay_grouped", group = "treatment") # Excellent overall fit
This looks a lot better, and the approach is similar to what has been done elsewhere (e.g. in the reference above). It allows me to calculate an odds ratio for the difference between the two groups (under the proportional odds assumption), for example along the lines sketched below.
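(This assumes the population-level coefficient for the logical treatment variable is named b_treatmentTRUE; I am simply summarising the exponentiated posterior draws.)
# Posterior median and 95% CrI for the treatment odds ratio under proportional odds
or_draws <- exp(posterior_samples(fit_ord, pars = "b_treatmentTRUE")[, 1])
quantile(or_draws, probs = c(0.025, 0.5, 0.975))
This is good, but if I want to calculate not only the odds ratio but also the adjusted expected mean number of days in each group (and derived quantities, as stated above), I’m unsure how to proceed.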
If I use posterior_epred for two “reference patients” here, I’m not sure how to convert the results to the actual number of days (or to the actual categories of the factor, which could then easily be converted). I assume I have to relate the values to the intercepts/thresholds, but I am unsure how to do it.
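To make the question concrete, the kind of conversion I have in mind is sketched below. This assumes that posterior_epred() returns per-category probabilities (a draws x observations x categories array) for a cumulative() model, and that weighting each category's numeric value by its probability is a valid way to obtain expected days - I am not sure either assumption is correct:
# Sketch only: probability-weighted average of the ordinal levels as "expected days"
pep_ord_probs <- posterior_epred(fit_ord, newdata = data.frame(treatment = c(FALSE, TRUE)))
cat_values <- as.numeric(as.character(levels(dta$res_fct))) # numeric value of each ordinal level
exp_days <- apply(pep_ord_probs, c(1, 2), function(p) sum(p * cat_values)) # draws x 2 matrix
quantile(exp_days[, 1] - exp_days[, 2], probs = c(0.025, 0.5, 0.975)) # difference in expected days between the two reference patients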
If I instead use posterior_predict, I get sensible results, but as for other models this predicts a full distribution of potential outcomes rather than just the mean, which is not really what I’m interested in. In addition, while posterior_predict generates data that generally match the input dataset, for some reason it does not appear to generate any values in the lowest (and most common) category, which is confusing. See the example code:
pep_ord <- posterior_epred(fit_ord, newdata = data.frame(treatment = c(FALSE, TRUE))) # Unsure how to interpret/convert - use thresholds?
pp_ord <- posterior_predict(fit_ord, newdata = data.frame(treatment = c(FALSE, TRUE))) # Additional uncertainty due to use of posterior_predict and not posterior_epred?
# Comparing percentages of original data:
table(dta$res_fct)/length(dta$res_fct)*100
# With percentages from pp_ord:
table(pp_ord)/length(pp_ord)*100
# Seems to match the input data closely, except that the first level (0) is missing and seems to be merged with the second level (1), even though the levels are in the data? It seems that brms does not predict the most common level at all?
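One additional check, in case it helps pinpoint the issue (I am not sure whether posterior_predict() is supposed to return the original factor labels here, or something else such as internal category indices):
# Inspect the raw values returned by posterior_predict() for the ordinal model
str(pp_ord)
range(pp_ord)
head(sort(unique(as.vector(pp_ord)))) # smallest values that actually occur in the predictions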
My questions are as follows:
Question 1: Is it possible to use posterior_epred from a brms model with family = cumulative() to obtain expected predictions on the original scale (number of days)? I assume I have to convert the values according to the intercepts/thresholds (or weight the category probabilities, as sketched above), but I am not sure how to do this correctly, or whether the approach makes sense at all.
Question 2: Is there an error with posterior_predict in the above example, or how come not a single zero is predicted, while all other values/levels of the factor seem to be reproduced very well?
Question 3, less important: Any additional suggestions for modelling an outcome like this would be very welcome.
Thanks in advance for any help.
- Operating System: macOS Catalina 10.15.7
- brms Version: 2.14.4
- cmdstanr backend, version 0.3.0 with cmdstan 2.25.0