I’m working on a longitudinal analysis of a binary outcome across 4 intervention arms using a GLMM, and I am confused about the response values shown in the plots generated by conditional_effects().
This is the plot of predicted probabilities from predict():
The numbers roughly match an earlier plot I made when looking at the outcome at each follow-up visit. The percentage of the outcome event for the AMMI + PS + Coach group goes above 20% and the predicted probability in the last plot goes above 0.20.
However, when I use conditional_effects(), I get the following plot, where the magnitude of the response is much smaller. (The x-scale is different, but the result is the same if I transform back to days since baseline.)
Is conditional_effects() giving me predicted probabilities or something else? I’m also wondering whether the difference might be due to how the random effect is handled, or to the smoothing I apply in the first plot.
Here is my code:
# code for plotting raw data at each visit
d %>%
  group_by(VISIT, arm, PREP_RE) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  mutate(`%` = n / sum(n) * 100) %>%
  filter(PREP_RE == 1) %>%
  ggplot(aes(x = VISIT, y = `%`, group = arm, color = arm)) +
  geom_line() +
  geom_point()
# model
mod_quad = brm(PREP_RE ~ arm*scaled_time + (1|ENROLLMENT_ID) + I(scaled_time^2),
data = d, cores = 4, iter = 2000,
family = bernoulli(link = "logit"))
# conditional effects plot
conditional_effects(mod_quad)
# get predicted probabilities
d2 = d %>%
filter(!is.na(PREP_RE)) %>%
mutate(predicted = predict(mod_quad, type = "response")[ , 1])
# plot predicted probability
ggplot(d2, aes(x = time, y = predicted, color = arm)) +
geom_smooth(se = F) +
labs(x = "Days since baseline", y = "predicted probability")
Welcome. Thanks for the detailed post, and sorry about the delay.
These are two different things. predict() gives you predictions based on the fitted model, while conditional_effects() shows the effects of categorical or numeric predictors. I tend to think of predict as "what could happen given some future data" and conditional_effects as "what is happening right now" given the current data and model.
predict (which is a wrapper around posterior_predict) returns the posterior mean and 95% credible interval for each data point, including uncertainty due to residual error. (Also, I don’t think the predict method for brmsfit objects has a type argument; it returns predictions on the response scale by default. See the help for predict.brmsfit.)
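For example, a minimal sketch with your mod_quad (just to show what each function returns):
# predict() summarises draws from posterior_predict() into a matrix with
# columns Estimate, Est.Error, Q2.5 and Q97.5 (one row per observation)
pred_summary <- predict(mod_quad)
head(pred_summary)
# the underlying draws: one row per posterior draw, one column per observation;
# for a bernoulli model these are simulated 0/1 outcomes
pred_draws <- posterior_predict(mod_quad)
dim(pred_draws)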
conditional_effects, on the other hand, by default uses posterior_epred (this can be changed with the method argument), which returns the expected (fitted) values from the posterior distribution (that is, excluding uncertainty due to residual error). (FYI, just as you can use predict as a wrapper around posterior_predict, you can also use fitted as a wrapper around posterior_epred.) The uncertainty bands with conditional_effects should therefore be narrower than those returned by predict on the same data values.
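You can see the difference concretely with something like this (again just a sketch with your model):
# fitted() wraps posterior_epred() and returns expected probabilities,
# without the extra 0/1 sampling step that posterior_predict() adds
fit_summary <- fitted(mod_quad)
head(fit_summary)
# compare interval widths for the first few observations:
head(fit_summary[, c("Q2.5", "Q97.5")])
head(predict(mod_quad)[, c("Q2.5", "Q97.5")])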
For a given predictor variable, conditional_effects sets other fixed effects to their mean (for numeric) or reference level (for categorical). conditional_effects actually returns a list containing a data frame for each predictor and interaction. These data frames are then used to generate the conditional_effects plots. You can see the structure of the returned object if you do:
ce = conditional_effects(mod_quad)
str(ce)
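You can then pull out and plot the individual data frames yourself. The exact element names depend on your model terms, so check names(ce) first:
names(ce)      # e.g. "arm", "scaled_time", "arm:scaled_time" (names depend on the model)
head(ce[[1]])  # first data frame, with estimate__, lower__ and upper__ columns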
One thing to be aware of is that by default conditional_effects excludes random effects when calculating predictions. You can include random effects by setting re_formula=NULL (the default is re_formula=NA). On the other hand, predict/posterior_predict and fitted/posterior_epred include random effects by default but you can exclude them by setting re_formula=NA.
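A quick sketch of that difference with predict (the re_formula argument works the same way across these prediction functions):
# default: predict() includes the (1 | ENROLLMENT_ID) intercepts
pred_with_re <- predict(mod_quad)
# population-level predictions only, matching the conditional_effects()
# default of re_formula = NA
pred_fixed <- predict(mod_quad, re_formula = NA)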
I’m not sure why the response values are so different in your various plots. As far as I know, for brmsfit objects, the predict and fitted functions return predictions on the response scale by default. It’s hard to tell what’s going on without knowing more about d and d2. Also, using the predicted value for each data point (from the predict function) and then geom_smooth to average over those averages doesn’t seem kosher. I think you need to work with the raw posterior draws and then calculate some sort of “average marginal effect” but that one is above my pay grade. Hopefully one of the experts will chime in.
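In case it helps, here is a rough sketch of one way to do that by hand: averaging the expected probabilities over the subjects in the data within each posterior draw, then summarising across draws. This is just an illustration, not necessarily the right marginal quantity for your design:
library(dplyr)
library(tidyr)
# expected probabilities: one row per posterior draw, one column per observation
epred <- posterior_epred(mod_quad)
d_fit <- mod_quad$data   # the data brm() actually used
# label the draws, attach arm and time, then reshape to long format
epred_long <- as.data.frame(t(epred)) %>%
  setNames(paste0("draw_", seq_len(nrow(epred)))) %>%
  bind_cols(d_fit %>% select(arm, scaled_time)) %>%
  pivot_longer(starts_with("draw_"), names_to = "draw", values_to = "p")
avg_by_arm_time <- epred_long %>%
  group_by(arm, scaled_time, draw) %>%
  summarise(p = mean(p), .groups = "drop") %>%   # average over subjects within each draw
  group_by(arm, scaled_time) %>%
  summarise(estimate = mean(p),
            lower = quantile(p, 0.025),
            upper = quantile(p, 0.975),
            .groups = "drop")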
I think the issue is due to whether I take into account the random effects. If I set re_formula=NA in my predict call, then I get results similar to what I am seeing with conditional_effects().
This is a good suggestion, and I think I really should have been working with the raw samples in the first place. I was avoiding this approach initially because we had several different models and it would have been a headache to code this for all of them. Now that we’ve pared down our list of models, I think this is the right way to go.