Negative binomial brms model: assessing posterior predictive check and changing dependent variable units

I am trying to model some simple count data and have two questions. The "units" variable contains discrete values showing how many units of consultation these subjects needed to solve a complex task. As you can see, some needed considerably more than others.

1. Is the pp_check acceptable or should I improve it? If I should improve it, how to do it?

library("tidyverse")
df <- read_delim("https://www.dropbox.com/s/ilgye6n2ahgnbig/df.csv?dl=1",
                 delim = ";", escape_double = FALSE, trim_ws = TRUE)

library("brms")
m <- brm(units ~ group, data = df, family = "negbinomial")

# named to avoid masking base min()/max()
units_min <- min(df$units)
units_max <- max(df$units)
pp_check(m, type = "bars") + xlim(units_min, units_max)
pp_check(m, type = "bars_grouped", group = "group") + xlim(units_min, units_max)


2. The output of the model seems to give means instead of medians. Is there a way to change this?

A <- df %>% 
  ggplot(aes(group, units)) +
  geom_boxplot() +
  ylim(0, 10) +
  ggtitle("plot on raw data")

modelplot <- conditional_effects(m, robust = TRUE)
B <- plot(modelplot, plot = FALSE)[[1]] + 
  ylim(0, 10) +
  ggtitle("model")

library("patchwork")
A + B

Edit 1: updated the post according to the good suggestions made by JLC and wgoette.

With discrete outcomes you might find pp_check(m, type = "bars") or pp_check(m, type = "bars_grouped", group = "group") more informative.


I agree with @JLC: the type = "bars" argument is better for this case. The advantage is that the plot shows error bars as well, helping you visualize more clearly whether your model is capturing the overall shape. I’m not sure, however, whether the plot will be easy to read in your case, since it looks like your x-axis goes all the way out to 75. If there is an upper bound to your raw data, then it might be helpful to do the following to zoom into the relevant part of the posterior predictive check (though it would still be important to remember that the model gives out-of-range estimates in that case):

library(ggplot2)
# named to avoid masking base min()/max()
units_min <- min(df$units)
units_max <- max(df$units)
pp_check(m, type = "bars") + 
  xlim(units_min, units_max)

As far as improving the posterior predictive check goes, if you decide you need to, then you can consider whether an alternative likelihood is needed and whether there are additional covariates that explain the data-generating process. Part of this also depends on what your goals are. If you want a model that produces accurate predictions and captures the data-generating process, then I’d think that this posterior predictive check is not adequate. If you just want something that summarizes the general shape and central tendency of the distribution so that you can do inference (e.g., comparing mean differences between groups), then maybe this posterior predictive check is “good enough.”
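If it helps, here is a minimal sketch of one way to weigh an alternative likelihood, using approximate leave-one-out cross-validation (untested; m_pois is just a hypothetical Poisson refit of your model):

# Sketch: compare the negative binomial model against a Poisson
# alternative; loo_compare() ranks models by expected predictive
# accuracy (elpd), with the better-predicting model on top.
m_pois <- brm(units ~ group, data = df, family = "poisson")
loo_compare(loo(m), loo(m_pois))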

I’d also just throw out that perhaps 500 iterations is too few (not sure if that’s just for testing purposes or not). Similarly, I’m not sure what the scale of your outcome variable (units) is supposed to be: the posterior predictive check plot suggests observed values as high as ~30, yet your boxplots bound the y-axis at 0 and 10.
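For reference, a sketch of a refit at the brms defaults (4 chains of 2000 iterations each, half of which are warmup):

# Sketch: refit with more iterations than 500; chains and iter are
# standard brm() arguments, shown here at their default values.
m <- brm(units ~ group, data = df, family = "negbinomial",
         chains = 4, iter = 2000)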

I’m not sure what version of brms you are using. The most recent documentation of conditional_effects() indicates that the robust argument is set to TRUE by default, which means medians are reported instead of means. Regardless, you may try conditional_effects(m, robust = TRUE) to explicitly request medians instead of means.


Thank you JLC and wgoette for the very good thoughts! I did as you recommended and updated the original post.

Does the pp_check now seem reasonable for the model run with 2000 iterations? Is there a quick trick for improving the model?

And I should have the latest version of brms. I think the robust argument gives the posterior median, which here is mostly similar to the mean. However, can I do something similar to lognormal models, where the model output can be either the mean or the median, both of which are very similar to the raw-data estimates?

I don’t know about anything quick. Improving a model depends on what the end goal is. If you want to produce a “better” posterior predictive check, then you need to think about the data-generating process. Is there a likelihood other than the negative binomial that describes the raw data better or more appropriately? Are there other variables that predict the outcome that you can include in the model? You may also consider adding predictors to the distributional parameters of the likelihood; I believe that in brms you can do the following for the negative binomial: bf(units ~ group, shape ~ group) (see the sketch below). If I had to guess, you can’t do much better without additional variables, since there’s only so much information about the outcome (units) that you can learn from the predictor (group). No matter how much tweaking and modification you do to the model, there’s no way to overcome inadequate information. Still, specifying a correct likelihood will go a long way.
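A minimal sketch of that distributional model (untested; the shape formula lets the negative binomial’s overdispersion vary by group):

# Sketch: model both the mean and the shape (overdispersion) of the
# negative binomial as functions of group.
m_dist <- brm(bf(units ~ group, shape ~ group),
              data = df, family = "negbinomial")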

You may need to play around with the additional arguments passed from conditional_effects() to either fitted() or predict(). I believe you’re asking about the scale on which the estimates are returned. You can use scale = "response" to get the posterior estimates on the scale of the outcome. My guess would be that by default the conditional-effects plots you’re showing give you the fitted estimates, which will always have narrower confidence intervals as well. You might want to pass method = "predict" so that you are plotting the posterior predictive estimates (note that I don’t think scale = "response" will work if you do the predictions, but I might be wrong there).
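Something along these lines, for example (untested sketch):

# Sketch: plot posterior predictive estimates rather than fitted ones;
# predictions pass draws through the full likelihood, so the intervals
# also reflect observation-level noise.
conditional_effects(m, method = "predict", robust = TRUE)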
