Differing posterior predictive checks for logistic binomial model with and without addition-terms

Hi all,

Not so much an issue as double-checking expected behaviour:

I’m fitting a logistic binomial model where the response variable is the sum of how many times a target picture was looked at in a certain time period (out of how many looks there were at all pictures during that period). This kind of response variable falls under “addition-terms” according to the brms documentation. The model estimates are plausible and the fit is good, but the posterior predictive check isn’t as good as when I fit a regular logistic binomial model to the same data unaggregated. Below is a dummy example with models with and without addition-terms.

My questions are:

  • The posterior predictive check of the addition-terms model isn’t actually bad, but is it expected that it would not be as clean as that of the regular model? And if so, why?
  • Out of curiosity, is there any way to do a predictive check with the addition-terms model on the binomial scale rather than predicting the sum part of the addition-terms?
# fake data

library(brms)
library(dplyr)

set.seed(1)  # for reproducibility

## long format
(dat_long <- data.frame(
  subj = rep(1:10, each = 100),
  item = rep(rep(1:10, each = 10), times = 10),
  bin  = rep(1:10, times = 100),
  cond = rep(c(-.5, .5), times = 500),
  pTarget = rbinom(1000, 1, .6)
))

## aggregated over bins/items
(dat_aggregated <- dat_long %>%
  dplyr::group_by(subj, cond) %>%
  dplyr::summarise(sum = sum(pTarget), N = length(pTarget)))


# model using addition-terms
# (priors is assumed to be defined beforehand, e.g. with brms::set_prior())
m_aggregated <- brm(formula = sum | trials(N) ~ cond,
                    family = binomial(),
                    iter = 5000,
                    prior = priors,
                    data = dat_aggregated)

# regular model
m_long <- brm(formula = pTarget ~ cond,
              iter = 1000,
              family = binomial(),
              prior = priors,
              data = dat_long)


# posterior predictive checks
pp_check(m_aggregated)
pp_check(m_long)

[posterior predictive check plots from pp_check(m_aggregated) and pp_check(m_long)]

  • Operating System: Windows 10
  • brms Version: 2.14

Thanks!


Sorry for not getting to you earlier. Your question is relevant and well written, and I am glad you came up with a neat minimal example of the problem.

I would prefer to call the models “aggregated” and “disaggregated”, because you could (and maybe even should, for clarity) write the disaggregated model with addition terms (pTarget | trials(1) ~ cond), and because addition terms are also used for other, very different things.
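To make that concrete, here is a minimal sketch of the disaggregated model written with an explicit addition term, reusing the dat_long and priors objects from above (the name m_long_trials is just illustrative):

# disaggregated model with an explicit trials() addition term;
# this should be equivalent to the pTarget ~ cond model above
m_long_trials <- brm(formula = pTarget | trials(1) ~ cond,
                     family = binomial(),
                     iter = 1000,
                     prior = priors,
                     data = dat_long)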

I think what you observe is a combination of a few things:

  1. The default “density” check is not very useful for binary outcomes (or, more generally, for discrete outcomes with few possible values). We know that only 0/1 outcomes are possible, so most of the density plot tells us basically nothing useful and only makes the scale weird. Using pp_check(m_long, type = "bars", nsamples = 1000) gives, IMHO, a better look (see the code sketch after this list).

  2. The aggregated model has many more possible outcomes for each row of the dataset: if we expect, say, 25 out of 40 trials to turn out successful, seeing 24 or 26 is just not surprising at all. In the disaggregated model this gets distributed across many rows and averaged out, but for the aggregated model we simply expect a bit more variability. Once again, the density plot is a bit misleading, as the actual data are not so smooth. When I run the bars check for the aggregated model, I get an ugly but somewhat revealing plot.
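For reference, a minimal sketch of both bars-type checks mentioned above (nsamples is the argument name in brms around version 2.14; newer versions use ndraws instead):

# bars-type posterior predictive checks for the two models
pp_check(m_long, type = "bars", nsamples = 1000)
pp_check(m_aggregated, type = "bars", nsamples = 1000)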

So we see that the actual data y are not really smooth, and we expect a lot of variability in the middle portion of the plot (I hope it is clear what the plot shows; if not, feel free to ask for clarification).

As for your second question: yes, but you’ll need to do this manually. You can make predictions with posterior_predict, divide them by the number of trials for each row, do the same for the observed data, and then pass both to a bayesplot function, e.g.:

pred <- posterior_predict(m_aggregated, cores = 1)
# pred has one column per row of dat_aggregated, so divide each column
# (i.e. each observation) by its corresponding number of trials
pred_divided <- sweep(pred, MARGIN = 2, STATS = dat_aggregated$N, FUN = "/")
y_divided <- dat_aggregated$sum / dat_aggregated$N

# overlay a random subset of predicted proportion distributions on the observed one
samples_to_show <- sample(nrow(pred_divided), size = 30)
bayesplot::ppc_dens_overlay(y_divided, pred_divided[samples_to_show, ])

Hope that helps at least a little.


Thanks for the response Martin, it was extremely helpful!