Estimating between-group differences in proportions in a bernoulli regression

In his amazing course on Bayesian modeling for psychologists @mike-lawrence discusses how to estimate between-group differences in proportions from a bernoulli regression model (see [https://youtu.be/b4t96ZaZHdQ] about an hour in).

Now to me this sounds much more intuitive than odds ratios, so I just want to check I’m doing it right.

Here’s some data where 24-hour caffeine-deprived heavy coffee drinkers indicate whether they expected the beverage they were about to be given to make them feel better. One group were told they were going to be given tap water, another that they woudl be given decaf, and a third that they would be given caffeinated coffee.

Please excuse the R code if you use different software and the tidyverse syntax if you use base R.

df3_postExp <- data.frame(id = 1:62, 
                          group = factor(c(rep("toldWater", 20),
                                                      rep("toldDecaf", 22),
                                                      rep("toldCaffeine", 20)), 
                                         levels = c("toldWater", "toldDecaf", "toldCaffeine")),
                          expEffect = factor(c(rep("expNoEffect",16), rep("expFeelBetter",4),
                                               rep("expNoEffect",16), rep("expFeelBetter",6),
                                               rep("expNoEffect",4), rep("expFeelBetter",16)),
                                             levels = c("expNoEffect", "expFeelBetter")))

20% in the told water group expected to feel better, 27.3% in the told decaf group, 80% in the told caffeine group.

If we analyse this in brms using a bernoulli regression…

library(brms)

# run model
mod <- brm(formula = expEffect ~ group,
           family = bernoulli(link = "logit"),
           data = df3_postExp,
           seed = 1234)

# extract parameter estimates
samples <- as.data.frame(as.mcmc(mod, combine_chains = T))

This returns three parameters, an intercept coefficient representing the log-odds of expecting to feel better in the told water group, a coefficient representing the difference in log odds between the told decaf and told water groups, and a coefficient representing the difference in log odds between the told caffeine and the told water groups.

Now I am going to follow the procedure for estimating the between-group difference in proportion as I understand it. First we calculate the log odds of expecting to feel better in each group by adding together the chains of parameter estimates.

samples %>% 
  mutate(lo_tw = b_Intercept,
         lo_td = b_Intercept + b_grouptoldDecaf,  
         lo_tc = b_Intercept + b_grouptoldCaffeine) -> samples

Next we apply the logit transformation to convert these ‘cell’ log-odds into proportions, using the plogis() function in R

 # estimated proportion in each cell
samples %>%
  mutate(prop_tw = plogis(lo_tw),
         prop_td = plogis(lo_td),
         prop_tc = plogis(lo_tc)) -> samples

Now if we take the mean of each of these new columns

mean(samples$prop_tw)
mean(samples$prop_td)
mean(samples$prop_tc)

# output
[1] 0.2037076
[1] 0.2746663
[1] 0.8012841

These estimates clearly are excellent estimates of the true proportions in each group: ~ 20% in the told water group, ~ 27% in the told decaf group, and 80% in the told caffeine group.

We can also go a step further and estimate the differences in proportion between groups

# difference in estimated proportion
samples %>%
  mutate(diff_twtd = prop_td - prop_tw,
         diff_twtc = prop_tc - prop_tw,
         diff_tdtc = prop_tc - prop_td) -> samples

If we take the mean of these new ‘difference in estimated proportion’ column

mean(samples$diff_twtd)
mean(samples$diff_twtc)
mean(samples$diff_tdtc)

# output
[1] 0.07095875
[1] 0.5975765
[1] 0.5266178

These estimated differences in proportions also closely match the actual differences in proportions: 27 - 20 = 7% difference in proportion who expected to feel better between the told decaf and the told water group, 80 - 20 = 60% difference between the told caffeine and the told water group, and 80 - 27 = 53% difference between the told caffeine and the told decaf group.

Now my question is is all of this kosher?

Am I doing this the right way, or should I be generating predictions from the model and then estimating group differences from those posterior predictions?

As a follow-up question, if this method is correct could I use the same technique for within-subjects models? I only because in within-subject models each subject also has at least their own intercept. parameter)?

Hi, sorry for the wait… it’s an excellent question you posted.

In my view this is ok, what you’ve done. You’re basically doing a posterior_epred(). Have you done a posterior_predict() and looked at what they imply?

1 Like

You’re doing a lot “by-hand” and as @torkar notes there are functions for more easily achieving the same computations.

Yup, same procedure would apply to within-Ss models; those will still have coefficients reflecting the mean-across-participants effects.

1 Like

Thanks for the inpute @torkar. I will most certainly look into posterior_epred.

And thanks @mike-lawrence. Good to know the population-averaged effects still hold in within-subjects contexts.