I can certainly sympathise with the apparent desperation in this post (https://stats.stackexchange.com/questions/281235/posterior-predictive-checking-for-bernoulli-distributed-data). It seems like when you ask how to do posterior predictive checks for a logistic regression you get ‘look at this vignette’ or snippets of code that get you close to where you need to be by showing you how to generate the draws from the posterior (see PPC for simple logistic regression in Stan and Posterior predictive check for binomial regression) but only take you that far. I fully understand that the wonderful people at the Stan project administer this website in their spare time and thus don’t have time to work up a full answer. So I have no right to expect details, but I sure would like some.
Can someone help me by actually showing me how to graph a PPC with some data and code I provide?
In this dataset there are three groups of caffeine-deprived coffee drinkers. One group was given water, one was given decaf, and one was given caffeinated coffee. Each participant indicated whether their caffeine withdrawal was alleviated after they drank the beverage they were given (response = “feelBetter” = 1) or not (response = “noEffect”).
Here is the data
library(rstan)
library(dplyr)
df <- data.frame(id = 1:62,
group = factor(c(rep("water", 20),
rep("decaf", 22),
rep("caffeine", 20)),
levels = c("water", "decaf", "caffeine")),
effect = factor(c(rep("noEffect",16), rep("feelBetter",4),
rep("noEffect",16), rep("feelBetter",6),
rep("noEffect",4), rep("feelBetter",16)),
levels = c("noEffect", "feelBetter")))
So, taking group into account, how many people expected that their beverage would make them feel better?
df %>% group_by(group, effect) %>%
summarise(count = n()) %>%
mutate(perc = round(100*count/sum(count),2))
# output
# group effect count perc
# 1 water noEffect 16 80
# 2 water feelBetter 4 20
# 3 decaf noEffect 16 72.7
# 4 decaf feelBetter 6 27.3
# 5 caffeine noEffect 4 20
# 6 caffeine feelBetter 16 80
I ran the model as a logistic regression with levels-means coding of group factors (each group getting their own intercept parameter). Here is the model code.
dList <- list(N = nrow(df),
nGroup = nlevels(df$group),
gIndex = as.integer(df$group),
effect = as.integer(df$effect) - 1)
ms <- "
data{
int<lower=0> N;
int<lower=0> nGroup;
int<lower=1> gIndex[N];
int<lower=0> effect[N];
}
parameters{
vector [nGroup] a;
}
model{
a ~ normal(0,10);
for (i in 1:N) {
effect[i] ~ bernoulli_logit(a[gIndex[i]]);
}
}
generated quantities{
real y_draw[N];
for (i in 1:N) {
y_draw[i] = bernoulli_logit_rng(a[gIndex[i]]);
}
}
"
postExp <- stan(model_code = ms,
data = dList,
warmup = 1e3,
iter = 4e3,
cores = 1,
chains = 1,
seed = 2334)
The summary…
print(postExp, pars = "a", probs = c(0.025, 0.975))
# output
# mean se_mean sd 2.5% 97.5% n_eff Rhat
# a[1] -1.50 0.01 0.58 -2.72 -0.46 2172 1
# a[2] -1.04 0.01 0.48 -2.02 -0.12 2415 1
# a[3] 1.46 0.01 0.58 0.44 2.68 1892 1
…all looks about right. I know this because when I convert these group log odds to probabilities the model predicted proportions of people saying their beverage made them feel better matches the raw probabilities from the descriptive statistics above pretty closely
logOdds <- as.data.frame(summary(postExp, pars = "a", probs = c(0.025, 0.975))$summary)$mean
logitToProb <- function(lo) exp(lo)/(1+exp(lo))
groupProbs <- sapply(1:3, function (i) round(logitToProb(logOdds[i])*100,1))
names(groupProbs) = c("water", "decaf", "caffeine")
groupProbs
# output
# water decaf caffeine
# 18.2 26.2 81.1
So the model seems to be working. But I need PPCs. So how to do them?
In order to generate the draws from the joint posterior I extracted the draws from the generated quantities block.
post_draw <- data.frame(rstan::extract(postExp, pars = "y_draw"))
There are 3000 draws from the joint posterior for each observed data point, which in this case is each participant. That is 3000 0s and 1s, hopefully in the correct group proportions for each participant within each group
head(post_draw[,1:6])
# output
# y_draw.1 y_draw.2 y_draw.3 y_draw.4 y_draw.5 y_draw.6
# 1 0 0 0 0 0
# 0 0 0 0 1 0
# 0 0 0 1 0 0
# 1 0 1 0 0 1
# 0 1 1 0 0 0
# 0 0 0 0 0 0
So great, I have what I need, but what do I do with this matrix of draws from the posterior?
I can take means and get upper and lower HPDIs for each row of the matrix and then bind that to the actual data, but how to graph it?
I tried the ppc_error_binned
function from the bayesplot package but to be honest I didn’t really know what I was doing, since the vignette’s all seem to do with questions that don’t really suit my problem (e.g. poisson regression or logit regression where each value of the outcome is a count of successes within several trials rather than a single trial [0,1] value for each participant).
Again an example PPC figure based on my own data with the code and an explanation would be very much appreciated. Again I have no right to expect this, just a hope.