Posterior Prediction from Logit Regression

llewmills · December 7, 2019, 2:19pm

I can certainly sympathise with the apparent desperation in this post (https://stats.stackexchange.com/questions/281235/posterior-predictive-checking-for-bernoulli-distributed-data). It seems like when you ask how to do posterior predictive checks for a logistic regression you get ‘look at this vignette’ or snippets of code that get you close to where you need to be by showing you how to generate the draws from the posterior (see PPC for simple logistic regression in Stan and Posterior predictive check for binomial regression) but only take you that far. I fully understand that the wonderful people at the Stan project administer this website in their spare time and thus don’t have time to work up a full answer. So I have no right to expect details, but I sure would like some.

Can someone help me by actually showing me how to graph a PPC with some data and code I provide?

In this dataset there are three groups of caffeine-deprived coffee drinkers. One group was given water, one was given decaf, and one was given caffeinated coffee. Each participant indicated whether their caffeine withdrawal was alleviated after they drank the beverage they were given (response = “feelBetter” = 1) or not (response = “noEffect”).

Here is the data

library(rstan)
library(dplyr)
df <- data.frame(id = 1:62, 
                 group = factor(c(rep("water", 20),
                                  rep("decaf", 22),
                                  rep("caffeine", 20)), 
                                levels = c("water", "decaf", "caffeine")),
                 effect = factor(c(rep("noEffect",16), rep("feelBetter",4),
                                   rep("noEffect",16), rep("feelBetter",6),
                                   rep("noEffect",4), rep("feelBetter",16)),
                                 levels = c("noEffect", "feelBetter")))

So, taking group into account, how many people expected that their beverage would make them feel better?

df %>% group_by(group, effect) %>% 
       summarise(count = n()) %>% 
       mutate(perc = round(100*count/sum(count),2))

# output

#   group    effect     count  perc
# 1 water    noEffect      16  80  
# 2 water    feelBetter     4  20  
# 3 decaf    noEffect      16  72.7
# 4 decaf    feelBetter     6  27.3
# 5 caffeine noEffect       4  20  
# 6 caffeine feelBetter    16  80

I ran the model as a logistic regression with levels-means coding of group factors (each group getting their own intercept parameter). Here is the model code.

dList <- list(N = nrow(df),
              nGroup = nlevels(df$group),
              gIndex = as.integer(df$group),
              effect = as.integer(df$effect) - 1) 

ms <- "
  data{
    int<lower=0> N;
    int<lower=0> nGroup;
    int<lower=1> gIndex[N];
    int<lower=0> effect[N];
  }

  parameters{
    vector [nGroup] a;
  }

  model{
    a ~ normal(0,10);
    for (i in 1:N) {
    effect[i] ~ bernoulli_logit(a[gIndex[i]]);
    }
  }
 
  generated quantities{
    real y_draw[N];
    for (i in 1:N) {
      y_draw[i] = bernoulli_logit_rng(a[gIndex[i]]);
    }
  }
"

postExp <- stan(model_code = ms,
                data = dList,
                warmup = 1e3,
                iter = 4e3,
                cores = 1,
                chains = 1, 
                seed = 2334)

The summary…

print(postExp, pars = "a", probs = c(0.025, 0.975))

# output
#       mean se_mean   sd  2.5% 97.5% n_eff Rhat
# a[1] -1.50    0.01 0.58 -2.72 -0.46  2172    1
# a[2] -1.04    0.01 0.48 -2.02 -0.12  2415    1
# a[3]  1.46    0.01 0.58  0.44  2.68  1892    1

…all looks about right. I know this because when I convert these group log odds to probabilities the model predicted proportions of people saying their beverage made them feel better matches the raw probabilities from the descriptive statistics above pretty closely

logOdds <- as.data.frame(summary(postExp, pars = "a", probs = c(0.025, 0.975))$summary)$mean
logitToProb <- function(lo) exp(lo)/(1+exp(lo))
groupProbs <- sapply(1:3, function (i) round(logitToProb(logOdds[i])*100,1))
names(groupProbs) = c("water", "decaf", "caffeine")
groupProbs

# output
# water    decaf caffeine 
#  18.2     26.2     81.1

So the model seems to be working. But I need PPCs. So how to do them?

In order to generate the draws from the joint posterior I extracted the draws from the generated quantities block.

post_draw <- data.frame(rstan::extract(postExp, pars = "y_draw"))

There are 3000 draws from the joint posterior for each observed data point, which in this case is each participant. That is 3000 0s and 1s, hopefully in the correct group proportions for each participant within each group

head(post_draw[,1:6])

# output
 # y_draw.1 y_draw.2 y_draw.3 y_draw.4 y_draw.5 y_draw.6
 #        1        0        0        0        0        0
 #        0        0        0        0        1        0
 #        0        0        0        1        0        0
 #        1        0        1        0        0        1
 #        0        1        1        0        0        0
 #        0        0        0        0        0        0

So great, I have what I need, but what do I do with this matrix of draws from the posterior?

I can take means and get upper and lower HPDIs for each row of the matrix and then bind that to the actual data, but how to graph it?

I tried the ppc_error_binned function from the bayesplot package but to be honest I didn’t really know what I was doing, since the vignette’s all seem to do with questions that don’t really suit my problem (e.g. poisson regression or logit regression where each value of the outcome is a count of successes within several trials rather than a single trial [0,1] value for each participant).

Again an example PPC figure based on my own data with the code and an explanation would be very much appreciated. Again I have no right to expect this, just a hope.

Bob_Carpenter · December 7, 2019, 2:35pm

The answer to their question is “yes”—you want to look at the y_rep[i], and with binary data, all you get are counts.

Thanks for the vote of confidence, but we don’t quite deserve that much sympathy. Many of us, including me, are paid to work full-time on Stan (well, not quite full-time for me—I have to spend a fraction of my time writing grant proposals to keep the team employed at Columbia Uni).

I’d suggest using parallel naming for PPC simulations. In this case, use effect and effect_sim or y and y_sim. Each run of the Stan program will give you y_sim values. What I would do is plot sum(y_sim) as a histogram over multiple runs, then place a vertical line at sum(y). The posterior p-value will be the proportion of sum(y_sim) greater than (or equal to—this is discretes, so it matters) sum(y). That shouldn’t be too close to 0 or 1.

You need an R expert for help with R. @jonah will know how you can do it.

FJCC · December 7, 2019, 6:06pm

Try appending these lines to your code.

library(bayesplot)
postDF <- as.data.frame(postExp)

ppc_stat_grouped(y = as.integer(df$effect) - 1, yrep = as.matrix(postDF[4:65]), group = df$group, stat = mean)

The postDF[4:65] picks out all ot the y_draw[n] columns from the data frame of the posterior, so yrep is a 3000 x 62 matrix. I should have used your post_draw matrix in that place! The group parameter labels each of those columns with its corresponding treatment.

llewmills · December 7, 2019, 7:11pm

Thank you @FJCC!. Can you explain what I’m looking at here?

FJCC · December 7, 2019, 8:24pm

The histograms show the distributions of the fraction of simulated data that resulted in feelBetter in each of the 3000 posterior draws. In your post_draw matrix, the first 20 columns are simulations using the a value for water (a[1]), the next 22 come from the a value for decaf, and the remaining columns are caffeine results. If we take the mean value in one row of the first 20 columns, that gives the fraction of feelBetter results produced by simulating data with the a[1] from that particular draw. Doing that for all 3000 rows provides a distribution of the fraction of feelBetter outcomes given all of the a[1] values explored in the posterior, the inherent spread of a bernoulli process, and the fact that you observe 20 individuals.
The graph shows that your observed results, the black vertical lines, are near the most likely results that the model predicts but that quite a range of results is reasonably probable.
Below is a manual calculation of the same result starting with the postDF data frame from my code. I hope it is clearer than my attempt to explain the calculation. Note that the plot order is different than in the plot from bayesplot.

Ydraw <- postDF[4:65]
GroupMeans <- data.frame(water = rowMeans(Ydraw[, 1:20]), 
                         decaf= rowMeans(Ydraw[, 21:42]), 
                         caffeine = rowMeans(Ydraw[, 43:62]))
GroupMeansTall <- tidyr::gather(GroupMeans, key = "group", value = "Value")

ggplot(GroupMeansTall, aes(x = Value)) + geom_histogram(fill = "skyblue", color = "darkblue") +
  facet_wrap(~group)

llewmills · December 7, 2019, 9:00pm

Thank you so much @FJCC! You have made my day.

So the function takes the row means for all the columns of 3000 simulated draws assigned to observations from each group (the mean of all the 0s and 1s being the same as the proportion of 1s) and then plots each as a histogram? That makes sense!

Thank you again

llewmills · December 7, 2019, 9:07pm

Thank you @Bob Carpenter, both for your answer to my question and for shining a light on the recondite inner workings of the Stan project. I wish I could give the solution to both yours and FJCC’s answers.

Most grant applications involve justifying how the project will contribute to the greater good. I think that should be easy when it comes to stan and what you guys are doing at Columbia, it is such amazing work. Of course I am assuming the presence on the grant review panel of assessors people can take the broader view and who can see how better inferential statistics practices can benefit all of mankind.

llewmills · December 8, 2019, 6:53am

@Bob Carpenter FJCC suggested taking the group means of [0,1] draws from the posterior, but in your answer you suggested summing these. Is summing a better technique for posterior prediction than averaging in this case?

Bob_Carpenter · December 14, 2019, 7:16pm

All averaging does is divide by the number of items, so the only effect is on the axis labels. The posterior p-values, for example, don’t change.

Topic		Replies	Views
PPC for simple logistic regression in Stan Modeling	2	867	October 31, 2018
Simple PPC for logistic regression Modeling	9	1488	September 1, 2020
Posterior predictive plots for regression model without pooling Modeling bayesplot , loo	6	941	April 26, 2018
Posterior predictive check for binary outcome in Stan Modeling	2	825	November 10, 2020
Use Posterior_predict in rstanarm to generate probabilites for each observation in a logistic regression model Modeling	6	1351	February 22, 2019

Posterior Prediction from Logit Regression

Related topics