Simple PPC for logistic regression

Greetings all,

A while back I asked about the simplest way to obtain a posterior predictive check for logistic regression. The ppc_error_binned function in bayesplot doesn't seem to work well, as I think it requires proportions rather than binary 0/1 outcomes. Any suggestions would be most appreciated.

Thanks,

David

My modelString is

modelString ="
data {
int <lower=0> n;
vector[n] age;
vector[n] male;
int <lower=0,upper=1> chd[n];
}
parameters {
real alpha;
real beta1;
real beta2;
}

transformed parameters {
real <lower=0> oddsbeta1;
real <lower=0> oddsbeta2;

oddsbeta1 = exp(beta1);
oddsbeta2 = exp(beta2);

}

model {
for (i in 1:n) {
chd[i] ~ bernoulli_logit(alpha + beta1 * male[i] + beta2 * age[i]);
}

// Priors
alpha ~ normal(0, .01);
beta1 ~ normal(2, .01);
beta2 ~ normal(.5, .01);
}

// To obtain the loo for logistic regression

generated quantities {
vector[n] log_lik;
int chd_rep[n];
for (i in 1:n) {
chd_rep[i] = bernoulli_logit_rng(alpha + beta1 * male[i] + beta2 * age[i]);
log_lik[i] = bernoulli_logit_lpmf(chd[i] | alpha + beta1 * male[i] + beta2 * age[i]);
}
}
"


Hi David,

Do the responses in this thread cover what you’re after?

Hi Andrew,

The code I sent you got a bit messed up in the copying. In any case, the links that were in the previous post about PPCs for logistic regression were not helpful. Here is the code

modelString ="
data {
int <lower=0> n;
vector[n] age;
vector[n] male;
int <lower=0,upper=1> chd[n];
}
parameters {
real alpha;
real beta1;
real beta2;
}

transformed parameters {
real <lower=0> oddsbeta1;
real <lower=0> oddsbeta2;

oddsbeta1 = exp(beta1);
oddsbeta2 = exp(beta2);

}

model {
for (i in 1:n) {
chd[i] ~ bernoulli_logit(alpha + beta1 * age[i] + beta2 * male[i]);
}

// Priors
alpha ~ normal(0, .01);
beta1 ~ normal(2, .01);
beta2 ~ normal(.5, .01);
}

generated quantities {
real chd_rep[n];
for (i in 1:n) {
chd_rep[i] = bernoulli_logit_rng(alpha + beta1 * age[i] + beta2 * male[i]);
}

}
"

Everything runs just fine, but when I run the command

ppc_stat(y = chd, yrep = chd_rep, stat = "mean")

I get the following message

Error: object ‘chd_rep’ not found

Thanks for looking at this.

David

Hi David,

The error:

Error: object ‘chd_rep’ not found

implies that you haven't extracted the chd_rep draws from the stanfit object before calling ppc_stat.

So given a stan model fitted via:

fitt_mod = stan(model, data)

You need to extract chd_rep into a matrix before passing it to ppc_stat; you can do this via:

chd_rep = extract(fitt_mod, "chd_rep")$chd_rep
ppc_stat(y = chd, yrep = chd_rep)
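As a quick sanity check (a hypothetical sketch with simulated numbers, not the actual fit), the object you pass as yrep should be a draws-by-observations matrix, with one column per data point:

```r
# Toy stand-in for the matrix that extract() returns: each row is one
# posterior draw, each column one observation, each entry a simulated 0/1.
set.seed(1)
n_draws <- 4000
n_obs <- 10
yrep_toy <- matrix(rbinom(n_draws * n_obs, size = 1, prob = 0.4),
                   nrow = n_draws, ncol = n_obs)
y_toy <- rbinom(n_obs, size = 1, prob = 0.4)

# bayesplot's ppc_* functions require ncol(yrep) == length(y)
stopifnot(ncol(yrep_toy) == length(y_toy))
```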

Hi Andrew,

That did the trick, but I'm still left with the problem of finding the correct way to plot chd against chd_rep. Every plot I try in the ppc_ family gives very strange figures. chd is coded 0/1, but the model should give the logit of chd and the logit of chd_rep, so a nice simple PPC density or histogram should come out. For example, when I run the following

chd_rep = extract(myfitCHD, "chd_rep")$chd_rep
ppc_dens_overlay(y = chd, yrep = chd_rep)

I get the following plot which seems strange to me.

This is what I meant in my first email about not finding a good way to display PPCs for logistic regression.

Thanks so much for your help and patience.

David

No worries! This is an area where @jonah would have some good advice. Jonah, is there a recommended way to set up the PPC plotting for binary data from a binomial model?


Yeah, if you have a binary outcome it can be tricky to think of how to plot it. It's easier to make plots on the logit scale if you have a binomial outcome with number of trials > 1, but for binary data there are some other options.

Using an rstanarm model as an example:

library(bayesplot)
library(rstanarm)

# switch is 0/1
fit <- stan_glm(switch ~ I(dist/100) + log(arsenic), data = wells, family = binomial)
y <- wells$switch
yrep <- posterior_predict(fit) # draws from the posterior predictive distribution 

One thing you can look at is Pr(y = 1), by comparing the proportion of 1s in the data vs. the proportion of 1s in each draw from the posterior predictive distribution:

ppc_stat(y, yrep, stat = "mean") 

You can look at the distribution of other functions too using ppc_stat. The stat argument can be any function that takes a vector and returns a scalar, so, e.g., stat = "sd" would work too.
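For instance, stat can name a function you define yourself (prop_zero below is an illustrative name, following the pattern used in the bayesplot vignettes):

```r
# Custom test statistic: the proportion of 0s in a binary vector
prop_zero <- function(x) mean(x == 0)
prop_zero(c(0, 0, 1, 1))  # 0.5

# Then, with y and yrep as extracted above:
# ppc_stat(y, yrep, stat = "prop_zero")
```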

If you want to look at the distribution of 0s and 1s itself you can use ppc_bars():

ppc_bars(y, yrep)

There are also the ppc_stat_grouped() and ppc_bars_grouped() functions that let you make each of these plots for separate levels of a grouping variable, if that exists in your model.
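As a rough base-R sketch of the per-group statistic these plots compare (simulated data; g is a hypothetical grouping variable, and the actual plotting call would be something like ppc_stat_grouped(y, yrep, group = g, stat = "mean")):

```r
set.seed(2)
g <- rep(c("A", "B"), each = 50)                 # hypothetical grouping variable
y <- rbinom(100, 1, ifelse(g == "A", 0.3, 0.7))  # simulated 0/1 outcomes

# Observed proportion of 1s within each group -- ppc_stat_grouped()
# compares this statistic to its posterior predictive distribution per group
obs_by_group <- tapply(y, g, mean)
obs_by_group
```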

Hi Jonah,

That really helps. So, when I see a plot like this

it is telling me that while about 43% of the sample actually has heart disease, the model predicts that 100% of the sample has CHD. That's a fairly high number of false positives.

Thanks

David

Glad that’s helpful. Yeah it looks like the sample is somewhat balanced but the model is predicting all 1s and no 0s. Presumably the value of

inv_logit(alpha + beta1*age[i] + beta2*male[i])

is coming out to 1 or nearly 1 every time it's calculated. One thing to consider is that this prior with mean 2 and SD 0.01,

beta1 ~ normal(2, .01)

is super informative, implying that you pretty much know for sure that beta1 is between 1.97 and 2.03. If that's true given the prior information you have then that's fine, but 2 is pretty big on the logit scale, so this is probably contributing to the probability being so close to 1.
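You can see the saturation directly in R with plogis(), R's inverse logit (this sketch assumes age is on its raw scale in years, which the thread doesn't state):

```r
# With the prior centred at beta1 = 2 multiplying age in years,
# the linear predictor is enormous and the probability saturates at 1
plogis(0 + 2 * 40 + 0.5 * 1)  # a 40-year-old male: essentially 1

# Even a coefficient of 2 on its own is large on the logit scale
plogis(2)  # about 0.88
```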


And that's because those are not the right priors. :-/ Everything looks more sensible now. Thanks.

David
