# Posterior predictive check for binary outcome in Stan

Hi,

Recently I have some observed binary data Y and I am interested in the posterior distribution check of the following simple model (this is a simplified version so there might be some mistakes in the Stan code):

``````data{
vector[N] y;
vector[N] x;
}
model {
p = invlogit(b1 + b2x);
y ~ Bernouli(p);
}
``````

If I am interested in posterior predictive distribution check, my idea is in the generated quantities block adding the following:

``````generated quantities {
vector[N] y_predcited;
y_predicted ~ Bernouli(p);
}
``````

Then compare with y and y_predicted. I am not sure whether the above idea is correct or not. Also I am not sure how Stan uses HMC in predictive posterior distribution. For example for p we have 1000 simulated draws from the posterior, but how about y_predict? Will we have only N y_predicted values or 1000*N y_predicted values? How could we account for the uncertainty in y_predicted?

Thx!

I edited your model some more. Not quite sure if this runs and works but it should be close:

``````data {
int y[N]; // The data for a bernoulli must be an integer
// vectors are only for doubles
vector[N] x;
}

parameters {
real b1;
real b2;
}

model {
b1 ~ normal(0, 1); // should always put priors on things
b2 ~ normal(0, 1); // improper priors can lead to unexpected
// nastiness
y ~ bernoulli_logit(b1 + b2 * x); // bernoulli_logit has
// the inverse logit built in and the numerics are
// more stable
}

generated quantities {
int y_predicted[N];
for(n in 1:N) {
y_predicted[n] = bernoulli_logit_rng(b1 + b2 * x[n]);
}
}
``````

There’s info on logistic regressions here: https://mc-stan.org/docs/2_25/stan-users-guide/logistic-probit-regression-section.html

For the posterior predictive, generated quantities is evaluated once for every posterior draw you generate.

So if you generate 4000 posterior draws (the default 1000 draws with 4 chains), then you will get 4000 draws of the length N array `y_predicted` (so 4000xN integers).

The randomness there comes from two places, the uncertainty in `b1` and `b2` and the randomness in generating a `bernoulli` random number.