Posterior predictive check for binary outcome in Stan


Recently I have some observed binary data Y and I am interested in the posterior distribution check of the following simple model (this is a simplified version so there might be some mistakes in the Stan code):

vector[N] y;
vector[N] x;
model {
  p = invlogit(b1 + b2x);
  y ~ Bernouli(p); 

If I am interested in posterior predictive distribution check, my idea is in the generated quantities block adding the following:

generated quantities {
  vector[N] y_predcited;
  y_predicted ~ Bernouli(p); 

Then compare with y and y_predicted. I am not sure whether the above idea is correct or not. Also I am not sure how Stan uses HMC in predictive posterior distribution. For example for p we have 1000 simulated draws from the posterior, but how about y_predict? Will we have only N y_predicted values or 1000*N y_predicted values? How could we account for the uncertainty in y_predicted?


I edited your model some more. Not quite sure if this runs and works but it should be close:

data {
  int y[N]; // The data for a bernoulli must be an integer
            // vectors are only for doubles
  vector[N] x;

parameters {
  real b1;
  real b2;

model {
  b1 ~ normal(0, 1); // should always put priors on things
  b2 ~ normal(0, 1); // improper priors can lead to unexpected
                     // nastiness
  y ~ bernoulli_logit(b1 + b2 * x); // bernoulli_logit has
            // the inverse logit built in and the numerics are
            // more stable

generated quantities {
  int y_predicted[N];
  for(n in 1:N) {
    y_predicted[n] = bernoulli_logit_rng(b1 + b2 * x[n]); 

There’s info on logistic regressions here:

For the posterior predictive, generated quantities is evaluated once for every posterior draw you generate.

So if you generate 4000 posterior draws (the default 1000 draws with 4 chains), then you will get 4000 draws of the length N array y_predicted (so 4000xN integers).

More info on posterior prediction here:

The randomness there comes from two places, the uncertainty in b1 and b2 and the randomness in generating a bernoulli random number.

That make sense?


That’s very helpful and thanks!