Posterior predictive check for binary outcome in Stan

Hi,

I have some observed binary data y, and I am interested in a posterior predictive check for the following simple model (this is a simplified version, so there may be mistakes in the Stan code):

data {
  int N;
  vector[N] y;
  vector[N] x;
}
model {
  vector[N] p = inv_logit(b1 + b2 * x);
  y ~ bernoulli(p);
}

For the posterior predictive check, my idea is to add the following in the generated quantities block:

generated quantities {
  vector[N] y_predicted;
  y_predicted ~ bernoulli(p);
}

Then I would compare y with y_predicted. I am not sure whether the above idea is correct. I am also not sure how Stan uses the HMC draws for the posterior predictive distribution. For example, for p we have 1000 simulated draws from the posterior, but what about y_predicted? Will we get only N y_predicted values, or 1000*N values? How do we account for the uncertainty in y_predicted?

Thx!

I edited your model some more. Not quite sure if this runs and works but it should be close:

data {
  int<lower=0> N;
  int<lower=0, upper=1> y[N]; // The data for a bernoulli must be integers
                              // (vectors are only for reals)
  vector[N] x;
}

parameters {
  real b1;
  real b2;
}

model {
  b1 ~ normal(0, 1); // should always put priors on things
  b2 ~ normal(0, 1); // improper priors can lead to unexpected
                     // nastiness
  y ~ bernoulli_logit(b1 + b2 * x); // bernoulli_logit has
            // the inverse logit built in and the numerics are
            // more stable
}

generated quantities {
  int y_predicted[N];
  for(n in 1:N) {
    y_predicted[n] = bernoulli_logit_rng(b1 + b2 * x[n]); 
  }
} 

There’s info on logistic regressions here: 1.5 Logistic and Probit Regression | Stan User’s Guide

For the posterior predictive, the generated quantities block is evaluated once for every posterior draw you generate.

So if you generate 4000 posterior draws (the default of 1000 draws per chain with 4 chains), then you will get 4000 draws of the length-N array y_predicted (so 4000 x N integers).

More info on posterior prediction here: 24 Posterior Predictive Sampling | Stan User’s Guide

The randomness there comes from two places: the posterior uncertainty in b1 and b2, and the randomness in generating each Bernoulli random number.
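For what it's worth, here is a minimal sketch of how that can look from Python with cmdstanpy (the file name bernoulli_ppc.stan and the simulated toy data are just placeholders for your own setup; any other interface works the same way):

# Minimal sketch: fit the model above and do a simple posterior predictive check.
# Assumes the Stan program is saved as bernoulli_ppc.stan; the toy data below
# is made up for illustration.
import numpy as np
from cmdstanpy import CmdStanModel

# Toy data: N binary outcomes y and a covariate x.
rng = np.random.default_rng(1)
N = 100
x = rng.normal(size=N)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * x))))

model = CmdStanModel(stan_file="bernoulli_ppc.stan")
fit = model.sample(data={"N": N, "y": y.astype(int), "x": x})

# One row per posterior draw, one column per observation:
# with the defaults (4 chains x 1000 draws) this is a 4000 x N array.
y_pred = fit.stan_variable("y_predicted")
print(y_pred.shape)

# Simple check: compare the observed proportion of ones with the
# distribution of that proportion across the replicated data sets.
obs_stat = y.mean()
rep_stat = y_pred.mean(axis=1)           # one value per posterior draw
p_value = np.mean(rep_stat >= obs_stat)  # posterior predictive p-value
print(obs_stat, rep_stat.mean(), p_value)

The per-draw statistic rep_stat is what carries the uncertainty you asked about: each of the 4000 rows of y_predicted is one replicated data set, so you summarize across rows rather than collapsing them.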

Does that make sense?


That’s very helpful, thanks!