I have a question regarding how prediction samples are drawn when using the generated quantities block in Stan. Take the following simple code as an example.
data {
  int<lower=0> T;                // number of samples
  int<lower=0, upper=1> y[T];    // observed samples
  real<lower=0> shape1;          // hyperparameter for beta prior
  real<lower=0> shape2;          // hyperparameter for beta prior
  int<lower=0> T_new;            // number of predictions
}
parameters {
  real<lower=0, upper=1> p;
}
model {
  p ~ beta(shape1, shape2);
  y ~ bernoulli(p);
}
generated quantities {
  int<lower=0, upper=1> y_new[T_new];
  for (t in 1:T_new) {
    y_new[t] = bernoulli_rng(p);
  }
}
I use a beta prior with a Bernoulli/binomial likelihood, and the new predictions are based on the posterior of p, which we know is also a beta distribution.
For example, suppose we only want to predict the next observation, so T_new = 1. If we set iter = 2000, chains = 4 in sampling, I notice that Stan draws 4000 samples of the parameter p and 4000 samples of y_new; that is, the numbers of parameter samples and predicted samples are the same. So I am very curious about how Stan does this.
I assume that if we do prediction outside of Stan, we can sample from the posterior over the parameters and, for each parameter sample, draw many samples of y_new. Say we draw 1000 parameters from the posterior and 1000 samples of y_new per parameter sample; we would finally have 1 million samples of y_new, which form the posterior predictive distribution of y_new given y, p(y_new|y).
But with Stan, it seems only one sample is drawn for the prediction per sample of the parameters, and this puzzles me.
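For clarity, here is a rough sketch (in Python, with invented data and hyperparameters) of the out-of-Stan procedure I have in mind, exploiting the conjugacy:

import numpy as np

rng = np.random.default_rng(0)

# Invented data and hyperparameters, just for illustration.
y = np.array([1, 0, 1, 1, 0, 1, 0, 1])
shape1, shape2 = 2.0, 2.0

# By conjugacy, the posterior of p is Beta(shape1 + sum(y), shape2 + T - sum(y)).
a_post = shape1 + y.sum()
b_post = shape2 + len(y) - y.sum()

# 1000 posterior draws of p, 1000 draws of y_new per draw of p:
p_draws = rng.beta(a_post, b_post, size=1000)
y_new = rng.binomial(1, p_draws[:, None], size=(1000, 1000))
print(y_new.size)  # 1000000 predicted samples in total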
Thank you for your response! So you mean that when we use generated quantities in Stan to draw predicted samples, only one sample is drawn per set of parameter samples. That doesn't seem to make sense to me. If only one prediction is drawn per sample of the parameters, the resulting samples of y_new cannot represent the posterior predictive distribution, right?
If I am reading the situation right, there is some conceptual confusion (or maybe that’s on me, let’s see). Your predictive distribution is a marginalization of the joint distribution P(y_{new},p|y)=P(y_{new}|p)P(p|y), so to sample y_{new} from the predictive, we can sample (y_{new},p) from the joint and discard the latent p's. Since P(p|y) is the posterior Stan is sampling from and P(y_{new}|p) is the code in your generated quantities block for T_new=1, you get one independent sample from the predictive distribution per sample from the posterior.
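Since the example is conjugate, this is easy to check numerically outside of Stan. A minimal sketch (Python, with invented data; the exact predictive probability here is a_post / (a_post + b_post)):

import numpy as np

rng = np.random.default_rng(1)

# Invented data; by conjugacy the posterior of p is Beta(a_post, b_post).
y = np.array([1, 0, 1, 1, 0, 1, 0, 1])
a_post = 2.0 + y.sum()             # 7
b_post = 2.0 + len(y) - y.sum()    # 5

# One draw of (y_new, p) from the joint per posterior draw of p;
# discarding p leaves i.i.d. draws from the predictive distribution.
p = rng.beta(a_post, b_post, size=4000)
y_new = rng.binomial(1, p)

print(y_new.mean())                # close to the exact value below
print(a_post / (a_post + b_post))  # 7/12, roughly 0.583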
Now, if you draw N = 1000 predictive samples for each posterior sample, you have P(y_{new}[1:N], p | y) = \left[\prod_{i=1}^{N} P(y_{new}[i]|p)\right] P(p|y).
If we marginalize over all y_{new}[i], i \neq j, then we find that each y_{new}[j] has the predictive distribution as its marginal, yet the joint distribution has a quite different covariance structure, since the N points share the same p. So the samples are not i.i.d. draws from the predictive distribution. Even if you only want to characterize a marginal distribution (as for T_new = 1), the samples do not properly characterize it, as they can be arbitrarily dependent on each other.
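The practical consequence is easy to see in a sketch (Python, with the same invented posterior as above): both schemes below produce 1000 y_new samples with the correct marginal, but the estimate based on 10 shared p's is far noisier.

import numpy as np

rng = np.random.default_rng(2)
a_post, b_post = 7.0, 5.0  # an invented Beta posterior for p

def predictive_mean(n_p, n_y_per_p):
    # n_p posterior draws of p, n_y_per_p predictive draws per p.
    p = rng.beta(a_post, b_post, size=n_p)
    y_new = rng.binomial(1, p[:, None], size=(n_p, n_y_per_p))
    return y_new.mean()

# Repeat each scheme 500 times and compare the spread of the estimates.
est_iid = [predictive_mean(1000, 1) for _ in range(500)]
est_dep = [predictive_mean(10, 100) for _ in range(500)]
print(np.std(est_iid))  # about 0.016: 1000 effectively independent samples
print(np.std(est_dep))  # about 0.046: dominated by the 10 shared p's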
So I believe your out-of-Stan proposal is flawed, and that you have to make do with one predictive sample per posterior sample.
After carefully thinking this through, I agree with you. We can think of y_{new} as a parameter; we are actually sampling from the joint posterior distribution of (y_{new}, p).
Thank you very much for helping clear up my confusion!
It’s conditionally independent, but the posterior draws may be correlated (or anti-correlated), leading to correlation (or anti-correlation) in the predictive draws.
Right—means of more than one draw have lower variance than the distributions from which they’re drawn.
That’s not necessarily true, because often we only care about expectations, and for simple expectations, all the mean calculations distribute. You can even calculate variance this way by storing X^2 as well as X.
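For instance, a minimal sketch (Python; the Beta posterior values are invented, carried over from the toy example above): keep running sums of X and X^2 over all predictive draws, and you recover the mean and variance at the end without ever storing the draws themselves.

import numpy as np

rng = np.random.default_rng(3)
a_post, b_post = 7.0, 5.0  # invented Beta posterior, as above

sum_x, sum_x2, n = 0.0, 0.0, 0
for _ in range(4000):                # one p per posterior draw
    p = rng.beta(a_post, b_post)
    x = rng.binomial(1, p, size=10)  # several predictive draws per p
    sum_x += x.sum()                 # running sum of X
    sum_x2 += (x ** 2).sum()         # running sum of X^2
    n += x.size

mean = sum_x / n
var = sum_x2 / n - mean ** 2         # Var(X) = E[X^2] - E[X]^2
print(mean, var)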
But it rarely buys you much to take multiple predictive draws per posterior draw.
Even though they are draws from the correct marginal, I would be really wary of using a high T_new. In the OP’s example model, the posterior could (conceptually at least, albeit not practically, due to the simplicity and conjugacy) be bimodally concentrated close to p = 0 and p = 1. Clearly, taking one sample of p with T_new = 1000 would lead to disastrous posterior expectations, since we’d get almost only zeros or only ones. With more posterior samples the problem would decrease, but I’m guessing it would take more posterior samples than in the T_new = 1 case to actually get the expectation right?
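To make the worry concrete, here is a toy sketch (Python, with an artificial bimodal distribution standing in for the posterior; nothing here comes from the actual model):

import numpy as np

rng = np.random.default_rng(4)

def bimodal_p(size):
    # Artificial bimodal "posterior": 50/50 mix of Beta(2, 50) and Beta(50, 2),
    # so p is concentrated near 0 or near 1, with predictive mean 0.5.
    low = rng.beta(2, 50, size=size)
    high = rng.beta(50, 2, size=size)
    return np.where(rng.random(size) < 0.5, low, high)

# One posterior draw of p with T_new = 1000: almost all zeros or all ones.
print(rng.binomial(1, bimodal_p(1), size=1000).mean())  # near 0 or near 1

# 1000 posterior draws with T_new = 1: close to the true value 0.5.
print(rng.binomial(1, bimodal_p(1000)).mean())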
That’s right. If you take multiple predictive draws per parameter draw, the overall accuracy is almost certainly going to be bounded by the mixing of the parameter draws. You can’t help that by taking more predictive draws; you need to take enough parameter draws to mix.
We don’t want to think about getting expectations exactly right so much as accurate to within the MCMC sampling error. That error goes down with the square root of the number of draws in well-behaved posteriors. I’m so happy to be able to insert \LaTeX that I just have to say \mathcal{O}(\frac{1}{\sqrt{n}}).
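Spelled out: for (effectively) independent draws, the Monte Carlo standard error of an estimated expectation is

\mathrm{MCSE} = \sqrt{\frac{\mathrm{Var}(y_{new})}{n}} = \mathcal{O}\!\left(\frac{1}{\sqrt{n}}\right),

with n the effective number of draws for correlated MCMC output, so quadrupling the number of draws only halves the error.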