I have some questions regarding making predictions for new data.

How are Monte Carlo approximation to the posterior predictive and sampling from the posterior predictive related? I thought sampling was used for model checking, but it seems it is used for making predictions on new data, too.

According to the user guide section 27.5, the posterior predictive for regression is p(\tilde y | \tilde x, y, x) = \int p(\tilde y | \tilde x, \theta) p(\theta | y, x) d\theta, and it can be approximated using posterior samples as \frac{1}{M}\sum_{m=1}^M p(\tilde y | \tilde x, \theta^{(m)}) (eq.1), where \theta^{(m)} \sim p(\theta | x, y).

The user guide says that Stan gives predictive samples \tilde y^{(1)},\dots,\tilde y^{(M)}, where \tilde y^{(m)}\sim p(\tilde y | \tilde x, x, y), and the mean of the posterior predictive is approximated as \mathbb{E}[\tilde y | \tilde x, x, y]=\iint \tilde y\, p(\tilde y | \tilde x, \theta)\, p(\theta|x,y)\, d\theta\, d\tilde y \approx \frac{1}{M}\sum_{m=1}^M\tilde y^{(m)} (eq.2).

PyMC examples also make predictions using the mean of the posterior predictive samples. I also checked some books, forums, etc., but I can’t see how these two approaches are the same.

When sampling from the posterior, we get a single sample \tilde y^{(m)} for each posterior sample \theta^{(m)}, right?

What is the proper way of making predictions when Laplace approximation or variational inference is used? In this case we don’t have the posterior samples, but we know the approximate posterior density. Do we just apply Monte Carlo and sample from the posterior, or sample from the posterior predictive as well? Or any other way?

The posterior predictive distribution is an entire probability distribution for each individual \tilde{x}_i. Whether or not this distribution is appropriate to summarize via its mean depends on the details of what you are using it for.

Drawing one \tilde y^{(m)} \sim p(\tilde y | \tilde x, \theta^{(m)}) per posterior draw \theta^{(m)} is exactly equivalent to sampling from the posterior predictive distribution, which is the same mixture that the Monte Carlo approximation (eq.1) represents.

That’s the standard way to do it, yeah. If you thin, you might get fewer draws, and in some cases (when the conditional posterior predictive distribution is very heavy-tailed, but the posterior distributions of the parameters of the predictive distribution are well constrained) it can be informative to draw more than one sample per posterior iteration.
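As a sketch of that last point (hypothetical numbers, with NumPy draws standing in for real sampler output): when the conditional predictive is heavy-tailed, drawing K samples per posterior iteration yields many more tail draws without re-running the sampler.

```python
import numpy as np

rng = np.random.default_rng(1)

M = 2_000
# Stand-in for a well-constrained posterior over a location parameter
theta_draws = rng.normal(0.0, 0.05, size=M)

# Standard scheme: one predictive draw per posterior iteration,
# with a heavy-tailed (Student-t, df=3) conditional predictive
one_per_draw = theta_draws + rng.standard_t(df=3, size=M)

# Alternative: K draws per iteration, useful for stabilizing tail summaries
K = 50
many_per_draw = theta_draws[:, None] + rng.standard_t(df=3, size=(M, K))

print(one_per_draw.shape)   # (2000,)
print(many_per_draw.shape)  # (2000, 50)
```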

Inasmuch as the approximations are valid, it will be valid in the MCMC sense to sample from the approximate posterior and then sample from the resulting predictive distribution. In some of these cases, it might also be possible to derive the posterior predictive distribution analytically, which would also be a fine approach. However, in general I’d worry more about the approximation error in variational inference than about the MCMC error in the posterior predictive distribution, so I wouldn’t try too hard to get analytical results here!
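A minimal sketch of that workflow, assuming the Laplace/VI approximation is a multivariate normal over regression coefficients (all numbers here are hypothetical, not from a real fit):

```python
import numpy as np

rng = np.random.default_rng(2)

# Laplace/VI gives an approximate posterior density, e.g.
# q(beta) = Normal(mu_hat, Sigma_hat) for regression coefficients
mu_hat = np.array([0.5, -1.2])
Sigma_hat = np.diag([0.01, 0.02])

# Step 1: sample parameters from the approximate posterior
M = 10_000
beta_draws = rng.multivariate_normal(mu_hat, Sigma_hat, size=M)

# Step 2: sample the predictive exactly as you would with MCMC draws
x_new = np.array([1.0, 3.0])
sigma = 0.5  # known residual scale for this toy model
y_tilde = beta_draws @ x_new + rng.normal(0.0, sigma, size=M)

print(y_tilde.mean())  # near x_new . mu_hat = -3.1
```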

Thank you very much for the explanations. I think I understand better now, but I have a lot of follow-up questions.

1.1) If we are only interested in the mean of the predictive, is sampling from the predictive easier than (approximately) calculating the distribution itself? We already have the posterior samples, so we could approximate the posterior predictive via (eq.1) and calculate its mean; why not do that?

1.2) Is there a proof for the approximate calculation of the mean of the posterior predictive (eq.2)? Intuitively I can see it is just the mean of the samples, but the samples are drawn for each \theta^{(m)}, which confuses me, and I can’t grasp it from the formula.

1.3) I have written \tilde y^{(m)}\sim p(\tilde y | \tilde x, x, y) according to the user manual section 27.5, but shouldn’t it be \tilde y^{(m)}\sim p(\tilde y | \tilde x, \theta^{(m)}) as in section 27.3? Or are they the same thing?

1.4) Is the mean of the posterior predictive a reasonable choice for predictions (for linear or logistic regression, for instance)? I suppose it is (it was used in the examples of both Stan and PyMC), but I can’t be sure.

3.1) Is sampling from the variational (or Laplace-approximated) posterior easier than sampling directly from the true posterior? When no analytical derivation is possible with VI, there should still be an advantage over MCMC, I suppose.

3.2) Do Stan and/or others always sample from the variational posterior even if analytical derivation is possible?

The easiest way to approximate the distribution is to sample from it. I think that the two options you are contrasting here are really the same thing.
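As a toy numeric check of that point (simulated posterior draws standing in for real sampler output, with a normal likelihood so the conditional predictive mean is just \theta): the mean of one-draw-per-iteration predictive samples (eq.2) and the mean computed from the mixture view (eq.1) converge to the same value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: pretend we already have M posterior draws of a
# location parameter theta for the model y ~ Normal(theta, sigma), sigma known.
M = 100_000
sigma = 1.0
theta_draws = rng.normal(loc=2.0, scale=0.1, size=M)  # stand-in for p(theta | x, y)

# (eq.2): one predictive draw per posterior draw, then average
y_tilde = rng.normal(loc=theta_draws, scale=sigma)  # y~^(m) ~ p(y~ | x~, theta^(m))
mean_via_samples = y_tilde.mean()

# (eq.1): average the conditional predictive means instead of sampling
# (for a Normal likelihood, E[y~ | theta] = theta)
mean_via_mixture = theta_draws.mean()

print(mean_via_samples, mean_via_mixture)  # both close to 2.0
```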

I’m not sure how much formalism is required for a rigorous proof, but the intuition is fairly accessible. You see that it’s the mean of the samples; what’s missing is to see why the distribution of the samples approximates the posterior predictive distribution. Intuitively, you can think of the posterior predictive distribution as a mixture of the predictive distributions corresponding to every possible combination of fixed values that the parameters may take.
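To make that mixture intuition precise (a sketch, using only the definitions above), iterated expectation gives:

```latex
% Each \tilde y^{(m)} is generated in two stages:
%   \theta^{(m)} \sim p(\theta \mid x, y), then
%   \tilde y^{(m)} \sim p(\tilde y \mid \tilde x, \theta^{(m)}),
% so marginally \tilde y^{(m)} \sim p(\tilde y \mid \tilde x, x, y), and
\mathbb{E}\big[\tilde y^{(m)}\big]
  = \mathbb{E}_{\theta^{(m)}}\Big[\mathbb{E}\big[\tilde y^{(m)} \mid \theta^{(m)}\big]\Big]
  = \iint \tilde y \, p(\tilde y \mid \tilde x, \theta)\, p(\theta \mid x, y)\, d\tilde y\, d\theta
  = \mathbb{E}\big[\tilde y \mid \tilde x, x, y\big].
```

The sample mean \frac{1}{M}\sum_{m=1}^M \tilde y^{(m)} then converges to this common expectation by the law of large numbers, which is exactly (eq.2).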

The first expression is less precise, and could be clarified by dropping the superscript on the left-hand side: marginally, averaging over the posterior draws, \tilde{y} \sim p(\tilde{y}|\tilde{x},x,y) describes the same samples as drawing \tilde{y}^{(m)} \sim p(\tilde{y}|\tilde{x},\theta^{(m)}) for each m. And you’re right that identifying a particular iteration m on the left-hand side requires identifying the particular posterior iteration m on the right-hand side.

It totally depends on what you want to use the predictions for. For example, in logistic regression with a binary response, the mean of the posterior predictive distribution is guaranteed not to resemble any data that could conceivably arise (it will be some fraction, not zero or one). For posterior predictive checking, we don’t use the mean; we use the full distribution of the samples. For certain goodness-of-fit measures, it makes sense to do something intermediate: rather than taking the mean of \tilde{y}^{(m)} over all m, we might use the expectation of \tilde{y} conditional on \theta^{(m)}, iteration-wise for each m, yielding a posterior distribution for the goodness-of-fit measure itself (for example, R-squared). In some applications, for example in decision-theoretic contexts, it might make sense to use the mean of \tilde{y}^{(m)} taken over all iterations m.
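A small sketch of the logistic-regression point (the posterior draws of the success probability are simulated stand-ins for inv_logit(\tilde x \cdot \beta^{(m)})): every posterior predictive draw is 0 or 1, so their mean is a fraction that matches no conceivable single observation.

```python
import numpy as np

rng = np.random.default_rng(3)

M = 20_000
# Hypothetical posterior draws of the success probability for one new x~
p_draws = rng.beta(8, 4, size=M)    # stand-in for inv_logit(x~ . beta^(m))

# Posterior predictive draws: each is exactly 0 or 1
y_tilde = rng.binomial(1, p_draws)

print(y_tilde.mean())  # a fraction near E[p] = 8/12, never 0 or 1 itself
```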

Conditional on samples from the posterior distribution of the parameters \theta, neither is easier or harder. In general, variational/Laplace approximations to posteriors can make the posteriors faster to compute. But faster doesn’t necessarily mean easier. If you have enough computing firepower, then it is definitely easier to consistently compute correct/accurate posteriors with full MCMC than with these approximations. And it’s easier to have confidence in the validity of the posterior when it comes from full MCMC with robust diagnostics.

Stan does whatever your Stan program says. If you want to code up the analytical derivation in your generated quantities and sample from that, you can do it.

Regarding (3.1), I knew variational inference was faster, but for the posterior predictive we again have to sample (if no analytical derivation is possible) on top of fitting the variational posterior. I assume it is still faster compared to MCMC.

Anyway, thank you very much for your time and detailed answers.