# Using average of the posterior draws for posterior predictive checking?

I have a model which I fitted in Stan and now I am trying to determine how well it fits the data. To do so, I’ve been using the `bayesplot` package on the Stan objects.

Let us define the empirical value found in the data as `y` and the value estimated with Stan as `y_rep`. One of the plots provided by the package (`ppc_scatter_avg`) shows a scatterplot where the y-axis is the empirical value `y` and the x-axis is not the estimated value `y_rep` from a single draw, but rather the average of `y_rep` over all draws from the posterior.

My supervisor and I are a bit confused by this, specifically by the use of the average of `y_rep`. I thought the way to do posterior predictive checks would be to create a separate plot for each draw from the posterior, with the x-axis showing `y_rep` for that single draw.

Is using the average of `y_rep` just a shortcut to avoid looking at dozens of individual plots? Is it justified to look at that type of plot? My supervisor is asking me to provide a reference from the literature on why this is okay, but I can’t seem to find any.

Maybe I am misreading your post, but it sounds like the distinction between posterior_predict and posterior_epred (Expected Value of Posterior vs. Posterior of Expected Value with epred - #2 by scholz).
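To make that distinction concrete, here is a minimal sketch, assuming `m` is a fitted `brms` model (like the one fitted further down in this thread):

```r
library(brms)

# Draws of the expected value E[y | theta] for each data point:
# one row per posterior draw, one column per observation.
pe <- posterior_epred(m)

# Draws of y_rep, i.e. simulated new outcomes: same shape, but these
# additionally include the observation-level sampling noise.
ppd <- posterior_predict(m)
```

Averaging either matrix over its rows (draws) gives one summary value per observation, which is the kind of quantity `ppc_scatter_avg` puts on its x-axis.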

It sounds like you might be thinking of the `ppc_dens_overlay` function (if your outcome is continuous) or the `ppc_hist` function (if your outcome is counts). These functions take several draws from the posterior (ten by default) and then plot them along with the actual outcome from the data used to fit the model. This shows the extent to which the distributions of the posterior simulations match the actual distribution of the outcome.
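For reference, a minimal sketch of those calls using `bayesplot` directly, assuming `y` is the observed outcome vector and `yrep` is a draws-by-observations matrix returned by `posterior_predict()` (hypothetical object names):

```r
library(bayesplot)

# Overlay the density of 10 posterior draws on the density of the data
ppc_dens_overlay(y, yrep[1:10, ])

# Histogram version, more suitable for count outcomes
ppc_hist(y, yrep[1:5, ])

# If you really do want one scatterplot panel per draw, as described in
# the question, bayesplot also provides ppc_scatter:
ppc_scatter(y, yrep[1:4, ])
```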

`ppc_scatter_avg` compares the predicted value (the mean of the posterior predictions for each data point) against the actual value for each data point. The following example demonstrates this by summarizing the posterior predictions manually and superimposing them on top of the `ppc_scatter_avg` plot. I’m using `brms` to fit the model, so the function calls are slightly different, but `brms` uses the `bayesplot` functions under the hood.

```r
library(tidyverse)
theme_set(theme_bw())
library(tidybayes)
library(patchwork)
library(brms)

# Fit a model
m = brm(count ~ Trt + (1 | patient),
        data = epilepsy,
        family = poisson(),
        backend = "cmdstanr",
        cores = 4,
        file = "model")

pp = pp_check(m, "scatter_avg")  # Calls bayesplot::ppc_scatter_avg to generate the plot
#> Using all posterior draws for ppc type 'scatter_avg' by default.

# Get the average of the predicted draws and join to the original data.
# predicted_draws uses posterior_predict under the hood and returns a tidy data frame
pred = predicted_draws(m, newdata = select(epilepsy, Trt, patient, count))
pred = pred %>%
  group_by(Trt, patient) %>%
  summarise(.prediction = mean(.prediction)) %>%
  left_join(epilepsy)

# Superimpose the manual predictions on ppc_scatter_avg to show they are the same
pp +
  geom_point(data = pred, aes(.prediction, count),
             colour = "red", size = 0.5, alpha = 0.5)
```

Here’s the posterior predictive check for this model using `ppc_hist`:

```r
pp_check(m, type = "hist", ndraws = 5, breaks = seq(0, 100, 2))  # Calls bayesplot::ppc_hist
```