I have a model which I fitted in Stan and now I am trying to determine how well it fits the data. To do so, I've been using the bayesplot package on the Stan objects.

Let us define the empirical value found in the data as y and the estimated value from Stan as y_rep. One of the plots provided by the package (ppc_scatter_avg) is a scatterplot where the y-axis shows the empirical value y, and the x-axis shows not the estimated value y_rep itself, but instead the average of y_rep over all draws from the posterior.
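To make concrete what "the average of y_rep over all draws" means: y_rep is an S x N matrix (S posterior draws, N observations), and the plot uses the column means, giving one point per observation rather than one point per draw. A minimal base-R sketch with synthetic draws (the data here are made up, not from a real model):

```r
set.seed(1)
S <- 100  # number of posterior draws
N <- 20   # number of observations

y <- rnorm(N)  # "observed" data

# Hypothetical y_rep matrix: S draws x N observations, centred near y
yrep <- matrix(rnorm(S * N, mean = rep(y, each = S)), nrow = S)

# ppc_scatter_avg plots colMeans(yrep) on the x-axis against y on the y-axis,
# so there is exactly one point per observation
avg_yrep <- colMeans(yrep)
length(avg_yrep)  # N, same as length(y)
```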
My supervisor and I are a bit confused by this, specifically by the use of the average of y_rep. I thought that the way to do posterior predictive checks would be to create a separate plot for each draw from the posterior, with the x-axis showing y_rep for that single draw.

Is using the average of y_rep just a shortcut to avoid looking at dozens of individual plots? Is it justified to look at that type of plot? My supervisor is asking me to provide a reference from the literature on why this is okay, but I can't seem to find any.
Maybe I am misreading your post, but it sounds like the distinction between posterior_predict and posterior_epred (Expected Value of Posterior vs. Posterior of Expected Value with epred - #2 by scholz).
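For reference, the two functions differ in whether observation noise is included: posterior_predict simulates new outcomes, while posterior_epred returns the posterior distribution of the expected value. A brief sketch, assuming m is a fitted brmsfit object (as in the example further down):

```r
# Assuming m is a fitted brms model
yrep  <- posterior_predict(m)  # draws x observations: simulated new outcomes,
                               # including observation-level noise
epred <- posterior_epred(m)    # draws x observations: posterior of the
                               # expected value (conditional mean), no noise
```

Averaging either over draws gives one value per observation, which is what ppc_scatter_avg plots on its x-axis (using posterior_predict-style draws).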
It sounds like you might be thinking of the ppc_dens_overlay function (if your outcome is continuous) or the ppc_hist function (if your outcome is counts). These functions take several draws from the posterior (ten by default) and plot them along with the actual outcome from the data used to fit the model. This shows the extent to which the distributions of the posterior simulations match the actual distribution of the outcome.
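If you want to try these functions without refitting a model, bayesplot ships small example objects (example_y_data() and example_yrep_draws()); a quick sketch:

```r
library(bayesplot)

y <- example_y_data()         # vector of observed outcomes
yrep <- example_yrep_draws()  # matrix: posterior draws x observations

# Overlay the empirical density of y with densities from 10 posterior draws
ppc_dens_overlay(y, yrep[1:10, ])

# Or histograms of y alongside a few individual draws
ppc_hist(y, yrep[1:5, ])
```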
ppc_scatter_avg compares the predicted value (using the mean of the posterior predictive draws for each data point) against the actual value for each data point. The following example demonstrates this by summarizing the posterior predictions manually and superimposing them on top of the ppc_scatter_avg plot. I'm using brms to fit the model, so the function calls are slightly different, but brms is using the bayesplot functions under the hood.
library(brms)
library(tidybayes)
library(dplyr)
library(ggplot2)

# Fit a model (using the epilepsy dataset that ships with brms)
m = brm(count ~ Trt + (1|patient), data = epilepsy, family = poisson)

pp = pp_check(m, type = "scatter_avg") # Calls bayesplot::ppc_scatter_avg to generate the plot
#> Using all posterior draws for ppc type 'scatter_avg' by default.

# Get average of predicted draws and join to original data.
# predicted_draws uses posterior_predict under the hood and returns a tidy data frame
pred = predicted_draws(m, newdata = select(epilepsy, Trt, patient, count))

pred = pred %>%
  group_by(Trt, patient, count) %>%
  summarise(.prediction = mean(.prediction))

# Superimpose manual predictions on ppc_scatter_avg to show they are the same
pp +
  geom_point(data = pred, aes(.prediction, count),
             colour = "red", size = 0.5, alpha = 0.5)
Here's the posterior predictive check for this model using:

pp_check(m, type = "hist", ndraws = 5, breaks = seq(0, 100, 2)) # Calls bayesplot::ppc_hist