I’d like to understand the theory underlying the prediction for a new observation in the “inhaler” ordinal regression model fitted using brms. The aim is to get the probability of each of the K = 4 response categories given the predictors, which are contained in a data.frame new_data with one row. For simplicity, consider the case of having no group-level effects.
I understand the behavior of predict(inhaler_fit, newdata = new_data)
in the following way: for each of the S posterior draws of \theta already obtained (say \theta_s), draw a new observation from p(y_{rep} | \theta_s), i.e. in my case using ordered_logistic_rng
. When looking only at the resulting S draws for y_{rep}, this gives the posterior predictive distribution for a new observation on the ordinal scale. Because of the default summary = TRUE
in predict.brmsfit()
, the proportions (i.e. averages) of each of the K = 4 response categories are returned, so this gives the probabilities P(y_{rep} = i | y) for i \in \{1, 2, 3, 4\}, estimated by sampling from p(y_{rep} | \theta_s) (in addition to sampling from the posterior p(\theta | y), of course). I considered the following alternative procedure: for each posterior draw \theta_s, calculate P(y_{rep} = i | \theta_s) for i \in \{1, 2, 3, 4\}, and afterwards average P(y_{rep} = i | \theta_s) over all s \in \{1, \dots, S\}. From some checks, I found that this is what fitted(inhaler_fit, newdata = new_data)
returns. Shouldn’t this give P(y_{rep} = i | y) as well, since P(y_{rep} = i | y) = \int P(y_{rep} = i | \theta) p(\theta | y) d\theta = E_{\theta | y}(P(y_{rep} = i | \theta))? So shouldn’t the second approach (based on fitted.brmsfit()
) be equivalent to the first one (based on predict.brmsfit()
), but more accurate, since it does not involve sampling from p(y_{rep} | \theta_s)? I guess I am misunderstanding something, as the documentation for fitted.brmsfit()
says: “By definition, these predictions have smaller variance than the response predictions performed by the predict method.” But I don’t see where I went wrong.
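To make the comparison concrete, here is a minimal sketch of the two procedures for an ordered-logistic model, written in Python rather than brms/R. Everything here is made up for illustration: the “posterior draws” of the linear predictor eta and of the cutpoints are simulated, not taken from a real fit. The fitted()-style estimate averages the exact category probabilities over draws; the predict()-style estimate additionally samples y_{rep} per draw (mimicking ordered_logistic_rng) and then takes category proportions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ordered_logistic_probs(eta, cutpoints):
    """Category probabilities P(y = i | eta, cutpoints) under the
    cumulative-logit (ordered logistic) model."""
    cdf = 1.0 / (1.0 + np.exp(-(cutpoints - eta)))   # P(y <= i), i = 1..K-1
    cdf = np.concatenate(([0.0], cdf, [1.0]))
    return np.diff(cdf)                              # length-K probability vector

# Toy "posterior draws" (hypothetical, standing in for the S draws from brms):
S, K = 4000, 4
eta_draws = rng.normal(0.3, 0.5, size=S)                            # linear predictor
cut_draws = np.sort(rng.normal([-1.0, 0.0, 1.0], 0.1, size=(S, K - 1)), axis=1)

# fitted()-style: average the exact category probabilities over the S draws
probs = np.array([ordered_logistic_probs(e, c)
                  for e, c in zip(eta_draws, cut_draws)])
fitted_style = probs.mean(axis=0)

# predict()-style: sample one y_rep per draw, then take category proportions
y_rep = np.array([rng.choice(K, p=p) for p in probs])
predict_style = np.bincount(y_rep, minlength=K) / S

print(fitted_style)
print(predict_style)  # close to fitted_style, with extra Monte Carlo noise
```

Both vectors estimate the same quantity, P(y_{rep} = i | y); the second one just carries the additional noise from sampling y_{rep} given each \theta_s, which is exactly what motivates my question.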
I found that my question is very similar to this one, but I still don’t understand why we can’t get the posterior predictive distribution by averaging the likelihood over the S posterior draws. Or is my assumption p(y_{rep} | y) = \int p(y_{rep} | \theta) p(\theta | y) d\theta = E_{\theta | y}(p(y_{rep} | \theta)) wrong?