Posterior predictive distribution for ordinal regression


#1

I’d like to understand the theory underlying the prediction for a new observation in the “inhaler” ordinal regression model fitted using brms. The aim is to get the probability for each of the K = 4 response categories given the predictors, contained in a data.frame new_data with 1 row. For simplicity, consider the case of having no group-level effects.

I understand the behavior of predict(inhaler_fit, newdata = new_data) in the following way: for each of the S posterior draws of \theta already obtained (say \theta_s), draw a new observation from p(y_{rep} | \theta_s), i.e. in my case using ordered_logistic_rng. Looking only at the resulting S draws of y_{rep}, this gives the posterior predictive distribution for a new observation on the ordinal scale. Because of the default summary = TRUE in predict.brmsfit(), the proportions (i.e. averages) for each of the K = 4 response categories are returned, so this gives the probabilities P(y_{rep} = i | y) for i \in \{1, 2, 3, 4\}, estimated by sampling from p(y_{rep} | \theta_s) (in addition to sampling from the posterior p(\theta | y), of course).

I also considered the following procedure: for each posterior draw \theta_s, calculate P(y_{rep} = i | \theta_s) for i \in \{1, 2, 3, 4\}, and afterwards average P(y_{rep} = i | \theta_s) over all s \in \{1, \dots, S\}. From some checks, I found that this is what fitted(inhaler_fit, newdata = new_data) returns. Shouldn’t this give P(y_{rep} = i | y) as well, since P(y_{rep} = i | y) = \int P(y_{rep} = i | \theta) p(\theta | y) d\theta = E_{\theta | y}(P(y_{rep} = i | \theta))?

So, shouldn’t the second way (based on fitted.brmsfit()) be equivalent to the first one (based on predict.brmsfit()), but more accurate, since it does not involve the additional sampling from p(y_{rep} | \theta_s)? I guess I am misunderstanding something, as the documentation for fitted.brmsfit() says: “By definition, these predictions have smaller variance than the response predictions performed by the predict method.” But I don’t see where I got it wrong.
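To make the claimed identity concrete, here is a small numerical sketch (in Python purely for illustration; all numbers, cutpoints, and the simulated "posterior" are made up). It compares the two estimators for an ordered-logistic likelihood: averaging the category probabilities over draws (the fitted-style estimate) versus drawing one y_{rep} per draw and tabulating (the predict-style estimate).

```python
import numpy as np

rng = np.random.default_rng(0)

S = 200_000                          # simulated "posterior draws"
K = 4                                # response categories
cuts = np.array([-1.0, 0.5, 2.0])    # fixed cutpoints, chosen arbitrarily

# Fake posterior draws of the linear predictor eta (stand-in for theta_s)
eta = rng.normal(0.3, 0.5, size=S)

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Cumulative probabilities P(y <= k | eta_s): S x (K-1)
cum = inv_logit(cuts[None, :] - eta[:, None])
# Category probabilities P(y = k | eta_s): S x K
probs = np.diff(
    np.concatenate([np.zeros((S, 1)), cum, np.ones((S, 1))], axis=1), axis=1)

# fitted-style estimate: average the category probabilities over draws
p_avg = probs.mean(axis=0)

# predict-style estimate: draw one y_rep per posterior draw (inverse CDF),
# then tabulate the sampled categories
u = rng.random(S)
y_rep = (u[:, None] > np.cumsum(probs, axis=1)).sum(axis=1)
p_sample = np.bincount(y_rep, minlength=K) / S

print(np.round(p_avg, 4))
print(np.round(p_sample, 4))
```

The two printed probability vectors agree up to Monte Carlo error, consistent with P(y_{rep} = i | y) = E_{\theta | y}(P(y_{rep} = i | \theta)).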

I found out that my question is very similar to this one, but I still don’t understand why we can’t get the posterior predictive distribution through averaging over the likelihood at the S posterior draws. Or is my assumption p(y_{rep} | y) = \int p(y_{rep} | \theta) p(\theta | y) d\theta = E_{\theta | y}(p(y_{rep} | \theta)) wrong?


#2

I now ran some tests to check if the two approaches for inferring the posterior predictive distribution (PPD) yield the same results (or at least if their results are similar, due to the additional stochastic nature of the first approach).

For the sake of clarity, I will denote the second approach (inferring the PPD through averaging over the likelihoods at the S posterior draws \theta_s) by “PPD_likAvg” and the first one (the standard approach of sampling once from p(y_{rep}|\theta_s) for each \theta_s) by “PPD_likSample”. As already mentioned, I used fitted(inhaler_fit, newdata = new_data) to perform the PPD_likAvg approach for the ordinal regression. For another example with a Gaussian likelihood, I used the following function for calculating the density of the PPD in the PPD_likAvg approach:

# Density of the PPD under the PPD_likAvg approach for a Gaussian model:
# average the normal likelihood over the S posterior draws of (mu_s, sigma_s).
# y is assumed to be a numeric vector (a grid of evaluation points).
dppd_likAvg <- function(y, newdata = NULL, brmsfit_obj){
  # S x N matrix of posterior draws of the linear predictor
  linPred_post <- fitted(brmsfit_obj, newdata = newdata, scale = "linear",
                         summary = FALSE)
  # Posterior draws of sigma (extract_draws() is a brms internal; newer brms
  # versions provide prepare_predictions() instead)
  fit_draws_brms <- extract_draws(brmsfit_obj)
  sigma_post <- fit_draws_brms$dpars$sigma
  S <- nrow(linPred_post)
  # For each observation (column), average the normal density over the S draws
  apply(linPred_post, 2, function(linPred_i){
    # length(y) x S matrix of likelihood values
    likNonAvg <- sapply(seq_len(S), function(s){
      dnorm(y, mean = linPred_i[s], sd = sigma_post[s])
    })
    rowMeans(likNonAvg)
  })
}

The tests showed that the two approaches indeed give very similar results, both for the ordinal regression example (where the likelihood is a probability mass function with finite support, so it can easily be written down as a vector of K = 4 probabilities) and for the Gaussian example (where I plotted the density calculated by dppd_likAvg() on a fine grid of y values together with the kernel density estimate of the PPD_likSample draws in a common plot). The differences are so small that I guess the two approaches should indeed give the same results in the limit. With “in the limit”, I mean letting the PPD_likSample approach sample infinitely many times (instead of just once) from p(y_{rep}|\theta_s) for each \theta_s.
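The Gaussian comparison can be reproduced without a fitted model. The sketch below (Python, purely illustrative; the "posterior" draws of mu and sigma are made up) evaluates the PPD_likAvg mixture density on a grid, as dppd_likAvg() does, and compares it with a histogram of PPD_likSample draws:

```python
import numpy as np

rng = np.random.default_rng(1)

S = 100_000                                   # fake "posterior draws"
mu = rng.normal(1.0, 0.3, size=S)             # stand-in linear predictor draws
sigma = np.abs(rng.normal(0.8, 0.05, size=S)) # stand-in sigma draws

def dnorm(y, mean, sd):
    # Normal density, vectorized over the posterior draws
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# PPD_likAvg: mixture density mean_s N(y | mu_s, sigma_s), evaluated at
# the centers of a set of bins
edges = np.linspace(-3.0, 5.0, 17)            # 16 bins of width 0.5
centers = 0.5 * (edges[:-1] + edges[1:])
dens_avg = np.array([dnorm(y, mu, sigma).mean() for y in centers])

# PPD_likSample: one y_rep per posterior draw, then a histogram estimate
y_rep = rng.normal(mu, sigma)
dens_sample, _ = np.histogram(y_rep, bins=edges, density=True)

# The two density estimates agree up to Monte Carlo and binning error
print(np.round(dens_avg, 3))
print(np.round(dens_sample, 3))
```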

Additionally, pages 168-169 of the book “Bayesian Data Analysis” (3rd edition) by Gelman et al. (2014) also suggest that the PPD_likAvg approach is a valid way of inferring the PPD.

So, to come back to my question: if the PPD_likAvg approach is valid, this means that fitted(inhaler_fit, newdata = new_data) does indeed give the PPD for the ordinal regression example. I therefore think this should be clarified in the documentation for fitted.brmsfit(), which says: “By definition, these predictions have smaller variance than the response predictions performed by the predict method.” Of course, this statement is correct in most cases, but in this particular case, it seems that fitted.brmsfit() can indeed be used for inferring the PPD. I would even prefer fitted.brmsfit() over predict.brmsfit(), since it doesn’t require sampling from the likelihood. In my ordinal regression example, I used the “cumulative” family, but perhaps this also holds for other response families (maybe also non-ordinal ones). Perhaps it would be helpful if @paul.buerkner could give his opinion on this whole issue.


#3

In categorical-like models (to which ordinal models belong), I think you are right that we can compute the PPD from the category probabilities and, what is more, can do so without adding the sampling error of predict. I don’t want to add this to the documentation since it will confuse users even more about what fitted and predict do and will likely be relevant only for very few models and users.
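The point about avoiding the sampling error of predict can be quantified with a toy Rao-Blackwellization-style experiment (Python, purely illustrative; the Beta "posterior" for a single category probability is made up). Both estimators target P(y_{rep} = 1 | y), but the probability-averaging (fitted-style) estimator has visibly smaller Monte Carlo variance than the indicator-averaging (predict-style) estimator:

```python
import numpy as np

rng = np.random.default_rng(2)

R = 2000    # repeat the whole estimation R times to measure estimator spread
S = 200     # "posterior draws" per repetition

est_avg = np.empty(R)      # fitted-style: average P(y_rep = 1 | theta_s)
est_sample = np.empty(R)   # predict-style: average of sampled indicators

for r in range(R):
    # Toy posterior for a single category probability: theta ~ Beta(2, 5)
    theta = rng.beta(2.0, 5.0, size=S)
    est_avg[r] = theta.mean()
    est_sample[r] = (rng.random(S) < theta).mean()

# Both are unbiased for E[theta] = 2/7, but the spread differs noticeably
print(est_avg.mean(), est_sample.mean())
print(est_avg.std(), est_sample.std())
```

This is the same trade-off the fitted.brmsfit() documentation describes: conditioning on theta_s and averaging exact probabilities removes one layer of sampling noise.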