Ordered Beta Regression model: is a custom posterior predictive check plot necessary?

Hi Arthur -

At present there is no way to fully visualize the predictive distribution using conventional plots. I show below how to do it with bayesplot, though it still isn't perfect; I plan to add a more accurate posterior predictive visualization to the next version of the ordbetareg package. For now, a reasonable way to do predictive visualization is the following:

pacman::p_load(ordbetareg,
               dplyr,
               brms,
               ggplot2,
               patchwork,
               bayesplot)
data("pew")

model_data <- select(pew,therm,age="F_AGECAT_FINAL",
                     sex="F_SEX_FINAL",
                     income="F_INCOME_FINAL",
                     ideology="F_IDEO_FINAL",
                     race="F_RACETHN_RECRUITMENT",
                     education="F_EDUCCAT2_FINAL",
                     region="F_CREGION_FINAL",
                     approval="POL1DT_W28",
                     born_again="F_BORN_FINAL",
                     relig="F_RELIG_FINAL",
                     news="NEWS_PLATFORMA_W28") %>%
  mutate_at(c("race","ideology","income","approval","sex","education","born_again","relig"), function(c) {
    factor(c, exclude=levels(c)[length(levels(c))])
  }) %>%
  # need to make these ordered factors for BRMS
  mutate(education=ordered(education),
         income=ordered(income))

# Fit model
# ord_fit_mean <- ordbetareg(formula=therm ~ mo(education) + mo(income) + (1|region),
#                            data=model_data,
#                            cores=2,chains=2,iter=1000)

# Instead of fitting the model, one can load it through the ordbetareg package
data("ord_fit_mean")

# need to get the predicted outcome distribution and plot it separately for discrete versus continuous responses
# use bayesplot to do plotting manually

# determine which indicators are discrete (equal to 0 or 1)

discrete <- ord_fit_mean$data$therm %in% c(0,1)
outcome <- ord_fit_mean$data$therm

# posterior predictive draws: a draws x observations matrix

pred_out <- posterior_predict(ord_fit_mean)

# continuous responses
# use a sample of 100 posterior draws

ppc_dens_overlay(y=outcome[!discrete],
                 yrep=pred_out[sample(1:nrow(pred_out),100),!discrete]) +
  theme(text=element_text(family=""))

# discrete responses
# use a sample of 100 posterior draws
# round the draws so all values are whole numbers (0 or 1), as ppc_bars expects

ppc_bars(y=outcome[discrete],
         yrep=round(pred_out[sample(1:nrow(pred_out),100),discrete])) +
  theme(text=element_text(family=""))

That gives you the following plots for the continuous and discrete responses, respectively:

[Plots: ppc_dens_overlay for the continuous responses; ppc_bars for the discrete responses]

I use the round function because the draws from ordbetareg for the discrete responses are not in fact exactly 0 or 1. This is a bit of a hack; it would probably be possible to implement a weighting across all responses instead, but rounding works fine for the purpose of visualizing the predictive distribution. As can be seen, the fit to the continuous values is inexact, while the discrete responses are fairly close.
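
As a quick check of how much work the rounding is doing, here is a snippet of mine (not part of the original workflow; it reuses the pred_out and discrete objects defined above):

# share of predictive draws for the discrete observations that fall strictly
# inside (0, 1), i.e. draws that round() has to coerce to 0 or 1
yrep_disc <- pred_out[, discrete]
mean(!yrep_disc %in% c(0, 1))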

Part of the reason the continuous values do not fit well is that little effort was made in this example to create a well-fitting model. Fit depends not just on the distribution but also on a well-specified linear model, and the linear model here is fairly ad hoc. The large mode at 0.5 reflects a kind of non-response in which respondents left the slider in its default position. Any other relevant model, such as OLS, would likewise fit this unconditional distribution poorly.
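
To quantify that mode, a quick check (assuming, as above, that the outcome stored in the model data is rescaled to [0, 1], so the slider default of 50 maps to 0.5):

# share of responses sitting at exactly the rescaled midpoint,
# i.e. the default slider position
mean(outcome == 0.5)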

The other part of the reason is programmatic: a better visualization would account for the probability that an observed discrete outcome is predicted as continuous, which can happen in the model (e.g., an observed 1 predicted as 0.98). That is why there is a big drop on the right side of the continuous plot: the predictive distribution allows values like 0.98 and 0.99, but those draws are not shown when the original observation was discrete.
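
To illustrate the size of that hidden mass, a rough check of mine (0.95 is an arbitrary cutoff chosen purely for illustration):

# predictive draws for observations recorded as exactly 1 that the model
# places in the continuous range just below 1; these appear in neither plot
draws_at_one <- pred_out[, outcome == 1]
mean(draws_at_one > 0.95 & draws_at_one < 1)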

In any case, even though the fit has some issues, it is still possible to do inference, especially as some of these modes, such as the one at 0.5, are not of importance to the analysis. The posterior predictive distribution essentially spreads probability more smoothly across the outcome's range than the observed data, which makes sense here: the bunching of values at round numbers is an artifact of human cognitive processing/satisficing rather than something we want to model.
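
If the target of inference is a summary statistic such as the mean rather than the full shape of the distribution, one option (a suggestion of mine, not part of the original workflow) is to check fit on that statistic directly with bayesplot:

# posterior predictive check on the sample mean, which is far less sensitive
# to heaping at round values than the full density
ppc_stat(y=outcome,
         yrep=pred_out[sample(1:nrow(pred_out),100),],
         stat="mean")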

The reason for including this data in the package was to demonstrate the package's use with a realistic dataset. You should check posterior predictive fit for your intended application rather than using the package data as a benchmark.

All best -

Bob
