Ordered Beta Regression model: is a custom posterior predictive check plot necessary?

In this recent brilliant article, Dr. Kubinec proposed a new Bayesian model:

  • Ordered Beta Regression

" for continuous distributions with both lower and upper bounds, such as data arising from survey slider scales, visual analog scales, and dose-response relationships", further explained below. The package uses brms under the hood.

"This model employs the cutpoint technique popularized by ordered logit to fit a single linear model to both continuous (0,1) and degenerate [0,1] responses. The model can be estimated with or without observations at the bounds, and as such is a general solution for this type of data. Employing a Monte Carlo simulation, I show that the model is noticeably more efficient than ordinary least squares regression, zero-and-one-inflated beta regression, re-scaled beta regression and fractional logit while fully capturing nuances in the outcome. I apply the model to a replication of the Aidt and Jensen (2012) study of suffrage extensions in Europe. The model can be fit with the R package ordbetareg to facilitate hierarchical, dynamic and multivariate modeling."
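
To make the cutpoint idea concrete, here is a minimal R sketch of how I read the likelihood (this is my own illustration, not code from the paper or the package; eta is the linear predictor, k1 < k2 are the logit-scale cutpoints, and phi is the beta precision):

dordbeta <- function(y, eta, k1, k2, phi) {
  mu <- plogis(eta)                          # mean of the continuous component
  p0 <- 1 - plogis(eta - k1)                 # Pr(y = 0)
  p1 <- plogis(eta - k2)                     # Pr(y = 1)
  pc <- plogis(eta - k1) - plogis(eta - k2)  # Pr(0 < y < 1)
  ifelse(y == 0, p0,
         ifelse(y == 1, p1,
                pc * dbeta(y, mu * phi, (1 - mu) * phi)))
}

# e.g., density of a response of 0.7 when eta = 0.5 and the cutpoints are -2 and 2
dordbeta(0.7, eta = 0.5, k1 = -2, k2 = 2, phi = 10)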

Given these exciting characteristics, I decided to fit the Ordered Beta Regression model described in ordbetareg’s vignette.

Following the vignette, I plotted a density PPC, interpreted by Dr. Kubinec as

“The model can’t capture all of the modality in the distribution – there are effectively four separate modes – but it is reasonably accurate over the middle responses and the responses near the bounds.”

The density plot doesn’t seem ideal for this model, probably because the model corresponds to a mixture of continuous and discrete outcomes.

I thus tried these additional PPCs (ECDF overlay, histogram, stat, and stat_2d; code at the end of this post):

Given these 5 different PPC plots, my first impression was that “this indicates a failure of the model to describe an aspect of the data” (Gelman et al., 2020).

However, Dr. Kubinec suggested that a PPC depicting a combination of a histogram (for the continuous values) and a bar plot (for the 0/1 responses) would be better for the Ordered Beta Regression model. Unfortunately, I don’t know how to create such a plot.

Question

I would like to use this case study to ask:

  • How can I create the custom PPC plot mentioned above?
  • How do you interpret the 5 PPC plots shown above?

Lastly, I’d like to highlight that my ultimate goal is to understand the Ordered Beta Regression model better so I can use it in future analyses. It looks extremely promising.

Code:

pacman::p_load(ordbetareg,
               brms,
               dplyr, # for select(), mutate_at(), mutate() and %>%
               ggplot2,
               patchwork)
data("pew")

model_data <- select(pew,therm,age="F_AGECAT_FINAL",
                     sex="F_SEX_FINAL",
                     income="F_INCOME_FINAL",
                     ideology="F_IDEO_FINAL",
                     race="F_RACETHN_RECRUITMENT",
                     education="F_EDUCCAT2_FINAL",
                     region="F_CREGION_FINAL",
                     approval="POL1DT_W28",
                     born_again="F_BORN_FINAL",
                     relig="F_RELIG_FINAL",
                     news="NEWS_PLATFORMA_W28") %>% 
  mutate_at(c("race","ideology","income","approval","sex","education","born_again","relig"), function(c) {
    factor(c, exclude=levels(c)[length(levels(c))])
  }) %>% 
  # need to make these ordered factors for BRMS
  mutate(education=ordered(education),
         income=ordered(income))

# Fit model 
# ord_fit_mean <- ordbetareg(formula=therm ~ mo(education) + mo(income) + (1|region), 
#                            data=model_data,
#                            cores=2,chains=2,iter=1000)

# Instead of fitting the model, one can load it through the ordbetareg package
data("ord_fit_mean")

pp_check(ord_fit_mean)

pp_check(ord_fit_mean, type = "ecdf_overlay") + 
  pp_check(ord_fit_mean, type = "hist") + 
  plot_annotation(tag_levels = "A")

pp_check(ord_fit_mean, type = "stat") + 
  pp_check(ord_fit_mean, type = "stat_2d") +
  plot_annotation(tag_levels = "A")

Can you add the code you used to make the plots?


The code is shown at the end of the post above.


I am not sure how to make the custom plot you’re asking for, but the histogram seems to do a good job already. The density plot isn’t good for this data (as you noted).

I would be interested to see how the zero-one-mid-inflated beta model from the vignette works on this data, if that spike near the midpoint is right at 0.5 and represents, say, neutral responses on a slider scale.
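
For reference, brms’s built-in zero_one_inflated_beta family handles the inflation at 0 and 1 (a midpoint spike would need a custom family on top of it). A minimal sketch, assuming a hypothetical data frame df with an outcome y on [0,1] and a predictor x:

library(brms)

# ZOIB sketch: models inflation at 0 and 1 alongside the beta component;
# a mid-inflated variant would additionally need a custom family for the 0.5 spike
fit_zoib <- brm(y ~ x, data = df, family = zero_one_inflated_beta())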

Note this comment from Michael Betancourt here regarding mixture models:
"In other words because the the inflated and non-inflated points are essentially modeled by separate data generating processes the entire model decomposes into a binomial model for the total inflated observations and a baseline model for the non-zero observations.

Because these two models are independent we can fit them independently or jointly, whichever is more convenient. In particular if the inflated counts are just a nuisance then we can ignore them entirely and just fit the non-inflated observations directly without any consideration of the inflation!"

That’s worth thinking about when you are fitting some slider scale data where people tend to plop the slider at units of 10 or whatever, with few responses in between.
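
To make that decomposition concrete, here is a minimal sketch of the independent two-part fit (my own illustration, with a hypothetical data frame df holding a bounded outcome y and a predictor x):

library(betareg)  # beta regression for the continuous part; glm() handles the binary part

df$at_bound <- df$y %in% c(0, 1)

# Binary model for whether a response lands at a bound
fit_bound <- glm(at_bound ~ x, data = df, family = binomial)

# Beta regression for the strictly interior responses only
fit_inner <- betareg(y ~ x, data = subset(df, !at_bound))

Note that this independence argument applies when the bound and interior parts have separate parameters, as in ZOIB; as I understand it, the ordered beta model deliberately shares one linear predictor across both parts, so it does not decompose this way.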


Thanks for sharing this model @arthur-albuquerque. I’d heard of people constructing these types of models but had not used them myself. I previously did some work with a typical beta regression, using the transformation approach of squishing the extreme values at 0 or 1 to fall strictly between 0 and 1, but was never quite satisfied with that!

I think this is probably correct:
"Given these 5 different PPC plots, my first impression was that “this indicates a failure of the model to describe an aspect of the data”

I know this doesn’t help for the current data that you have, but after running some experiments using slider scales and similar response options, I personally felt that people were essentially treating the slider scale as ordinal: they usually just pick specific increments, and you can see in your PPC plot that almost everyone is picking something like 0, 25, 50, ~62.5, 75, ~80, and 100. I ultimately started to feel I’d do a better job not by coming up with really smart ways of analysing the slider data, but by presenting people with an ordinal response format that matches how they actually approach the issue, and just analysing that instead (rough sketch below)!
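
Something like this, for instance (purely hypothetical code with a data frame df, a slider outcome y on [0,1], and a predictor x; the breakpoints are made up):

library(brms)

# Collapse slider values to the increments people actually seem to use,
# then model them as ordered categories
df$y_ord <- cut(df$y,
                breaks = c(-Inf, 0.125, 0.375, 0.625, 0.875, Inf),
                labels = c("0", "25", "50", "75", "100"),
                ordered_result = TRUE)

fit_ord <- brm(y_ord ~ x, data = df, family = cumulative("logit"))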

However, I’d definitely be interested to see if you make further developments in implementing and understanding this approach in brms!


I’ve been experimenting with ordered beta and zero-one-inflated-beta (ZOIB) models for some education data I’ve been working with and have been trying to come up with custom PPC plots as well. Below is an example where the posterior draws and the actual outcome data are plotted together on each facet. I like this a bit better than the built-in pp_check(..., type="hist") version because I find it easier to compare the model to the data.

library(tidyverse)
theme_set(theme_classic())
library(brms)
library(ordbetareg)

data("ord_fit_mean")

# See function code below
plot_posterior(ord_fit_mean) 

Each plot is a histogram of one simulated outcome from the posterior (blue bars) and a histogram of the actual outcome data (red bars). The area of overlap between posterior and data is purple (the choice of colors could probably be improved to get better contrast on the area of overlap). Red areas are where the model underpredicts relative to the data. Blue areas are where the model overpredicts relative to the data. The bar for the zeros is plotted just to the left of zero and the bar for the ones is plotted just to the right of one.

Just to show how this plot looks when the fit is a bit better, here’s another version of this plot using some data I simulated from a ZOIB distribution and then fit with ordered beta regression:

# sim.ordbeta is a brm model object fit with simulated ZOIB data
plot_posterior(sim.ordbeta)

Code for plot_posterior:

#' Plot histograms of the posterior simulations from ZOIB and ordered beta models
#'
#' @param model A brms model object
#' @param nbins The number of histogram bins. Default: 30.
#' @param ndraws Number of simulations from the posterior (default: 9). Note, this will result in `ndraws` separate plot panels.
#' @param ncol Number of columns in the facetted plot output
#' @param add.means Should vertical mean lines be added? Default: `FALSE`.
#'
#' @return
#' Multipanel histogram (one for each draw) with predictions in blue and actual data overlaid in red. Purple represents areas of overlap between model and data. If `add.means=TRUE`, dashed red vertical line is data mean; blue vertical line is predicted mean.
plot_posterior = function(model, nbins=30, ndraws=9,
                          ncol=NULL, add.means=FALSE) {

  # Get actual outcome data
  actual = model$data
  names(actual)[1] = "data"

  # Get posterior simulations
  posterior.predictions = sim_posterior(model=model, ndraws=ndraws)

  pad = 1/nbins
  breaks = c(-pad,
             seq(0, 1 - 1/nbins, length=nbins),
             1 - 0.001*pad,
             1 + pad)

  # Plot predictions and overlay actual data
  p = posterior.predictions %>%
    ggplot(aes(x = value)) +
    geom_histogram(breaks=breaks, fill=hcl(240,100,60), alpha=0.5,
                   color="white", size=0.2) +
    geom_histogram(data=actual, aes(data),
                   breaks=breaks, fill=hcl(0,100,60), alpha=0.5,
                   colour="white", size=0.2) +
    geom_vline(xintercept=c(0,1), colour="white", size=1.2) +
    geom_hline(yintercept=0, size=1) +
    labs(x=NULL, y=NULL) +
    scale_y_continuous(breaks=NULL, limits=c(0,NA), expand=expansion(c(0,0.03))) +
    scale_x_continuous(limits=c(-pad, 1 + pad), expand=expansion(c(0.01, 0.01))) +
    facet_wrap(~ sim, ncol=ncol, labeller=label_both) +
    theme(
      strip.background=element_blank(),
      strip.text=element_blank()
    )

  if(add.means) {
    p = p +
      geom_vline(xintercept=mean(model$data[,1]), linetype=2, colour="red") +
      geom_vline(data=. %>%
                   summarise(value=mean(value)),
                 aes(xintercept=value), colour="blue")
  }

  p
}

#' Get posterior simulations as a long data frame
#'
#' @param model The model object returned by `brm`
#' @param ndraws Number of simulated datasets to draw from the posterior
sim_posterior = function(model, ndraws) {

  posterior_predict(model, ndraws = ndraws) %>%
    data.frame() %>%
    mutate(sim = 1:n()) %>%
    pivot_longer(-sim)
}

I liked this; it’s really visually appealing. Thanks! In brief, it’s a reworking of brms::pp_check(..., type = "hist").

Yes, similar idea to pp_check(..., type="hist") (from the bayesplot package; brms uses the bayesplot functions), except (1) each panel includes both the actual data and one draw from the posterior, and (2) the zeros and ones are plotted separately from the values that are between zero and one.


Hi Arthur -

At present there is no way to truly visualize the distribution using conventional plots. I show below how to do it with bayesplot, but it still isn’t perfect. I plan to add a more accurate posterior prediction visualization method to the next version of the ordbetareg package. For now, a better way of doing predictive visualization would be the following:

pacman::p_load(ordbetareg,
               brms,
               dplyr, # for select(), mutate_at(), mutate() and %>%
               ggplot2,
               patchwork,
               bayesplot)
data("pew")

model_data <- select(pew,therm,age="F_AGECAT_FINAL",
                     sex="F_SEX_FINAL",
                     income="F_INCOME_FINAL",
                     ideology="F_IDEO_FINAL",
                     race="F_RACETHN_RECRUITMENT",
                     education="F_EDUCCAT2_FINAL",
                     region="F_CREGION_FINAL",
                     approval="POL1DT_W28",
                     born_again="F_BORN_FINAL",
                     relig="F_RELIG_FINAL",
                     news="NEWS_PLATFORMA_W28") %>%
  mutate_at(c("race","ideology","income","approval","sex","education","born_again","relig"), function(c) {
    factor(c, exclude=levels(c)[length(levels(c))])
  }) %>%
  # need to make these ordered factors for BRMS
  mutate(education=ordered(education),
         income=ordered(income))

# Fit model
# ord_fit_mean <- ordbetareg(formula=therm ~ mo(education) + mo(income) + (1|region),
#                            data=model_data,
#                            cores=2,chains=2,iter=1000)

# Instead of fitting the model, one can load it through the ordbetareg package
data("ord_fit_mean")

# need to get the predicted outcome distribution and plot it separately for discrete versus continuous responses
# use bayesplot to do plotting manually

# determine which indicators are discrete (equal to 0 or 1)

discrete <- ord_fit_mean$data$therm %in% c(0,1)
outcome <- ord_fit_mean$data$therm

# predicted outcome

pred_out <- posterior_predict(ord_fit_mean)

# continuous responses
# use sample of 100 iterations

ppc_dens_overlay(y=outcome[!discrete],
         yrep=pred_out[sample(1:nrow(pred_out),100),!discrete]) +
  theme(text=element_text(family=""))

# discrete responses
# use sample of 100 iterations
# use round function to force observations to be whole numbers

ppc_bars(y=outcome[discrete],
         # as.integer() flattens the matrix in column-major order, so refill it
         # column-major (the default) to keep each row as one posterior draw;
         # byrow = TRUE would scramble draws across observations
         yrep=matrix(as.integer(round(pred_out[sample(1:nrow(pred_out),100),discrete])),
                     ncol=sum(discrete), nrow=100)) +
  theme(text=element_text(family=""))

That gives you the following plots for continuous and discrete responses, respectively:


I use the round function because, for responses that were discrete in the data, the draws from ordbetareg are not necessarily discrete themselves. This is a bit of a hack; it is probably possible to implement a weighting across all responses, but in any case it works fine for the purpose of visualizing the predictive distribution. As can be seen, the fit to the continuous values is inexact, but the discrete responses are fairly close.

Part of the reason the continuous values do not fit well is that little effort was made in this example to create a well-fitting model. Fit does not depend only on the distribution but also on a well-specified linear model, and the linear model in this case is fairly ad hoc. The large mode at 0.5 reflects a kind of non-response in which respondents left the slider in its default position. All other relevant models, such as OLS, will likewise exhibit poor fit to this unconditional distribution.

The other part of the reason is programmatic: a better visualization would consider the probability that a discrete outcome is predicted as continuous, which can occur in the model (such as a 1 being predicted as 0.98). That is why there is a big drop at the right side of the continuous plot: the predictive distribution allows for values like 0.98 and 0.99, but these are not shown when the original observation was discrete.
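
A quick numerical check of that part is also possible, reusing outcome and pred_out from the code above:

# Observed share of responses at each bound
obs_p0 <- mean(outcome == 0)
obs_p1 <- mean(outcome == 1)

# Share of 0s and 1s in each posterior draw
# (if draws at the bounds were not exactly 0/1, compare with a tolerance instead)
pred_p0 <- rowMeans(pred_out == 0)
pred_p1 <- rowMeans(pred_out == 1)

# Posterior interval for each share next to the observed value
quantile(pred_p0, c(0.05, 0.5, 0.95)); obs_p0
quantile(pred_p1, c(0.05, 0.5, 0.95)); obs_p1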

In any case, even though the fit has some issues, it is still possible to do inference, especially as some of these modes, such as the one at 0.5, are not of importance to the analysis. The posterior distribution essentially treats all values of the outcome as more likely than the observed data would suggest, which in this case makes sense: the bunching of values at whole numbers is an artifact of human cognitive processing/satisficing rather than something we want to model.

The reason for including this data in the package was to demonstrate the package’s use with a realistic dataset. You should check posterior predictive fit for your intended application rather than using the package data as a benchmark.

All best -

Bob


Hi Bob,

Your code and reply were very insightful.

Thanks.
