I am dealing with a bit of a contrived example of ordinal response data, that was purposefully engineered to grossly violate the proportional odds assumption. In the example dataset, we have three varieties of plant that have different rating categories, C (worst), B, and A (best). The number of plants in each variety rated in each of the three categories is given for 12 different growers. As you can see from the raw data plot, Variety 1 has roughly equal observed proportion of ratings in the three categories. Variety 2 has mostly B rating, and Variety 3 has mostly A and C rating with few B. This means that the proportional odds assumption is violated because the relative probabilities of the ratings are very different between the varieties. But Variety 3 is what makes the dataset pathological. Contrary to what you would expect from the ordinal rating scale, it has high probability of being in the lowest or highest rating category, but low probability of being in the middle. (Let’s imagine it is one of those polarizing flavors like cilantro, where “you love it or you hate it” but few people are in the middle!)
Observed data plot
Conditional effects plot for proportional odds model
Fitting a proportional odds model to these data, with fixed effect of variety (and a random intercept and slope term for grower) produces very bad predictions, as we would expect.
Conditional effects plot for partial proportional odds model
However, relaxing the proportional odds assumption by wrapping the fixed variety term in cs()
, to fit a so-called partial proportional odds model, doesn’t fully fix the problem. It looks like there is something in the model structure that is constraining the relative ordering of the conditional probabilities to be the same across the three varieties. The point estimate of p(rating=B) is greater than the point estimate of p(rating=C) and p(rating=A) for all three varieties. What I want is a model that will allow the estimate of the conditional probability of each rating to vary by variety so that p(rating=B) can be greatest in some varieties and lowest in others. Is this possible with this class of model in brms? Or is it necessary to create a custom Stan model?
Code to reproduce example data, fit models, and make plots
library(brms)
library(dplyr)
library(ggplot2)
exdata <- structure(list(Grower = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L,
12L), levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12"), class = "factor"),
Variety = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L,
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L,
3L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L,
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L,
3L), levels = c("1", "2", "3"), class = "factor"),
Rating = structure(c(3L,
2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 1L,
3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 2L, 1L, 3L, 2L,
1L, 2L, 1L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 1L,
3L, 2L, 1L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 2L, 1L, 3L, 2L, 1L,
3L, 2L, 1L, 3L, 2L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 1L,
3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 2L, 1L, 3L, 2L,
1L), levels = c("C", "B", "A"), class = c("ordered", "factor"
)),
N_Plants = c(20L, 16L, 5L, 1L, 36L, 4L, 4L, 2L, 37L, 21L,
15L, 13L, 8L, 36L, 30L, 2L, 11L, 26L, 14L, 3L, 6L, 35L, 1L, 12L,
3L, 30L, 2L, 22L, 25L, 28L, 16L, 8L, 2L, 37L, 8L, 41L, 17L, 29L,
2L, 1L, 46L, 6L, 14L, 22L, 24L, 24L, 30L, 1L, 11L, 36L, 9L, 2L,
43L, 4L, 32L, 5L, 6L, 13L, 18L, 11L, 45L, 1L, 39L, 2L, 4L, 23L,
21L, 3L, 9L, 32L, 36L, 1L, 5L, 10L, 18L, 17L, 15L, 29L, 29L,
4L, 15L, 32L, 10L, 5L, 2L, 42L, 1L, 30L, 4L, 12L, 3L, 9L, 29L,
28L, 12L, 10L, 3L, 30L)), class = "data.frame", row.names = c(NA,
-98L))
# Plot raw data (probability of each rating at each grower)
observed_p <- exdata %>%
group_by(Grower, Variety) %>%
mutate(p = N_Plants/sum(N_Plants))
ggplot(observed_p, aes(x = Variety, group = interaction(Variety, Rating), color = Rating, y = p)) +
geom_point(position = position_dodge(width = .5)) +
theme_bw() +
scale_color_discrete()
# Proportional odds model
fit_po <- brm(
Rating ~ Variety + (1 + Variety | Grower),
family = cumulative(link = 'logit', threshold = 'flexible'),
data = exdata,
seed = 130
)
# Model relaxing proportional odds assumption
fit_ppo <- brm(
Rating ~ cs(Variety) + (1 + Variety | Grower),
family = cumulative(link = 'logit', threshold = 'flexible'),
data = exdata,
seed = 130
)
conditional_effects(fit_po, categorical = TRUE)
conditional_effects(fit_ppo, categorical = TRUE)