Hi. I am trying to explain and justify a model comparison for ordinal ratings using BRMS. My approach is similar to the following:
library(ordinal) # For dataset
library(loo)
library(brms)
# Load the wine dataset
data(wine)
# Create 3 models with BRMS and compare
# Null model
nullmodel <- brm(rating ~ (1|judge), data = wine,
family = cumulative(threshold = "flexible"),
chains = 2, cores = 2, iter=2000)
# Group level temp
grouplevel <- brm(rating ~ temp + (1|judge), data = wine,
family = cumulative(threshold = "flexible"),
chains = 2, cores = 2, iter=2000)
# Individual level temp
indlevel <- brm(rating ~ (temp|judge), data = wine,
family = cumulative(threshold = "flexible"),
chains = 2, cores = 2, iter=2000)
# Loo
loo_nullmodel <- loo(nullmodel)
loo_grouplevel <- loo(grouplevel)
loo_indlevel <- loo(indlevel)
# Compare
loo_compare(loo_nullmodel,loo_grouplevel,loo_indlevel)
> loo_compare(loo_nullmodel,loo_grouplevel,loo_indlevel)
elpd_diff se_diff
grouplevel 0.0 0.0
indlevel -4.3 2.5
nullmodel -14.7 4.7
In this example, I am finding that the group-level model is likely the best and that the null model, though simpler, is about 3.1 standard errors (14.7 / 4.7) worse in expected predictive performance.
However, I now have a reviewer asking me: “How did you get the final score, it should be clearly explained to make it more understandable. The mathematic foundation should be well presented.”
My first attempt would be something as follows.
Model Comparison Using PSIS-LOO for Ordinal Data
To assess the predictive accuracy of competing models for our ordinal outcome, we employed Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO; Vehtari, Gelman, & Gabry, 2017). Because the outcome of a cumulative link mixed model (CLMM) is discrete, the estimated quantity is an expected log pointwise predictive mass (elppm) rather than a density; the loo package reports the same quantity under the generic label elpd. This approach lets us compare models by approximating how well each one predicts new observations.
Computation of \text{elppm}_{\text{loo}}
In PSIS-LOO, the expected log pointwise predictive mass (\text{elppm}_{\text{loo}}) is calculated as:

\text{elppm}_{\text{loo}} = \sum_{i=1}^{n} \log \left( \frac{\sum_{s=1}^{S} w_i^s \, p(y_i \mid \theta^s)}{\sum_{s=1}^{S} w_i^s} \right)
where:
- y_i denotes the observed ordinal response for observation i,
- \theta^s is the $s$th posterior draw from the model’s parameter space,
- w_i^s represents the importance weights stabilized through Pareto smoothing, and
- p(y_i \mid \theta^s) is the probability mass function for the ordinal response given parameter values \theta^s.
PSIS-LOO provides reliable estimates even when the importance weights have a heavy right tail, a common occurrence when some observations are highly influential. The raw importance ratios for leaving out observation i,

r_i^s = \frac{1}{p(y_i \mid \theta^s)},

are stabilized by fitting a generalized Pareto distribution to their upper tail and replacing the largest ratios with order statistics of that fit, yielding the smoothed weights w_i^s; the estimated shape parameter \hat{k} serves as a diagnostic of the reliability of the approximation (Vehtari et al., 2024).
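To make the computation concrete, elppm_loo can be reproduced directly from the pointwise log-likelihood matrix and the smoothed weights. The following is a minimal sketch, assuming the grouplevel fit from above with 2 chains of 1,000 post-warmup draws each (the brms default when iter = 2000); in practice one would simply call loo().

# Sketch: reproduce elppm_loo for the group-level model "by hand"
ll <- log_lik(grouplevel)                     # S x N matrix of log p(y_i | theta^s)
reff <- loo::relative_eff(exp(ll),
                          chain_id = rep(1:2, each = 1000))  # assumes 1000 post-warmup draws per chain
psis_obj <- loo::psis(-ll, r_eff = reff)      # raw LOO log-ratios are -log p(y_i | theta^s)
lw <- weights(psis_obj)                       # log of smoothed, self-normalized weights w_i^s
log_sum_exp <- function(x) { m <- max(x); m + log(sum(exp(x - m))) }
elppm_i <- apply(lw + ll, 2, log_sum_exp)     # log sum_s w_i^s p(y_i | theta^s) for each observation
sum(elppm_i)                                  # matches loo(grouplevel)$estimates["elpd_loo", "Estimate"] up to MC noise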
Probability Mass for Cumulative Ordinal Models
For our cumulative ordinal models, we specify the probability of observing response category k for observation i as:

\Pr(y_i = k \mid \theta) = F(\tau_k - \eta_i) - F(\tau_{k-1} - \eta_i), \quad \text{with } \tau_0 = -\infty \text{ and } \tau_K = +\infty,
where:
- F is the logistic cumulative distribution function (the inverse of the logit link),
- \tau_k and \tau_{k-1} are threshold parameters, and
- \eta_i is the linear predictor for observation i.
This setup aligns with the cumulative ordinal family in BRMS, enabling flexible thresholds across categories.
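As a concrete illustration of this parameterization, the fitted category probabilities can be obtained with posterior_epred(), or computed by hand as differences of the logistic CDF; the threshold and predictor values below are illustrative placeholders, not fitted estimates.

# Fitted category probabilities Pr(y_i = k) from the cumulative model
pr <- posterior_epred(grouplevel)   # draws x observations x K categories (K = 5 for the wine ratings)
dim(pr)
# The same quantity by hand: F(tau_k - eta) - F(tau_{k-1} - eta), with F = plogis
# and boundary thresholds tau_0 = -Inf, tau_K = +Inf
tau <- c(-Inf, -1.2, 0.3, 1.5, 2.8, Inf)   # illustrative thresholds, not fitted values
eta <- 0.5                                  # illustrative linear predictor
diff(plogis(tau - eta))                     # probabilities for categories 1..5; they sum to 1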
Model Specifications and Comparison
We fitted three models of increasing complexity to explore how serving temperature (temp, a two-level factor: cold vs. warm) affects the rating (an ordinal outcome with five categories):
- Null model: \text{rating} \sim (1 | \text{judge}), a baseline model without temperature,
- Group-level model: \text{rating} \sim \text{temp} + (1 | \text{judge}), including temperature as a fixed effect, and
- Individual-level model: \text{rating} \sim (\text{temp} | \text{judge}), allowing the effect of temperature to vary across judges (with no population-level temperature term).
Using PSIS-LOO via loo() and loo_compare(), we evaluated these models as follows:
| Model | elppm_diff | SE_diff |
|---|---|---|
| Group-level | 0.0 | 0.0 |
| Individual-level | -4.3 | 2.5 |
| Null model | -14.7 | 4.7 |
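If it helps the reviewer, both columns can be reproduced from the pointwise contributions stored in each loo object: elpd_diff is the sum of the pointwise differences, and se_diff is their standard deviation scaled by sqrt(n) (Vehtari, Gelman, & Gabry, 2017). A short sketch, assuming the loo objects computed above:

# Reproduce elpd_diff and se_diff for the null vs. group-level comparison
pt_group <- loo_grouplevel$pointwise[, "elpd_loo"]   # n pointwise contributions
pt_null  <- loo_nullmodel$pointwise[, "elpd_loo"]
d <- pt_null - pt_group                              # pointwise differences
sum(d)                                               # elpd_diff for the null-model row (about -14.7)
sqrt(length(d) * var(d))                             # se_diff (about 4.7)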
Interpreting the elppm Differences
The group-level model showed the highest expected predictive accuracy, suggesting that including temperature as a population-level (fixed) effect improves predictive performance without overfitting. Relative to the group-level model, the null model's elppm is lower by approximately 14.7, with a standard error of the difference of 4.7. Since this difference is roughly three times its standard error, well beyond the informal two-standard-error guideline of Vehtari, Gelman, and Gabry (2017), we conclude that the group-level temperature model has clearly better expected out-of-sample predictive performance than the null model.
The individual-level model also outperformed the null model, but its elppm was 4.3 (SE 2.5) lower than that of the group-level model, a difference of less than two standard errors; the added complexity of judge-specific temperature effects is therefore not supported by the data.
Diagnostic Checks for Stability
PSIS-LOO diagnostics indicated that all Pareto k values were below the recommended threshold of 0.7, suggesting that importance sampling was reliable and that influential observations did not unduly affect model comparisons.
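These diagnostics can be reported explicitly, for example:

# Pareto k diagnostics; values below 0.7 indicate reliable PSIS-LOO
loo::pareto_k_table(loo_grouplevel)        # counts of observations in each k range
max(loo::pareto_k_values(loo_grouplevel))  # largest k across observations
plot(loo_grouplevel)                       # k value for every observation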
In summary, the group-level temperature model balances parsimony and predictive performance, making it the optimal choice for interpreting the influence of temperature on ordinal ratings.
References
- Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413-1432.
- Vehtari, A., Simpson, D., Gelman, A., Yao, Y., & Gabry, J. (2024). Pareto smoothed importance sampling. Journal of Machine Learning Research, 25.
- BĂĽrkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1-28.
All help with references as well as specific wording would be extremely welcome.
Thank you for reading.