Justifying PSIS-LOO model comparison for ordinal data in BRMS (with mathematical foundations)

Hi. I am trying to explain and justify a model comparison for ordinal ratings using BRMS. My approach is similar to the following:

library(ordinal) # For dataset
library(loo)
library(brms)

# Load the wine dataset
data(wine)

# Create 3 models with BRMS and compare
# Null model
nullmodel <- brm(rating ~ (1|judge), data = wine,
                    family = cumulative(threshold = "flexible"),
                    chains = 2, cores = 2, iter=2000)

# Group level temp
grouplevel <- brm(rating ~ temp + (1|judge), data = wine,
                    family = cumulative(threshold = "flexible"),
                    chains = 2, cores = 2, iter=2000)
# Individual level temp
indlevel <- brm(rating ~ (temp|judge), data = wine,
                    family = cumulative(threshold = "flexible"),
                    chains = 2, cores = 2, iter=2000)

# Loo
loo_nullmodel <- loo(nullmodel)
loo_grouplevel <- loo(grouplevel)
loo_indlevel <- loo(indlevel)

# Compare
loo_compare(loo_nullmodel,loo_grouplevel,loo_indlevel)

> loo_compare(loo_nullmodel,loo_grouplevel,loo_indlevel)
           elpd_diff se_diff
grouplevel   0.0       0.0  
indlevel    -4.3       2.5  
nullmodel  -14.7       4.7

In this example, I am finding that the grouplevel model is likely the best and that the null model, though simpler, is roughly 3.1 (14.7/4.7) standard errors worse in its expected predictive performance.
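For reference, the numbers reported by loo_compare() can be reproduced from the pointwise elpd values stored in the loo objects above; a minimal sketch, assuming the objects defined earlier and that both models saw the observations in the same order:

# Pointwise elpd values for the two models being compared
elpd_group <- loo_grouplevel$pointwise[, "elpd_loo"]
elpd_null  <- loo_nullmodel$pointwise[, "elpd_loo"]
diff_i    <- elpd_group - elpd_null              # pointwise differences
elpd_diff <- sum(diff_i)                         # ~ 14.7 in favor of grouplevel
se_diff   <- sqrt(length(diff_i)) * sd(diff_i)   # ~ 4.7
elpd_diff / se_diff                              # the ~3.1 ratio quoted above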

However, I now have a reviewer asking me: “How did you get the final score, it should be clearly explained to make it more understandable. The mathematic foundation should be well presented.”

My first attempt would be something as follows.

Model Comparison Using PSIS-LOO for Ordinal Data

To assess the predictive accuracy of competing models for our ordinal outcome, we employed Pareto-smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO). When PSIS-LOO is applied to cumulative link mixed models (CLMMs), it estimates the expected log pointwise predictive mass (elppm) rather than a density, acknowledging the discrete nature of ordinal outcomes. This approach allows us to compare models by approximating how well each one would predict new observations.

Computation of \text{elppm}_{\text{loo}}

In PSIS-LOO, the expected log pointwise predictive mass (\text{elppm}_{\text{loo}}) is calculated as:

\text{elppm}_{\text{loo}} = \sum_{i=1}^n \log \left(\frac{1}{S} \sum_{s=1}^S w_i^s p(y_i \mid \theta^s) \right)

where:

  • y_i denotes the observed ordinal response for observation i,
  • \theta^s is the $s$th posterior draw from the model’s parameter space,
  • w_i^s represents the importance weights stabilized through Pareto smoothing, and
  • p(y_i \mid \theta^s) is the probability mass function for the ordinal response given parameter values \theta^s.

PSIS-LOO provides reliable estimates even when the importance weights have a heavy right tail, a common occurrence when some observations are highly influential. The weights w_i^s are stabilized using Pareto smoothing to mitigate variability from such influential observations:

w_i^s = \frac{1 / p(y_i \mid \theta^s)}{\frac{1}{S} \sum_{s'=1}^S 1 / p(y_i \mid \theta^{s'})}
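As a concrete illustration (a sketch only, using the grouplevel fit from the code above), these weights can be formed from the log-likelihood draws that brms exposes via log_lik():

ll  <- log_lik(grouplevel)   # S x n matrix of log p(y_i | theta^s)
i   <- 1                     # an arbitrary observation
r_s <- exp(-ll[, i])         # importance ratios proportional to 1 / p(y_i | theta^s)
w_s <- r_s / mean(r_s)       # the mean-one normalisation in the equation above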

Probability Mass for Cumulative Ordinal Models

For our cumulative ordinal models, we specify the probability of observing each response category k as:

P(y_i = k \mid \theta) = F(\tau_k - \eta_i) - F(\tau_{k-1} - \eta_i)

where:

  • F is the cumulative logistic distribution function,
  • \tau_k and \tau_{k-1} are threshold parameters, and
  • \eta_i is the linear predictor for observation i.

This setup aligns with the cumulative ordinal family in BRMS, enabling flexible thresholds across categories.
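As a small numerical illustration of this probability mass (the threshold and linear-predictor values below are hypothetical, not taken from a fitted model):

tau <- c(-Inf, -2, -0.5, 1, 2.5, Inf)  # tau_0 ... tau_5 for K = 5 rating categories
eta <- 0.3                             # linear predictor eta_i for one observation
k   <- 3                               # category of interest
p_k <- plogis(tau[k + 1] - eta) - plogis(tau[k] - eta)  # P(y_i = k | theta); tau[k + 1] holds tau_k in the 1-indexed vector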

Model Specifications and Comparison

We fitted three models of increasing complexity to explore how serving temperature (temp, a two-level factor: cold vs. warm) affects the rating (an ordinal outcome):

  1. Null model: \text{rating} \sim (1 | \text{judge}), a baseline model without temperature,
  2. Group-level model: \text{rating} \sim \text{temp} + (1 | \text{judge}), including temperature as a fixed effect, and
  3. Individual-level model: \text{rating} \sim (\text{temp} | \text{judge}), allowing temperature to vary by judge.

Using PSIS-LOO via loo_compare(), we evaluated these models as follows:

Model              elppm_diff  SE_diff
Group-level           0.0        0.0
Individual-level     -4.3        2.5
Null model          -14.7        4.7

Interpreting the elppm Differences

The group-level model showed the highest estimated predictive accuracy, suggesting that including temperature as a fixed effect improves predictive performance without overfitting. Relative to the null model, the group-level model has an elppm advantage of approximately 14.7, with a standard error of 4.7. Since this difference is substantially larger than twice its standard error, following the guidelines in [1], we conclude that the group-level temperature model has better expected predictive performance than the null model.

The individual-level model also performed better than the null model, but the difference from the group-level model did not justify the increased complexity of allowing temperature effects to vary across judges.

Diagnostic Checks for Stability

PSIS-LOO diagnostics indicated that all Pareto k values were below the recommended threshold of 0.7, suggesting that importance sampling was reliable and that influential observations did not unduly affect model comparisons.
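These diagnostics can be inspected directly from the loo objects computed above, for example (a sketch using functions from the loo package):

pareto_k_table(loo_grouplevel)        # counts of observations per diagnostic range
max(pareto_k_values(loo_grouplevel))  # below 0.7 for all observations in our fits
plot(loo_grouplevel)                  # pointwise k values for visual inspection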

In summary, the group-level temperature model balances parsimony and predictive performance, making it the optimal choice for interpreting the influence of temperature on ordinal ratings.

References

  1. Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413-1432.
  2. Vehtari, A., Simpson, D., Gelman, A., Yao, Y., & Gabry, J. (2024). Pareto smoothed importance sampling. Journal of Machine Learning Research, 25.
  3. BĂĽrkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1-28.

All help with references as well as specific wording would be extremely welcome.

Thank you for reading.

I think it would be good to separate PSIS and LOO in the explanation.

It would be good to first explain LOO assuming the computation just works. For the mathematical and decision-theoretic foundation of LOO you can cite A survey of Bayesian predictive methods for model assessment, selection and comparison. The term elpd was introduced in Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC | Statistics and Computing, and you are correct in using probability (mass) instead. For the mathematical foundation of elpd_diff and se_diff you can cite [2008.10296] Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison. So far there hasn't been a need to discuss computation, and this separation should make it easier to see that the PSIS part doesn't change the fact that you are just using leave-one-out cross-validation with the log score.

Then you can more briefly tell how PSIS is used to make the computation fast, and if your diagnostics indicate good accuracy for the computation you don’t need to discuss Pareto smoothing at all. Also now the way you describe Pareto-k diagnostic unnecessarily convolves the diagnostic for computation and possibility that rare influential observations might affect model comparison.

The equation here is self-normalization and not Pareto smoothing.

If you edit your explanation, I can comment again later, but I don’t have time in the next few days for writing any specific wording. CV-FAQ might have some useful wordings.


Thank you very much for your response, Aki. Updating it by segmenting PSIS and the ELPD part makes a lot of sense. Please let me know if you think it is clearer below.

Also, to make my use case slightly clearer: I have 16 people who have each rated a set of stimuli 256 times. In that case, as I understand your references, a 2 SE threshold is reasonable given the number of data points in all models, especially if all Pareto k values are below 0.7?

I have tried to make things clearer below. (For some reason, I couldn’t update my original post—perhaps because it had been edited by a mod who added an additional tag?)

Model Comparison using LOO-CV for Ordinal Wine Ratings, Efficiently Approximated by PSIS

We compared Bayesian ordinal models that predict wine ratings from temperature, using leave-one-out cross-validation (LOO-CV) to assess out-of-sample predictive performance. This approach allows us to estimate how well each model generalizes to new, unseen data, a critical aspect of model selection.

1. Conceptual Foundation: Leave-One-Out Cross-Validation (LOO-CV)

LOO-CV evaluates a model’s predictive ability by sequentially holding out one data point, training the model on the remaining n-1 data points, and then predicting the held-out observation. This process is repeated for all n observations. For ordinal data like our wine ratings, we use the expected log pointwise predictive mass (elppm) to quantify predictive accuracy. The elppm reflects the probability assigned by the model to the observed ordinal category. A higher elppm indicates better predictive performance. (Vehtari et al., 2017; Vehtari & Ojanen, 2012)

Mathematically, the LOO-CV estimate of the elppm (\text{elppm}_{\text{loo}}) is:

\text{elppm}_{\text{loo}} = \sum_{i=1}^n \log p(y_i \mid y_{-i})

where:

  • n is the total number of observations,
  • y_i is the i-th observation,
  • y_{-i} represents all data except y_i, and
  • p(y_i \mid y_{-i}) is the probability the model assigns to the observed category of y_i, having been trained on y_{-i}.
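To make p(y_i \mid y_{-i}) concrete, it could in principle be computed exactly for a single observation by refitting the model without that observation. A conceptual sketch only (this is the expensive step that PSIS avoids; grouplevel and wine are the objects from the code above):

i <- 1                                                   # an arbitrary held-out observation
fit_minus_i <- update(grouplevel, newdata = wine[-i, ])  # refit on the remaining n - 1 rows
ll_i <- log_lik(fit_minus_i, newdata = wine[i, , drop = FALSE])
log(mean(exp(ll_i)))   # log p(y_i | y_-i): log of the posterior-averaged probability mass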

2. Model Comparison and Uncertainty

Model comparison involves calculating the difference in elppm_loo between two models (elppm_diff) and the standard error of this difference (SE_diff). A large elppm_diff relative to its SE_diff suggests a practically meaningful difference in predictive accuracy, favoring the model with the higher elppm_loo (Vehtari et al., 2021). However, careful interpretation is crucial, especially with small sample sizes, when models make very similar predictions, or when there is concern about model misspecification (Sivula et al., 2020).
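Written out in terms of the pointwise values (the normal approximation for this difference is the one analysed by Sivula et al., 2020; A and B index the two models being compared):

\text{elppm}_{\text{diff}} = \sum_{i=1}^n \left[ \log p_A(y_i \mid y_{-i}) - \log p_B(y_i \mid y_{-i}) \right], \qquad \text{SE}_{\text{diff}} = \sqrt{n \, \widehat{\operatorname{Var}}_i \left[ \log p_A(y_i \mid y_{-i}) - \log p_B(y_i \mid y_{-i}) \right]}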

3. Efficient Computation with Pareto Smoothed Importance Sampling (PSIS)

Performing LOO-CV directly can be computationally intensive, requiring n model fits. Pareto Smoothed Importance Sampling (PSIS) provides an efficient approximation. PSIS uses importance sampling to re-weight posterior samples from a single full model fit to approximate the leave-one-out posterior distributions, avoiding the need for repeated model fitting. (Vehtari et al., 2024)

The reliability of the PSIS approximation is diagnosed using the Pareto shape parameter k. Values of k below 0.7 generally indicate reliable PSIS-LOO estimates.
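For illustration, the PSIS step can be made explicit with the loo package (a sketch; loo(grouplevel) performs the equivalent computation internally, with an additional relative-efficiency correction for the MCMC draws):

ll       <- log_lik(grouplevel)              # S x n matrix of log p(y_i | theta^s)
psis_obj <- loo::psis(-ll)                   # Pareto-smooth the ratios 1 / p(y_i | theta^s)
w        <- weights(psis_obj, log = FALSE)   # normalised, smoothed weights (S x n)
sum(log(colSums(w * exp(ll))))               # PSIS-LOO estimate of elppm_loo
max(pareto_k_values(psis_obj))               # Pareto shape k diagnostic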

4. Application to Ordinal Wine Ratings

We fitted and compared three ordinal models:

  • Null Model: rating ~ (1|judge) (no effect of temperature)

  • Group-Level Model: rating ~ temp + (1|judge) (temperature as a fixed effect)

  • Individual-Level Model: rating ~ (temp|judge) (temperature effect varies by judge)

Model comparison was performed using Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO), implemented with the loo() and loo_compare() functions in the loo R package (Vehtari et al., 2017, 2022). The results are summarized below:

Model              elppm_diff  SE_diff
Group-level           0.0        0.0
Individual-level     -4.3        2.5
Null model          -14.7        4.7

Interpreting the elppm Differences

The group-level model exhibited the highest estimated elppm, suggesting that incorporating temperature as a fixed effect improves predictions. The comparison with the null model yielded an elppm_diff of 14.7 in favor of the group-level model (SE = 4.7). While this difference appears substantial, exceeding twice its standard error, relying solely on such a threshold can be misleading (Sivula et al., 2020; Vehtari et al., 2021). A more robust interpretation also requires considering the dataset size (n - please provide this value) and the PSIS diagnostic, the Pareto shape parameter k (Vehtari et al., 2017).

Assuming the dataset is reasonably large (e.g., n > 100) and the Pareto k values for all models (particularly the group-level and null models) are below 0.7, indicating reliable PSIS-LOO estimates, the evidence supports the group-level model. A large elppm_diff combined with a relatively small standard error and satisfactory k diagnostics suggests a genuine improvement in predictive performance over the null model.

The individual-level model, while also outperforming the null model, does not show sufficient improvement over the group-level model to justify its added complexity. The smaller elppm_diff and larger standard error relative to the group-level model comparison suggest that allowing temperature effects to vary by judge does not substantially improve predictive accuracy in this case.

Sounds good.

This looks excellent!

I would change the model rating ~ (temp|judge) to rating ~ temp + (temp|judge). Without the population-level effect, the prior on the varying effects shrinks them towards zero but has to be wide to allow for the temp effects, and then the individual-level effect estimates get more noisy. Adding the population-level temp effect makes the model hierarchical, and the prior for the varying temp effects then shrinks them towards the population effect.
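A minimal sketch of that specification, reusing the settings from the original code above:

indlevel2 <- brm(rating ~ temp + (temp | judge), data = wine,
                 family = cumulative(threshold = "flexible"),
                 chains = 2, cores = 2, iter = 2000)
loo_compare(loo(grouplevel), loo(indlevel2))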