I am trying to understand why brms/Stan uses LOOCV to compare nested Bayesian hierarchical models. A simple example: suppose you calculate LOOCV for the following two models:

m1: outcome ~ predictor + (1|subject)

m2: outcome ~ (1 | subject)

Then, if there is very little within-subject variation, each “left out datapoint” will be easily predictable using the information from the other datapoints of that subject. E.g. suppose the data are

Subj | Predictor | Outcomes

S1 | 0 | 10.5, 10.6, 10.3, 10.1, 10.9

S2 | 0 | 21.2, 21.4, 21.5, 21.9, 21.3

S3 | 0 | 10.2, 10.3, 10.5, 10.6, 10.9

S4 | 0 | 12.3, 12.5, 12.9, 13.0, 13.1

S5 | 0 | 22.1, 22.1, 22.1, 23.1, 23.3

S6 | 0 | 14.2, 14.4, 14.5, 14.9, 14.3

S7 | 0 | 10.5, 20.6, 20.3, 20.1, 20.9

S8 | 0 | 21.2, 21.4, 21.5, 21.9, 21.3

S9 | 0 | 20.2, 20.3, 20.5, 20.6, 20.9

S10 | 0 | 22.3, 22.5, 22.9, 23.0, 23.1

S11 | 1 | 90.5, 90.6, 90.3, 90.1, 90.9

S12 | 1 | 91.2, 91.4, 91.5, 91.9, 91.3

S13 | 1 | 90.2, 90.3, 90.5, 90.6, 90.9

S14 | 1 | 92.3, 92.5, 92.9, 93.0, 93.1

S15 | 1 | 92.1, 92.1, 92.1, 93.1, 93.3

S16 | 1 | 94.2, 94.4, 94.5, 94.9, 94.3

S17 | 1 | 90.5, 90.6, 90.3, 90.1, 90.9

S18 | 1 | 91.2, 91.4, 91.5, 91.9, 91.3

S19 | 1 | 90.2, 90.3, 90.5, 90.6, 90.9

S20 | 1 | 92.3, 92.5, 92.9, 93.0, 93.1

In this data:

All scores for predictor=0 are between 10 and 23.9

All scores for predictor=1 are between 90 and 95.

=> the predictor plays an important role here

In addition, within each subject there is relatively little variation.

What I don’t understand is this: if you run LOOCV on m2, the prediction for each left-out observation can use the remaining observations from the same subject (cluster). E.g. if the S3 outcome 10.5 is left out, the other four S3 outcomes (10.2, 10.3, 10.6, 10.9) are still available to estimate that subject’s intercept, so the model can predict the left-out value very accurately. The same goes for all other observations.
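To make the concern concrete, here is a rough sketch in Python rather than brms. It is not what brms or Stan compute: there is no model fit, no partial pooling, and no log predictive density. It only predicts each S3 value from the mean of that subject's other four values. The function name `loo_within_subject` is made up for illustration.

```python
# Outcomes for subject S3, copied from the table above.
s3 = [10.2, 10.3, 10.5, 10.6, 10.9]


def loo_within_subject(values):
    """Predict each value by the mean of the subject's remaining values.

    Hypothetical helper: a plug-in stand-in for what a random intercept
    can learn from the other observations of the same subject.
    """
    preds = []
    for i in range(len(values)):
        others = values[:i] + values[i + 1:]
        preds.append(sum(others) / len(others))
    return preds


preds = loo_within_subject(s3)
errors = [abs(p - v) for p, v in zip(preds, s3)]

for value, pred, err in zip(s3, preds, errors):
    print(f"left out {value:5.1f}  predicted {pred:6.3f}  abs error {err:5.3f}")
```

Leaving out 10.5 gives a prediction of about 10.5, and the largest absolute error over the five points is about 0.5 (when 10.9 is left out), all without ever looking at the predictor.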

Now m1 has the additional predictor variable, but in the example above this predictor doesn’t provide much additional information for predicting a left-out observation once the subject’s other observations are known. There is more variation between subjects at each value of the predictor than within each subject.

So presumably LOOCV will report little additional benefit of m1 over m2 when predicting new data. But that seems obviously wrong, given how cleanly the predictor separates the outcomes.

Obviously this is a contrived example, but I don’t get how the LOOCV procedure can say anything meaningful about comparisons between hierarchical models, especially nested ones like these.

I’m sorry if this question is missing something quite fundamental, but I haven’t found an answer.

Does LOOCV do anything to take this into account (e.g. “leave-one-cluster-out CV”), or provide a diagnostic about this?
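To show what I mean by “leave-one-cluster-out”, here is a minimal Python sketch using the data above. It is only an illustration under loud assumptions: entries such as “21,3” are read as 21.3, the two “models” are plain means rather than fitted Bayesian models, and the helper `rmse_leave_one_subject_out` is a made-up name. Each subject is held out entirely and predicted either from the grand mean of the other subjects (a stand-in for m2) or from the mean of the other subjects with the same predictor value (a stand-in for m1).

```python
import numpy as np

# Outcomes copied from the table; "21,3"-style entries read as 21.3 (assumption).
low = [  # predictor = 0, subjects S1..S10
    [10.5, 10.6, 10.3, 10.1, 10.9], [21.2, 21.4, 21.5, 21.9, 21.3],
    [10.2, 10.3, 10.5, 10.6, 10.9], [12.3, 12.5, 12.9, 13.0, 13.1],
    [22.1, 22.1, 22.1, 23.1, 23.3], [14.2, 14.4, 14.5, 14.9, 14.3],
    [10.5, 20.6, 20.3, 20.1, 20.9], [21.2, 21.4, 21.5, 21.9, 21.3],
    [20.2, 20.3, 20.5, 20.6, 20.9], [22.3, 22.5, 22.9, 23.0, 23.1],
]
high = [  # predictor = 1, subjects S11..S20
    [90.5, 90.6, 90.3, 90.1, 90.9], [91.2, 91.4, 91.5, 91.9, 91.3],
    [90.2, 90.3, 90.5, 90.6, 90.9], [92.3, 92.5, 92.9, 93.0, 93.1],
    [92.1, 92.1, 92.1, 93.1, 93.3], [94.2, 94.4, 94.5, 94.9, 94.3],
    [90.5, 90.6, 90.3, 90.1, 90.9], [91.2, 91.4, 91.5, 91.9, 91.3],
    [90.2, 90.3, 90.5, 90.6, 90.9], [92.3, 92.5, 92.9, 93.0, 93.1],
]
y = np.array(low + high)            # shape: (20 subjects, 5 observations)
x = np.array([0] * 10 + [1] * 10)   # subject-level predictor


def rmse_leave_one_subject_out(use_predictor):
    """Hold out every observation of one subject at a time (hypothetical helper)."""
    sq_errors = []
    for s in range(len(y)):
        keep = np.arange(len(y)) != s      # drop the whole held-out subject
        if use_predictor:
            keep &= (x == x[s])            # only subjects with the same predictor value
        pred = y[keep].mean()              # plug-in prediction for the new subject
        sq_errors.extend((y[s] - pred) ** 2)
    return float(np.sqrt(np.mean(sq_errors)))


rmse_without = rmse_leave_one_subject_out(use_predictor=False)
rmse_with = rmse_leave_one_subject_out(use_predictor=True)
print(f"RMSE without predictor: {rmse_without:.2f}")
print(f"RMSE with predictor:    {rmse_with:.2f}")
```

When whole subjects are held out, the error with the predictor is several times smaller than without it, which is the benefit I would expect a comparison of m1 and m2 to pick up.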

Thanks