Hi Stan forum,
An interesting paper was published recently in Science Advances detailing how LOO-CV can introduce a (typically small) negative bias into model performance estimates. The authors propose a “rebalanced LOO-CV” that keeps the overall mean of the training data constant (or as close to constant as possible) across folds – image of their figure inset below. I’ve tried this out on my own data sets and have seen the same phenomenon when comparing results from stratified k-fold CV, LOO-CV, and their rebalanced LOO-CV method.
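To make the balancing idea concrete, here is a hedged sketch for binary labels. It is not the paper's exact rebalancing algorithm, just an illustration of the general principle: plain LOO shifts the class balance of every training fold away from the full-data mean, whereas a balanced scheme (here the classic leave-two-out variant, which additionally drops one observation of the opposite class) keeps the training-fold mean fixed.

```python
import numpy as np

# Hedged sketch, not the paper's exact "rebalanced LOO-CV" algorithm:
# compare training-fold class means under plain LOO vs a leave-two-out
# balanced scheme on a perfectly balanced binary data set.

y = np.array([0] * 10 + [1] * 10)  # toy labels, 50/50 classes
full_mean = y.mean()               # 0.5

loo_means, balanced_means = [], []
for i in range(len(y)):
    train = np.delete(y, i)
    loo_means.append(train.mean())  # drifts away from 0.5 each fold
    # balanced variant: also drop one observation of the opposite class
    j = np.flatnonzero(y == 1 - y[i])[0]
    balanced = np.delete(y, [i, j])
    balanced_means.append(balanced.mean())  # stays at exactly 0.5

print(full_mean, min(loo_means), max(loo_means))  # LOO folds shift off 0.5
print(set(balanced_means))                        # {0.5}
```

Holding out a 0 leaves 9 zeros and 10 ones (training mean 10/19 ≈ 0.526), and holding out a 1 gives 9/19 ≈ 0.474, so the plain LOO training folds are systematically tilted against the held-out class; the balanced folds are not.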
What implications (if any) might this have for performance estimates with loo? Would be interested to hear your thoughts, @avehtari, @jonah
I have seen this paper before. This has been a known issue for a long time. The bias is stronger when making binary decisions on classes; in extreme cases every LOO-predicted class can be wrong (this has also been illustrated before). The bias is smaller when using a smooth utility (cost) function such as the log score (see, e.g., Arlot and Celisse, 2010).

LOO-CV log score estimates, elpd_loo, for two models with the same data are likely to have similar biases, so the bias in elpd_diff is even smaller. I have not run extensive simulations, but based on my experience the potential bias in elpd_diff is swamped by the uncertainty in the comparison.

If instead of a smooth, continuous utility/cost per observation you use a binary utility/cost and then summarise by, e.g., classification accuracy or auROC, the bias can be bigger and you may benefit from balancing. Traditionally the balancing has been done with leave-two-out cross-validation, but the approach proposed in this paper is fine, too.
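The "every LOO-predicted class can be wrong" point has a well-known minimal demonstration: a perfectly balanced binary data set scored with a majority-class predictor. Removing one observation always tips the training-fold majority toward the opposite class of the held-out point, so LOO 0/1 accuracy is 0% even though the predictor's true accuracy is 50%. A small sketch (my own toy example, not from the paper):

```python
import numpy as np

# Classic illustration of LOO bias under a binary (0/1) utility:
# balanced labels + a majority-class predictor. True accuracy is 50%,
# but LOO estimates 0%, because leaving out one point always makes the
# training majority the opposite of the held-out point's class.

y = np.array([0] * 10 + [1] * 10)  # perfectly balanced labels

loo_correct = []
for i in range(len(y)):
    train = np.delete(y, i)
    pred = int(train.mean() > 0.5)  # majority class of the training fold
    loo_correct.append(pred == y[i])

loo_accuracy = float(np.mean(loo_correct))
print(loo_accuracy)  # 0.0: every LOO prediction is wrong
```

A smooth score like the log score does not collapse this way, which is the reason elpd_loo is much less affected than classification accuracy.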