EDIT: I added three new sections (“Why use cross-validation?”, “How is cross-validation related to overfitting?”, “How to use cross-validation for model selection?”) and the FAQ went over the Discourse 32000-character limit. The FAQ is now only at https://avehtari.github.io/modelselection/CV-FAQ.html. The list of contents is below.
I’ve made a quick (12+ hours) draft of the cross-validation FAQ I’ve been planning to write. Due to recent feedback, I’m posting an early draft. It will eventually be linked, e.g., from the loo package documentation. This FAQ does not try to cover all possible cross-validation questions; it reflects the questions commonly asked in this forum or by users of the loo package. The provided references, and the references therein, have many more details. Please comment or ask more in this thread.
Where to start?
What are all the acronyms and parts of cross-validation?
Why use cross-validation?
How is cross-validation related to overfitting?
How to use cross-validation for model selection?
When is cross-validation valid?
Can cross-validation be used for hierarchical / multilevel models?
Can cross-validation be used for time series?
Can other utilities or costs be used than log predictive density?
What is the interpretation of ELPD / elpd_loo?
Can cross-validation be used to compare different observation models / response distributions / likelihoods?
Is it a problem to mix discrete and continuous data types?
Why \sqrt{n} in Standard error (SE) of LOO?
What to do if I have many high Pareto k's?
Can I use PSIS-LOO if I have more parameters than observations?
What is the interpretation of p_loo?
What are the limitations of the cross-validation?
How are LOO and WAIC related?
How are LOOIC and elpd_loo related? Why LOOIC is -2*elpd_loo?
This will be a great resource! I think it would be great if you elaborated even more on this:
Because for us non-statisticians at least, it can be very useful to get the conceptual distinctions that may be obvious to others spelled out very clearly.
Can you help me? I had assumed my answers and references would answer these. If you could try to explain these in your words or ask more specific clarifications I could maybe see what concepts need more explanation and in what way. Without additional discussion I feel I would be just repeating what I have already written.
I’ll gladly try! I didn’t mean to imply that your answers and references do not address these issues. I just thought that this was a very helpful set of conceptual distinctions to get even before the useful and comprehensive list of abbreviations (which does contain much of the elaboration, but also gives a lot of detail at once). Not that this should be a comprehensive conceptual introduction, but a little of that is probably helpful to a lot of us. Sort of a prophylactic treatment of confusion. Perhaps the FAQ could start with something like this (I’ve put my comments in italics, and a lot of sentences can probably be improved):
Some important distinctions
Differences between cross-validation methods can be separated into four categories. In order to think clearly about the different forms of cross-validation it’s important to distinguish between:
The way data is divided in cross-validation
Different partitions of data are held out in different kinds of cross-validation: a single observation in leave-one-out cross-validation (LOO), all observations of one group in leave-one-group-out (LOGO). Which unit is systematically left out determines the predictive task that cross-validation assesses model performance on. (Is that actually accurate for all forms of CV? Maybe something about how in the case of K-fold CV this concerns whether groups of observations are placed within the same fold?)
The utility or cost (Here is a topic where my statistical/mathematical training is inadequate to suggest anything. It’s reasonable to assume that a lot of users are more familiar than me with these concepts, but it might still be good to point out how different forms of cross-validation can assess model performance on different metrics - and perhaps connect that to different purposes of modelling?)
The computational method used to compute leave-one-out predictive distributions
The choice of partitions to leave out or metric of model performance is independent of the computational method (e.g. PSIS or K-fold-CV). It’s easy to confuse the most common application of a computational method with the computational method in itself.
etc.
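To make the distinction between the partition and the computational method concrete, here is a minimal sketch in Python of how the held-out sets differ between LOO, LOGO, and a K-fold split that keeps groups intact. The function names (`loo_folds`, `logo_folds`, `grouped_kfold`) are made up for illustration and are not part of the loo package; any of these partitions could then be evaluated by exact refitting or by an approximation such as PSIS.

```python
import numpy as np

def loo_folds(n):
    """Leave-one-out: each fold holds out exactly one observation."""
    return [np.array([i]) for i in range(n)]

def logo_folds(groups):
    """Leave-one-group-out: each fold holds out all observations of one group."""
    groups = np.asarray(groups)
    return [np.flatnonzero(groups == g) for g in np.unique(groups)]

def grouped_kfold(groups, k, seed=0):
    """K-fold where whole groups are assigned to folds, so no group is split."""
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(groups))
    return [np.flatnonzero(np.isin(groups, chunk))
            for chunk in np.array_split(shuffled, k)]

# Hypothetical data: six observations belonging to three groups.
groups = ["a", "a", "b", "b", "b", "c"]
print([f.tolist() for f in loo_folds(len(groups))])   # one index per fold
print([f.tolist() for f in logo_folds(groups)])       # one group per fold
print([f.tolist() for f in grouped_kfold(groups, k=2)])
```

The held-out indices define the predictive task (new observation versus new group); how the leave-out predictive densities are then computed is a separate choice.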
Or maybe the What are all the acronyms and parts of cross-validation? section could be organised even more explicitly around these headings.
I don’t know how helpful this was - I’m trying to think and write this with three children hanging around after school closed here yesterday due to the virus situation. Is my suggestion more comprehensible, at least?
Thanks @erognli. Any discussion is helpful as it is likely to make me think new thoughts. Thanks also for explicit suggestions, I’ll think about these.
I just wrote new parts “Why use cross-validation?”, “How is cross-validation related to overfitting?”, “How to use cross-validation for model selection?”, but I couldn’t add them as discourse has 32000 character limitation for a post! I need to take something away or get that github page to work.
Read through the updated version, trying to imagine my own understanding when I started trying to use cross-validation. This FAQ really increases accessibility of the method a lot, I think. I also believe it makes it easier to read the referenced papers, because you have some basic understanding to start off with. I really liked the new parts you added, and think the section 2 update also works very well.
Again, a great resource! Thanks for making it available.
edit: (And thanks for the generous acknowledgement - not necessary, but very kind.)
You mention overfit due to the selection process when comparing many models. Are there formal results I can read about this to understand the relationship between overfitting and the pool of models to select from?
Is there an easy way to calculate the Jacobian for a semi-arbitrary transformation using existing autodiff libraries?
Are there ever circumstances when you would compare models using MSE rather than the likelihood?
Is there an approachable proof showing that, say, RMSE and log-likelihood are proper scoring rules? I’d like to have a better understanding of what a proper scoring rule is beyond having read the definition.
In practice, how much weaker is an exchangeability assumption than an independence assumption? I’m revisiting chapter 5 of BDA3 but don’t really have any intuition for this.
There is no generic result, because there are so many things that can have an effect. It’s possible to have formal results for simplified cases, e.g. limited to variable selection for a linear model with independent predictors and everything assumed to be Gaussian, but I don’t remember seeing them specifically for the Bayesian case. For Lasso you can probably find with a web search one paper (by the younger Tibshirani, if I remember correctly), but that is also for independent covariates. If you would like more guidance for writing a paper on this, we can talk.
Hey, that’s not a cross-validation question! I don’t know the answer, and to get one it might be better to ask in a new thread.
Yes, if the application expert could convince me that decisions are based on point estimates and the loss function really is squared error.
It’s difficult to know what is approachable, and I learned these probably from Bernardo & Smith (1994). For more modern material, based on a web search these are very popular
Gneiting & Raftery is great as it has many more scoring rules. I’d be happy to know if you find these approachable, and if not I can help to search for something more approachable.
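As a small numerical illustration of what "proper" means (not a proof, and not taken from the references above): for a binary outcome with true event probability 0.3, the expected log score is maximized by reporting 0.3, while the expected "linear" score (the probability assigned to whatever happened) is maximized by an overconfident report. The value 0.3 and the grid of candidate reports are arbitrary choices for this sketch.

```python
import numpy as np

p_true = 0.3                      # assumed true probability of the event
q = np.linspace(0.01, 0.99, 99)   # candidate reported probabilities

# Expected score under the true distribution, as a function of the report q.
expected_log = p_true * np.log(q) + (1 - p_true) * np.log(1 - q)
expected_linear = p_true * q + (1 - p_true) * (1 - q)

best_log = q[np.argmax(expected_log)]        # report that maximizes the log score
best_linear = q[np.argmax(expected_linear)]  # report that maximizes the linear score
print(f"log score is maximized at q = {best_log:.2f}")
print(f"linear score is maximized at q = {best_linear:.2f}")
```

A scoring rule is proper when reporting the true distribution maximizes the expected score; here the log score passes that check on the grid and the linear score does not.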
Hi again,
I think I have two independent questions:
I don’t really understand what you mean by posterior dependencies. You wrote “Posterior dependencies exist without samples and can be non-linear with zero correlation.” which still doesn’t tell me much. Can you point me to reading material and some examples?
I also want to know about nested models or null hypothesis testing (you mentioned on Twitter that it’s not the same). What I mean is that in my experience when I compare m1 with m2:
m1: Y ~ Normal(\alpha + \beta X, \sigma)
m2: Y ~ Normal(\alpha , \sigma)
even if the posterior distribution shows a clear effect of beta, psis-loo or kfold won’t in many cases show a clear advantage for m1. Wang, W. and Gelman, A. (2014) also show that, if I remember correctly, but I was wondering if there is a formal explanation for why, and in which cases CV will work in a comparison between m1 and m2.
I see that Shao 1993 kind of addresses my question, but they say “As expected, the CV(1) tends to select unnecessarily large models”. But I was wondering about the opposite situations as described in Wang & Gelman 2014, where the largest model (m1) should be the “correct” one, but CV doesn’t show much (or no) advantage in predictive accuracy against m2.
And I was still wondering what is the 2b that you mention in twitter.
Although that answer is only partial and we’ll soon have a new paper with more complete answer.
In twitter @bnicenboim asked: “And what about nested models? (Null hypothesis testing)”, and I answered " 2. Nested models are not the same as null hypothesis testing. 2a) see Shao (1993) and we soon have a new paper. 2b) The answer doesn’t fit in a tweet. Ask again in that discourse thread."
We can do a lot of useful things with nested models without any null hypothesis testing. I guess there are different definitions for null hypothesis, but I go with Wikipedia: “the null hypothesis is a general statement or default position that there is no relationship between two measured phenomena or no association among groups.”. You can add your preferred definition and we can see if I need to change my answer.
I would not use cross-validation to test whether “there is no relationship between two measured phenomena”. For example, cross-validation can tell whether the current data and an additional predictor in a nested model can predict better, but if cross-validation indicates that there is no improvement in the predictive accuracy, that is not enough to confirm that “there is no relationship” between the added predictor and the outcome. Often we assume that all predictors have some relevance, and the decision to use fewer than all predictors is a decision task including a cost for predictors (e.g. cost of future measurement or cost of explaining the more complex model).
Of course, it is possible to evaluate how cross-validation would perform if it were used for null hypothesis testing in simulated data with known zero relationships. Shao (1993) showed that selecting the model with the best cross-validation predictive performance estimate is not consistent (i.e., it does not select the true model with probability going to 1 when n goes to infinity). For predictive performance this is irrelevant, as when n goes to infinity all models with extra irrelevant predictors have the same predictive performance as the true model. In the finite case, we can flip the example: instead of zero relevances, let the relevances be epsilon, where epsilon is an arbitrarily small positive value. Now all predictors are relevant and the null hypothesis is false, but with finite data we can’t distinguish them from zero relevance.
After this answer, can you elaborate your question, ie, what you would like to know about cross-validation, nested models and null hypothesis testing?
Thanks for the detailed answer! No need to change your answer because of the null hypothesis definition.
My point is very specific, just about the opposite situation of Shao (1993), that is, when there is a known relationship with small \beta\neq 0, but \beta is much smaller than \sigma:
m1: Y ~ Normal(\alpha + \beta X, \sigma)
m2: Y ~ Normal(\alpha , \sigma)
Many times, I find that there is no advantage for the more complex model (m1) in terms of elpd. (My simulations show that, and it’s also related to https://avehtari.github.io/modelselection/betablockers.html).
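For readers who want to try this kind of comparison themselves, here is a minimal simulation sketch. It is not the simulation referred to above: it uses exact leave-one-out refits with least-squares plug-in estimates instead of a Bayesian posterior and PSIS, and the values n = 100, \beta = 0.1, \sigma = 1 and the helper name `loo_log_density` are assumptions made for illustration.

```python
import numpy as np

def loo_log_density(X, y):
    """Exact LOO log predictive densities for a Gaussian linear model.

    Plug-in sketch: for each i, fit least squares without observation i and
    evaluate Normal(y_i | x_i @ coef, sigma_hat). Posterior uncertainty is
    ignored, so this only approximates a Bayesian elpd_loo.
    """
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid = y[keep] - X[keep] @ coef
        sigma_hat = np.sqrt(resid @ resid / (keep.sum() - X.shape[1]))
        z = (y[i] - X[i] @ coef) / sigma_hat
        out[i] = -0.5 * np.log(2 * np.pi) - np.log(sigma_hat) - 0.5 * z**2
    return out

rng = np.random.default_rng(1)
n, alpha, beta, sigma = 100, 0.0, 0.1, 1.0   # beta much smaller than sigma
x = rng.normal(size=n)
y = alpha + beta * x + rng.normal(scale=sigma, size=n)

X1 = np.column_stack([np.ones(n), x])   # m1: intercept and slope
X2 = np.ones((n, 1))                    # m2: intercept only

diff = loo_log_density(X1, y) - loo_log_density(X2, y)   # pointwise differences
elpd_diff = diff.sum()
se_diff = np.sqrt(n * diff.var(ddof=1))                  # SE of the summed difference
print(f"elpd_diff (m1 - m2) = {elpd_diff:.2f}, SE = {se_diff:.2f}")
```

Changing the seed, n, or beta shows how the estimated difference compares with its standard error under different settings.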
I guess the answer is related to this:
"Cross-validation is less useful for simple models with no posterior dependencies and assuming that simple model is not mis-specified. In that case the marginal posterior is less variable as it includes the modeling assumptions (which assume to be not mis-specified) while cross-validation uses non-model based approximation of the future data distribution which increases the variability. "
I’m just trying to understand the “non-model based approximation of the future data distribution”. Don’t we use the likelihood at the out-of-sample data with the posterior based on N-1 datapoints? Why is it a non-model based approximation?
Each LOO predictive density p(\tilde{y}_i | y) is based on the model, but in LOO p_t(\tilde{y}) is not based on that same model or any other model which includes similar background information as our predictive model. In LOO p_t(\tilde{y}_i) is approximated with pseudo Monte Carlo draws from the future data distribution by re-using the observations (and using the leave-out part to reduce the effect of the double use). We can say that this is then non-model based, or if you insist we can say that the model is, for example, a Dirichlet model, which however doesn’t include the usual additional smoothness assumptions, for example, a normal distribution in regression etc. One of the weaknesses of cross-validation is the high variance in the approximation of the integral over the unknown p_t(\tilde{y}_i).
So why don’t we use a model for p_t(\tilde{y}_i) to reduce the variance? Actually, this is done sometimes. If we use the same model, the approach is called self-predictive, and it works only if the model is very good and there is not much posterior uncertainty. If we use some very good model to model p_t(\tilde{y}_i) when evaluating simpler models, the approach is called the reference predictive approach, and projpred implements one approach of that kind. See more in A survey of Bayesian predictive methods for model assessment, selection and comparison.
If we have a simple model with no posterior dependencies, we can analyse p(\beta | y) directly, and as it is model based there is less variance. The model filters the noise from y.
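Written out, using the standard definitions from the PSIS-LOO literature (with y_{-i} denoting the data without the i-th observation, notation not used elsewhere in this thread), the quantity of interest and its LOO estimate are:

```latex
\mathrm{elpd} = \sum_{i=1}^{n} \int p_t(\tilde{y}_i) \, \log p(\tilde{y}_i \mid y) \, d\tilde{y}_i ,
\qquad
\mathrm{elpd}_{\mathrm{loo}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}) .
```

Each integral over the unknown p_t(\tilde{y}_i) is replaced by a single evaluation at the observed y_i, which acts as the one available draw from the future data distribution; that single-draw replacement is where the high variance comes from.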
I had missed one of your two posts and noticed only the latter mentioning Shao. I think I answered in the later discussion your second question, but noticed now that you had also asked this. What I wrote has two parts
The posterior exists before we use a sampling algorithm, so we don’t need to define dependencies via samples as you did in your Twitter question. Also, we may do the posterior inference with algorithms other than sampling-based ones.
Ok, it took me a while, but I went through your papers and I finally understood this :)
Is this weakness stated explicitly in one of your papers (or in some other paper)? I want to give you proper credit. The paper about the predictive approach has a section
“2.4 Why not to use cross-validation for selecting the feature combination?”
But this is not mentioned, and I don’t think this is explicit in A survey of Bayesian predictive methods for model assessment, selection and comparison…
Page 183 of that paper, paragraph “The effect of the data realization” has a sentence “M-open methods have generally higher variance than M-completed and M-closed methods due to their high variance in sample re-use (section 4.5).”