LOOCV with OLS

My sample is really small: n=44. To run LOOCV, I use 43 observations to train and the single left-out observation to test. I repeat this procedure 44 times so that each observation gets a chance to be used as test data.
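
In code, the procedure looks something like this (with made-up data standing in for my real sample):

```python
import numpy as np

def loocv_mse(X, y):
    """Leave-one-out CV for OLS: fit on n-1 observations, predict the one left out."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i          # everything except observation i
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors[i] = y[i] - X[i] @ beta     # prediction error on the held-out point
    return np.mean(errors ** 2)            # LOOCV estimate of out-of-sample MSE

# Made-up data standing in for my real sample of n = 44
rng = np.random.default_rng(1)
n = 44
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)
print(loocv_mse(X, y))
```

(For OLS specifically, the same quantity can also be computed in closed form from a single fit via the hat matrix, i.e. the PRESS statistic, so the explicit loop is only there to mirror the description.)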

I don’t remember where I read that LOOCV is known to be not that (theoretically) desirable for OLS. I guess in practice it would be okay. So, with OLS, is it a bad idea to use LOOCV? If so, why?

For those not down with the acronyms, LOOCV is leave-one-out cross-validation and OLS is ordinary least squares.

Stan doesn’t do OLS. If you run Stan’s optimization, it just optimizes the log density—it doesn’t use a least squares algorithm.

I don’t work with OLS, so I wouldn’t have come across this warning. But as far as I know, there’s nothing special about OLS as an approximate inference technique that would preclude using LOOCV. The usual dangers of cross-validation still apply: it tends to underestimate performance (each fold trains on 43 data items rather than 44 and is evaluated only on the one item it didn’t train on, unlike, say, a bootstrap estimate, which samples with replacement), and there’s a danger of overfitting if you run it too many times.

Of course, if you fit a maximum likelihood estimate, you shouldn’t expect the estimates themselves to be well calibrated—you need to use Bayesian posteriors for that (assuming the data comes from the generative process you outlined).


Dear Bob Carpenter,

Thanks for your response.

It’s always useful to consider the minimum adequate sample size, even if it’s a frequentist approximation to a Bayesian procedure. The sample size needed to estimate the residual variance in a linear model with a reasonable margin of error is n=70. Fitting a model with n=43 is likely to be an unstable process with wide uncertainty intervals.


Dear Frank Harrell,

We can think of a single sample and an estimate of its mean as the simplest of all models. The sample SD is in fact the SD of the residuals (from the mean). Of COURSE, with more data, we get better estimates (of anything). But I don’t understand the cutoff/rule of n=70.
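
To make that concrete, here’s a two-line check in Python (the simulated numbers are arbitrary):

```python
import numpy as np

# Intercept-only "model": the fitted value is just the sample mean,
# so the residual SD equals the sample SD.
y = np.random.default_rng(2).normal(loc=10, scale=3, size=44)
resid = y - y.mean()
print(np.sqrt(resid @ resid / (len(y) - 1)))  # SD of residuals, df = n - 1
print(np.std(y, ddof=1))                      # sample SD: the same number
```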

Best,
Sacha

See Regression Modeling Strategies, Chapter 4 (Multivariable Modeling Strategies), and Biostatistics for Biomedical Research, Chapter 5 (Statistical Inference).

The Richard Riley et al. references are key. You can compute the sample size needed just to estimate the intercept, with no predictors whatsoever. I doubt that n=44 is adequate even for that.

Dear Frank Harrell,

Thank you very much for the references.

You set a specific goal, which led to 70.
Specifically, you said: suppose you want a 95% confidence interval for sigma with bounds on the order of +/- 20% (your multiplicative 1.2); then n = 70 (or so) would be required. Sure, I’ll buy that. But that’s not the same as saying you need n = 70 to estimate sigma. For instance, if you had asked for +/- 30%, the requirement would be much smaller than 70; insist on +/- 10%, and it would go up.
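
Just to make the dependence on the target margin explicit, here’s a quick sketch in Python of one way to formalize it, using the chi-square interval for sigma (the particular sample sizes in the loop are only illustrative):

```python
import numpy as np
from scipy import stats

def sigma_mmoe(n, conf=0.95):
    """Multiplicative margin of error for sigma: the factor by which the upper
    limit of the conf CI for sigma exceeds the point estimate s, based on
    (n - 1) * s^2 / sigma^2 ~ chi^2(n - 1)."""
    df = n - 1
    lower_quantile = stats.chi2.ppf((1 - conf) / 2, df)
    return np.sqrt(df / lower_quantile)

for n in [20, 44, 70, 300]:
    print(n, round(sigma_mmoe(n), 3))
# n = 70 gives a margin of about 1.20; smaller n widens it, larger n shrinks it
```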

Best,
Sacha

Replace “to estimate sigma” with “to adequately estimate sigma”, although I could see using a multiplicative margin of error as high as 1.3.

If the Bayesian model is correct (a big if!), we’ll get calibrated inference (in expectation!) for any amount of data. We just turn the crank and get as much information as we can out of the combination of data and prior. In Bayes it’s always important to consider the prior plus the data, since our posterior combines both, mediated by the choice of likelihood; we can’t just look at the number of data points as in the classical setting.

Often, for posterior predictive inference, we don’t need tight estimates of parameters (or residuals, etc.).

Thanks, Bob. The great thing about Bayes (if things converge) is getting exact inference for all sample sizes. Sample size then becomes a question of how much you can learn from the data, and of how accurate out-of-sample prediction will be when there may be different distributions/collinearities of X. Maybe the simplest way to talk about it is to look at the posterior distribution of R^2. That distribution will be accurate, but so wide when N is small and p is large that we have learned very little. And if the sample size is too small to get a narrow posterior distribution for the unknown mean in the simplest possible model (intercept only), it’s pretty much impossible for a model with covariates to yield knowledge sharp enough to be useful.
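
Here’s a minimal sketch of that diagnostic in Python, assuming a conjugate normal linear model with a flat prior and the Gelman et al. (2019) definition of Bayesian R^2, Var(fit) / (Var(fit) + sigma^2); the simulated data, dimensions, and seed are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small simulated data set: N = 44 observations, a few predictors
n = 44
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([1.0, 0.5, -0.3, 0.2])
y = X @ beta_true + rng.normal(scale=2.0, size=n)

# Conjugate posterior under a flat prior on (beta, log sigma):
#   sigma^2 | y        ~  s^2 * df / chi^2(df),  df = n - k
#   beta | sigma^2, y  ~  N(beta_hat, sigma^2 * (X'X)^{-1})
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
df = n - k
s2 = np.sum((y - X @ beta_hat) ** 2) / df

draws = 4000
sigma2 = s2 * df / rng.chisquare(df, size=draws)
chol = np.linalg.cholesky(XtX_inv)
r2 = np.empty(draws)
for i in range(draws):
    beta = beta_hat + np.sqrt(sigma2[i]) * (chol @ rng.normal(size=k))
    fit = X @ beta
    # Bayesian R^2 per draw: Var(fit) / (Var(fit) + sigma^2)
    r2[i] = fit.var() / (fit.var() + sigma2[i])

print("posterior median R^2:", np.median(r2))
print("95% posterior interval:", np.percentile(r2, [2.5, 97.5]))
```

With N this small, the 95% posterior interval for R^2 comes out wide, which is the point: the inference is honest about how little has been learned.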