Number of Chains Necessary when Performing K-fold Cross Validation

ajnafa · February 12, 2022, 9:23pm

Say you have a multilevel model that you ran for six chains with 2,000 warmup and 2,000 sampling and you’ve verified that the general suite of convergence diagnostics doesn’t indicate any clear problems that would suggest non-convergence.

Having verified that multiple chains arrive at approximately the same stationary answer for the initial fit, is it still necessary to use six chains for each of the k folds or would it be acceptable (or at least not wrong) to fit each of the folds with only a single chain, particularly in cases where doing so yields a much more computationally tractable solution due to more efficient parallelization (think hours versus days)?

I’ve been trying to find an answer to this question for a bit to no avail and there seems to be quite a bit of conflicting information on the subject. Any input is appreciated.

jsocolar · February 12, 2022, 11:28pm

@avehtari will have a more definitive answer, but in the meantime…

We need the cross-validation posteriors to be correct, so the question boils down to this: when and how might fitting fail for a fold when it works for the full dataset? There are at least two potential failure modes:

1. The posterior geometry conditioning on the fold could be substantially nastier than the posterior geometry conditioning on all the data.
2. Fitting for a fold could fail stochastically. This could happen due to bad warmup, or due to a chain finding a minor mode and getting stuck there. The more folds you fit, the higher the probability that you’ll get unlucky in this way at least once, and if it happens even once (and you’re unable to diagnose it) it could mess up your inference substantially.
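To put rough numbers on the second failure mode, here is a minimal sketch of how the chance of at least one bad fit grows with the number of fits. It assumes failures are independent and share a single per-fit probability; both the independence assumption and the value of `p_fail` are hypothetical and are not estimates for any real model.

```python
def prob_at_least_one_failure(p_fail: float, n_fits: int) -> float:
    """Probability that at least one of n_fits independent fits fails.

    p_fail is a hypothetical per-fit failure probability (an assumption,
    not something Stan or brms reports).
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if n_fits < 0:
        raise ValueError("n_fits must be non-negative")
    # Complement of "every single fit succeeds".
    return 1.0 - (1.0 - p_fail) ** n_fits


# Illustration with a made-up 1% failure rate per fit.
for n_fits in (5, 10, 30):
    print(n_fits, prob_at_least_one_failure(0.01, n_fits))
```

The same formula applies whether a "fit" is counted as a whole fold or as a single chain within a fold; only `n_fits` changes.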

For certain classes of model, it might be reasonable to confidently assume that these problems are vanishingly rare, or are of the sort that will be caught by single-chain diagnostics if they are occurring at all.

ajnafa · February 13, 2022, 5:13am

I was afraid that might be the case. The issue I’ve run into is that, at least in the brms implementation of kfold, parallelization via future fits each of the k models in parallel but runs the chains within each model sequentially. Performing k-fold cross-validation with k = 5 (which results in about 1300/7700 observations being held out per fold) and six chains per model therefore takes approximately five times longer than fitting the original model, so using one or two chains per fold is much more computationally efficient.
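The scheduling arithmetic behind that slowdown can be sketched as follows. This is a back-of-the-envelope model, not a description of what brms or future do internally: it assumes every chain takes the same time (`chain_time`, in arbitrary units), ignores compilation and data-transfer overhead, and the function name and arguments are made up for illustration.

```python
import math


def wall_time(n_folds, n_chains, chain_time, n_workers, parallel_over="folds"):
    """Rough wall-clock estimate for K-fold refits under two schedules.

    parallel_over="folds":  each worker takes one fold and runs that fold's
                            chains one after another.
    parallel_over="chains": every (fold, chain) pair is its own job.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    if parallel_over == "folds":
        rounds = math.ceil(n_folds / n_workers)
        return rounds * n_chains * chain_time
    if parallel_over == "chains":
        rounds = math.ceil(n_folds * n_chains / n_workers)
        return rounds * chain_time
    raise ValueError("parallel_over must be 'folds' or 'chains'")


# Five folds on five workers, chains run sequentially within each fold.
print(wall_time(5, 6, 1.0, 5))             # six chains per fold
print(wall_time(5, 1, 1.0, 5))             # one chain per fold
# Same six chains per fold, but with enough workers to run every chain at once.
print(wall_time(5, 6, 1.0, 30, "chains"))
```

Under these assumptions six sequential chains cost six chain-times per fold. Each fold is fit to fewer observations than the full model, so its chains should run somewhat faster, which is consistent with the roughly fivefold slowdown reported above.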

At least in theory, would running the general convergence diagnostics for each of the folds and comparing them to those of the model fitted using six chains be sufficient to detect potential problems (assuming I used two chains per fold instead of just one)?
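For reference, a per-fold check of this kind could look like the sketch below, which computes the classic split R-hat for one scalar parameter. Because each chain is split in half, it also works with a single chain per fold. Note that this is the older, non-rank-normalized formula, written out here only to show the mechanics; in practice you would use the diagnostics your Stan interface already reports. The function name and the input layout (one list of draws per chain) are assumptions made for the example.

```python
from statistics import mean, variance


def split_rhat(chains):
    """Classic (non-rank-normalized) split R-hat for one scalar parameter.

    chains: list of equal-length lists of post-warmup draws, one per chain.
    """
    halves = []
    for draws in chains:
        half = len(draws) // 2
        if half < 2:
            raise ValueError("each chain needs at least 4 draws")
        # Split every chain into a first and a second half.
        halves.append(draws[:half])
        halves.append(draws[-half:])
    n = len(halves[0])
    if any(len(h) != n for h in halves):
        raise ValueError("chains must have equal length")
    within = mean(variance(h) for h in halves)           # W
    between = n * variance([mean(h) for h in halves])    # B
    if within == 0:
        raise ValueError("draws are constant; R-hat is undefined")
    var_plus = (n - 1) / n * within + between / n
    return (var_plus / within) ** 0.5
```

A value close to 1 for every parameter in every fold is necessary but, as with any convergence diagnostic, not sufficient evidence that the fold posteriors are correct.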

I suppose it may also be useful to note that the model in question is a multilevel beta-binomial model in which the response is a bounded count and time is nested within countries (J = 166). It’s common practice in my field to model the response as a proportion via OLS, but since the denominator varies substantially within countries over time, that approach fails to account for time-varying confounding. It also makes it impossible to tell whether observed effects of the intervention, which itself has both a direct and an indirect effect on the denominator, are due to changes in the denominator over time or to an actual change in the numerator.

avehtari · February 13, 2022, 4:43pm

Your answer is along the lines of what I would have answered.

The usual convergence diagnostics cannot guarantee that problems will be detected, and this holds whether you use many chains or one chain. If you are quite certain that the posteriors are unimodal and “easy”, one chain per fold might be enough for the purposes of cross-validation. If you are worried about whether one chain is enough, then unfortunately the best advice we can give is to compare against the result you get by running several chains.

ajnafa · February 13, 2022, 7:24pm

Based on the ones I’ve run so far, I’m reasonably confident that multimodality isn’t an issue in this particular case. I think my main concern with running only a single chain would be that the effective sample size may be insufficient, particularly in the model that includes a varying slope (though it may be possible to address that by just running the one chain for a larger number of iterations). Since I generally prefer being overly cautious, though, I’ll probably run each fold with four chains or so for the final version before my co-author and I submit it to a journal.

avehtari · February 14, 2022, 3:50pm

If the cross-validation shows big differences between models, then the effective sample size is probably not a big issue, and you can run more iterations for the best full model to better infer the posterior of the quantities of interest.
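One way to judge whether a difference between models is "big" is to compare the difference in expected log predictive density (elpd) with its standard error. The sketch below assumes you already have one held-out log predictive density per observation for each model, in the same observation order; the function names and example values are hypothetical. The standard-error formula is the usual one based on the pointwise values, sqrt(n * var).

```python
from math import sqrt
from statistics import variance


def elpd_summary(pointwise):
    """Return (elpd, standard error) from pointwise log predictive densities."""
    n = len(pointwise)
    if n < 2:
        raise ValueError("need at least two observations")
    return sum(pointwise), sqrt(n * variance(pointwise))


def elpd_difference(pointwise_a, pointwise_b):
    """Return (elpd_a - elpd_b, standard error of the difference).

    Both inputs must refer to the same observations in the same order,
    because the standard error is computed from the paired differences.
    """
    if len(pointwise_a) != len(pointwise_b):
        raise ValueError("models must be evaluated on the same observations")
    diffs = [a - b for a, b in zip(pointwise_a, pointwise_b)]
    return elpd_summary(diffs)


# Made-up pointwise values for two models on four held-out observations.
model_a = [-1.0, -1.2, -0.8, -1.1]
model_b = [-1.5, -1.9, -1.4, -1.3]
diff, se = elpd_difference(model_a, model_b)
print(diff, se)
```

If the difference is many standard errors wide, extra Monte Carlo noise from using fewer chains or draws per fold is unlikely to change the ranking. If the difference is comparable to its standard error, more draws per fold will not rescue the comparison either, because the uncertainty is then dominated by the data rather than by the sampler.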
