Say you have a multilevel model that you ran with six chains, 2,000 warmup iterations, and 2,000 sampling iterations, and you’ve verified that the general suite of convergence diagnostics doesn’t indicate any clear problems suggesting non-convergence.

Having verified that multiple chains arrive at approximately the same stationary answer for the initial fit, is it still necessary to use six chains for each of the k folds? Or would it be acceptable (or at least not wrong) to fit each fold with only a single chain, particularly when doing so yields a much more computationally tractable solution due to more efficient parallelization (think hours versus days)?

I’ve been trying to find an answer to this question for a while, to no avail, and there seems to be quite a bit of conflicting information on the subject. Any input is appreciated.

@avehtari will have a more definitive answer, but in the meantime…

We need the cross-validation posteriors to be correct, so the question boils down to: when and how might fitting fail for a fold when it works for the full dataset? There are at least two potential failure modes:

The posterior geometry conditional on a fold could be substantially nastier than the posterior geometry conditional on all the data.

Fitting for a fold could fail stochastically. This could happen due to bad warmup, or due to a chain finding a minor mode and getting stuck there. The more folds you fit, the higher the probability that you’ll get unlucky in this way at least once, and if it happens even once (and you’re unable to diagnose it) it could mess up your inference substantially.
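The “at least once” point can be made concrete with a quick back-of-the-envelope sketch. The per-fit failure probability `p` below is purely hypothetical, not an estimate for any particular model:

```python
# Probability that at least one of k independent fold fits goes wrong,
# given a small per-fit failure probability p (hypothetical values).
def p_at_least_one_failure(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

# Even a 1% chance of a bad warmup or a stuck chain per fit compounds
# quickly as the number of fold fits grows.
for k in (1, 5, 10, 50):
    print(k, round(p_at_least_one_failure(0.01, k), 4))
```

With p = 0.01, ten folds already give roughly a 10% chance of at least one bad fold fit.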

For certain classes of model, it might be reasonable to confidently assume that these problems are vanishingly rare, or are of the sort that will be caught by single-chain diagnostics if they are occurring at all.

I was afraid that might be the case. The issue I’ve run into is that, at least in the brms implementation of kfold, parallelization via future fits each of the k models in parallel but runs the chains within each model sequentially. So performing k-fold cross-validation with k = 5 (about 1,300 of 7,700 observations held out per fold) and six chains per model takes approximately five times longer than fitting the original model, whereas using one or two chains is much more computationally efficient.
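For what it’s worth, the wall-clock arithmetic behind that observation is simple: if the fold fits run in parallel but the chains within each fit run sequentially, total wall time scales with the number of chains rather than the number of folds. A minimal sketch with made-up timings (`per_chain_minutes` and `workers` are hypothetical illustration values, and this ignores that each fold fits slightly less data):

```python
import math

def kfold_wall_time(per_chain_minutes: float, n_chains: int, k: int, workers: int) -> float:
    # Each fold fit costs n_chains * per_chain_minutes because its chains
    # run sequentially; the k fold fits are spread across `workers`
    # parallel processes, so they complete in ceil(k / workers) waves.
    waves = math.ceil(k / workers)
    return waves * n_chains * per_chain_minutes

# Hypothetical 30-minute chains, k = 5 folds, all folds fit in parallel:
print(kfold_wall_time(30, n_chains=6, k=5, workers=5))  # 180
print(kfold_wall_time(30, n_chains=1, k=5, workers=5))  # 30
```

Under these toy numbers, dropping from six chains to one cuts the cross-validation wall time by a factor of six, which matches the “hours versus days” trade-off described above.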

At least in theory, would running the general convergence diagnostics for each of the folds and comparing them against the model fitted using six chains be sufficient to detect potential problems (assuming I used two chains per model instead of just one)?

I suppose it may also be useful to note that the model in question is a multilevel beta-binomial specification where the response is a bounded count and time is nested within countries (J = 166). It’s common practice in my field to model the response as a proportion via OLS, but since the denominator varies substantially within countries over time, that approach fails to account for time-varying confounding. It makes it impossible to tell whether observed effects of the intervention, which itself has both a direct and an indirect effect on the denominator, are due to changes in the denominator over time or to an actual change in the numerator.

The usual convergence diagnostics are not guaranteed to detect problems, and this holds both when using many chains and when using one chain. If you are quite certain that the posteriors are unimodal and “easy”, one chain per fold might be enough for the purposes of cross-validation. If you are worried about whether one chain is enough, unfortunately the best advice we can give is to compare against the result you get by running several chains.
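For intuition on what those diagnostics can and can’t see with few chains: the split-R-hat reported in Stan’s summaries splits each chain in half, so it retains some power even with one or two chains (it catches within-chain drift, though a single chain stuck in one mode can still look fine). As a language-agnostic sketch, here is the classic (non-rank-normalized) split-R-hat statistic on simulated draws; the actual implementations in brms/rstan use the newer rank-normalized variant:

```python
import numpy as np

def split_rhat(chains: np.ndarray) -> float:
    """Classic split-R-hat for an array of shape (n_chains, n_draws)."""
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    # Split each chain in half so non-stationarity within a chain
    # shows up as between-"chain" disagreement.
    splits = np.concatenate([chains[:, :half], chains[:, half:2 * half]], axis=0)
    m, n = splits.shape
    chain_means = splits.mean(axis=1)
    between = n * chain_means.var(ddof=1)          # between-split variance B
    within = splits.var(axis=1, ddof=1).mean()     # within-split variance W
    var_plus = (n - 1) / n * within + between / n  # marginal variance estimate
    return float(np.sqrt(var_plus / within))

rng = np.random.default_rng(1)
mixed = rng.normal(size=(2, 2000))          # two well-mixed chains
stuck = np.stack([rng.normal(0, 1, 2000),
                  rng.normal(3, 1, 2000)])  # one chain stuck elsewhere
print(split_rhat(mixed))  # close to 1.0
print(split_rhat(stuck))  # far above the usual 1.01 threshold
```

The second case is the failure mode described earlier in the thread: two chains landing in different places produce a large split-R-hat, whereas running only one of those chains alone would not.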

Based on the ones I’ve run so far, I’m reasonably confident that multimodality isn’t an issue in this particular case. My main concern with running only a single chain is that the effective sample size may be insufficient, particularly in the model that includes a varying slope (though it may be possible to address that by running the one chain for a larger number of iterations). Since I generally prefer being overly cautious, I’ll probably run each fold with four chains or so for the final version before my co-author and I submit it to a journal.

If the cross-validation shows big differences between models, then effective sample size is probably not a big issue, and you can run more iterations for the best full model to better infer the posterior of the quantities of interest.