Is it possible to add a CV fold-dependent data-preprocessing step in `loo::kfold()` or the corresponding 'brms' method?

I want to estimate the predictive performance of a model (fitted with ‘brms’) using loo::kfold() (or the method for brmsfit objects) as leave-group-out cross-validation. The key point is that the training data have to be preprocessed separately for each cross-validation fold, using only that fold's training data.

Specifically, the Bayesian model is a simple linear regression model whose predictor variables are the scores of a principal component analysis (PCA) or some other dimension reduction approach; such an approach has been suggested, for example, in Piironen and Vehtari (2017). For the cross-validation, this means that for each fold, I ideally have to:

  1. fit the PCA on the respective training data,
  2. extract the scores, say for the first 40 principal components, and
  3. use these as values for the predictor variables for the respective CV fold.
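
The three steps above can be sketched in base R as follows. This is only a minimal illustration, not something built into ‘brms’ or ‘loo’: the helper name `pca_fold()`, its arguments, and the simulated data are assumptions made for the example. It uses `prcomp()` to fit the PCA on the training rows and `predict()` to project the held-out rows onto the same components, so the held-out data never influence the preprocessing.

```r
# Hypothetical helper (not part of brms or loo): fold-dependent PCA.
pca_fold <- function(X, y, test_idx, n_comp) {
  X_train <- X[-test_idx, , drop = FALSE]
  X_test  <- X[test_idx, , drop = FALSE]

  # 1. Fit the PCA on the training data of this fold only.
  pca <- prcomp(X_train, center = TRUE, scale. = FALSE)

  # 2. Extract the scores for the first n_comp principal components.
  scores_train <- pca$x[, seq_len(n_comp), drop = FALSE]

  # Held-out rows are centered and rotated with the training-fold PCA.
  scores_test <- predict(pca, newdata = X_test)[, seq_len(n_comp), drop = FALSE]

  # 3. Use the scores as predictor values for this fold.
  list(
    train = data.frame(y = y[-test_idx], scores_train),
    test  = data.frame(y = y[test_idx], scores_test)
  )
}

# Simulated example data (assumption): 6 groups of 10 observations each.
set.seed(1)
n <- 60
p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", seq_len(p))))
y <- rnorm(n)
folds <- rep(1:6, each = 10)  # leave-group-out: one fold per group

fold1 <- pca_fold(X, y, test_idx = which(folds == 1), n_comp = 3)
```

Here `fold1$train` would be passed as the data for fitting the model and `fold1$test` as the new data for evaluating it. The number of components has to be at most the smaller of the number of training rows and the number of columns of `X`.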

This is what I mean by a “CV fold-dependent data-preprocessing step” or “training data-dependent preprocessing step”.

As far as I can see, brms::kfold.brmsfit() does not support cross-validation where the training data are preprocessed depending on the CV fold. Is this correct?

You could follow the generic K-fold-CV example in the loo vignette “Holdout validation and K-fold cross-validation of Stan programs with the loo package”, and add the pre-processing there.
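
As a rough sketch of what such a hand-written loop might look like with ‘brms’ (this is not the code from the vignette): the function names `log_mean_exp_cols()` and `kfold_manual()` and the user-supplied `prep` function are hypothetical. `prep(test_idx)` is assumed to return a list with `train` and `test` data frames that have already been preprocessed for that fold. Only `brms::brm()`, `update()` with `newdata`, and `brms::log_lik()` with `newdata` are actual ‘brms’ calls.

```r
# Numerically stable log(mean(exp(x))) for each column of a
# draws x observations log-likelihood matrix.
log_mean_exp_cols <- function(ll) {
  m <- apply(ll, 2, max)
  m + log(colMeans(exp(sweep(ll, 2, m))))
}

# Hypothetical manual K-fold loop; requires the brms package when called.
# `prep` is a user-supplied function doing the fold-dependent preprocessing.
kfold_manual <- function(formula, prep, folds, ...) {
  elpd_i <- rep(NA_real_, length(folds))
  fit <- NULL
  for (k in sort(unique(folds))) {
    test_idx <- which(folds == k)
    d <- prep(test_idx)  # list(train = <data.frame>, test = <data.frame>)

    # Fit on the first fold, then refit the same model on new training data.
    fit <- if (is.null(fit)) {
      brms::brm(formula, data = d$train, ...)
    } else {
      update(fit, newdata = d$train)
    }

    # Pointwise log predictive density of the held-out observations.
    ll <- brms::log_lik(fit, newdata = d$test)
    elpd_i[test_idx] <- log_mean_exp_cols(ll)
  }
  list(pointwise = elpd_i, elpd_kfold = sum(elpd_i))
}
```

A call could then look like `kfold_manual(y ~ ., prep = my_prep, folds = group_id)`, where `my_prep` and `group_id` stand in for your own preprocessing function and grouping variable.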

Thanks, @avehtari !

I think this example will be helpful for coding my own CV pipeline. I initially wanted to avoid the extra work of writing one, but with your example it should be fairly straightforward.

I then assume that I have not misunderstood any argument of brms::kfold.brmsfit(), and that there is no such built-in option available.