Kfold predictions on an external test set

For a course I’m teaching, I want to showcase how predictive performance changes from the training set, to the (cross-validated) test set, to an external test set (a new dataset collected in a slightly different context).

I was wondering whether there is an easy way to take a model fitted with brms::kfold and make predictions on the external test set.
I can successfully fit the model (m <- brm(…)).
I can successfully run kfold on the model (kf_m <- kfold(m, K = 5, folds = "stratified", group = "ID")).
I can successfully make predictions (kfp <- kfold_predict(kf_m)).
But when I try to predict on a new dataset, the code fails (kfp_test <- kfold_predict(kf_m, newdata = TestData)).

In particular, I get this error:
Error in standata.brmsfit(.x1, resp = .x2, newdata = .x3, newdata = .x4, :
formal argument "newdata" matched by multiple actual arguments

I’m not sure whether I can access the 5 models within the kfold object and run the predictions manually, but in general it’d be neat to be able to run external predictions from kfold_predict().


But why do you want to make predictions on the test set with K-fold-CV posteriors? K-fold-CV itself estimates how well the full-data posterior predicts new data, and what you are asking for is something different. In a non-Bayesian context it can make more sense, as conditioning on different data sets can stabilise the inference, but in Bayesian inference that is handled by integrating over the posterior. I guess this is why no one has thought it would be useful to make kfold_predict(kf_m, newdata = TestData) work.

Thanks for the answer! Yes, I expect no gain there, and I realize this is a niche use case, but it’s useful to me because:

  • I could do CV model selection, fit the full chosen model and predict a new dataset, but I’m trying to make the more general point that the more robust CV performance assessment (compared to the full model) is still contingent on the dataset being fully representative of the population of interest.
  • It makes it much easier to teach ML pipelines where we can replace the Stan model with e.g. a random forest, without really changing the pipeline.
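
For the manual route raised in the original question, here is a minimal sketch of what predicting from each fold model might look like. It rests on assumptions worth checking against the installed brms version: that kfold() is called with save_fits = TRUE, and that the refitted fold models are then stored in the "fit" column of kf_m$fits. The simulated TrainData and TestData, the formula y ~ x, and the helper names fold_fits, test_draws, rmse_cv and rmse_test are hypothetical and only there to make the example self-contained.

```r
library(brms)

set.seed(1)

# Hypothetical training data, plus an external test set from a shifted context
TrainData <- data.frame(x = rnorm(100))
TrainData$y <- 1 + 2 * TrainData$x + rnorm(100)
TestData <- data.frame(x = rnorm(40, mean = 0.5))
TestData$y <- 1.3 + 2 * TestData$x + rnorm(40, sd = 1.5)

m <- brm(y ~ x, data = TrainData, chains = 2, iter = 1000, refresh = 0)

# save_fits = TRUE keeps the K refitted models in the returned object
kf_m <- kfold(m, K = 5, save_fits = TRUE)

# Cross-validated predictions for the held-out training observations
kfp <- kfold_predict(kf_m)
rmse_cv <- sqrt(mean((colMeans(kfp$yrep) - kfp$y)^2))

# Assumption: the fold models sit in the "fit" column of kf_m$fits
fold_fits <- kf_m$fits[, "fit"]

# Posterior predictive draws (draws x observations) on the external data,
# one matrix per fold model
test_draws <- lapply(fold_fits, posterior_predict, newdata = TestData)

# RMSE of the posterior predictive mean on the external test set, per fold model
rmse_test <- sapply(
  test_draws,
  function(d) sqrt(mean((colMeans(d) - TestData$y)^2))
)

rmse_cv
rmse_test
```

With a grouping term such as (1 | ID) in the model and new IDs in the test data, posterior_predict() would also need allow_new_levels = TRUE. The same loop structure should carry over to other learners, since only the fitting and prediction calls change.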