I’m comparing several statistical and machine-learning techniques and want to evaluate them all on cross-validated predictive performance, with the cross-validation folds defined by the levels of a variable in the data (site).
One of the models is a Bayesian “ridge” regression model with a hierarchical Gaussian prior on the predictor slopes. I want to cross-validate it via the kfold.brmsfit function. However, I get the following error when I run it:
Error: Group 'SITE_ID_L' is not a valid grouping factor. Valid groups are: ''
I don’t really understand what that means. I’ve made sure the variable is in the data when fitting the model; however, since it’s not a predictor, the fitted model object doesn’t contain the variable in its “data” element. Could that be the issue, or is it something completely different?
Pinging @paul.buerkner who is most likely to be able to answer but has probably missed this (or happens to be busy this week).
Hey, sorry I missed this post. Can you provide a minimal reproducible example for the problem?
Thank you Paul, here’s a reprex:
library(brms)
library(tidyverse)

df <- map(1:20, ~ rnorm(200, 0, 1)) %>%
  set_names(paste0("V", 1:20)) %>%
  as_tibble() %>%
  mutate(site = sample(letters[1:5], 200, replace = TRUE))

fit1 <- brm(V1 ~ . - site, data = df)
kfold(fit1, K = 5, folds = "grouped", group = "site")
You can only group by variables that are part of the model. This is why brms complains.
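For example, one way to make site a valid grouping factor would be to refit with a group-level (varying) intercept for site. This is a sketch only, assuming the df object from the reprex above; note that adding the group-level term changes the model itself, so it is not just bookkeeping:

```r
# Sketch: refit with site as a grouping factor (varying intercept).
# This is NOT equivalent to the original model; the group-level term
# is added so that kfold() can recognise 'site' as a valid group.
fit2 <- brm(V1 ~ . - site + (1 | site), data = df)
kfold(fit2, K = 5, folds = "grouped", group = "site")
```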
Right, that’s what I thought might be the problem. So does the variable always need to be included as a predictor? Or is there some way of getting around it?
The variable needs to be included in the model for group to be used. However, you can build your folds manually and pass them via the folds argument. That way, you can specify any partitioning you want.
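A sketch of the manual approach, assuming the df and fit1 objects from the reprex above: loo::kfold_split_grouped() turns a grouping variable into an integer vector of per-observation fold indices, which kfold() accepts directly via its folds argument.

```r
library(loo)

# Assign each observation an integer fold id, keeping all rows
# from the same site in the same fold.
fold_ids <- kfold_split_grouped(K = 5, x = df$site)

# Pass the indices directly; 'site' does not need to be in the model.
kfold(fit1, folds = fold_ids)
```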