I have fitted a logistic regression with a regularised horseshoe prior using rstanarm. The number of covariates is over 7000 and a few of them are categorical.
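For context, a rough sketch of what such a fit looks like in rstanarm (the outcome name, formula, prior hyperparameters, and sampler settings below are placeholders, not the exact call I used):

library(rstanarm)

# Illustrative only: logistic regression with a regularised horseshoe prior on the coefficients.
fit1 <- stan_glm(
  outcome ~ .,               # placeholder formula; the real design has > 7000 covariates
  family = binomial(link = "logit"),
  data = data_train,
  prior = hs(global_scale = 0.01, slab_df = 4, slab_scale = 2.5),  # regularised horseshoe
  chains = 4, iter = 2000, seed = 123
)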
One of the categorical covariates is race; it has the following levels:
> data_train$RACE |> table()
WHITE                                        719
BLACK OR AFRICAN AMERICAN                     95
ASIAN                                         17
UNKNOWN                                        8
AMERICAN INDIAN OR ALASKA NATIVE               1
NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER      1
The code I am running is identical to that in projpred’s vignette:
Sys.time()
refm_obj <- get_refmodel(fit1)
Sys.time()

# Refit the reference model K times:
cv_fits <- run_cvfun(
  refm_obj,
  ### Only for the sake of speed (not recommended in general):
  K = 5
  ###
)
saveRDS(cv_fits, paste0("models/", p0, "/cv_fits.rds"))

# For running projpred's CV in parallel (see cv_varsel()'s argument `parallel`):
doParallel::registerDoParallel(ncores)

# Final cv_varsel() run:
cvvs <- cv_varsel(
  refm_obj,
  cv_method = "kfold",
  cvfits = cv_fits,
  ### Only for the sake of speed (not recommended in general):
  method = "L1",
  nclusters_pred = 20,
  ###
  nterms_max = 9,
  parallel = TRUE,
  ### In interactive use, we recommend not to deactivate the verbose mode:
  verbose = FALSE
  ###
)
However, the call to run_cvfun() fails with the following error:
Fitting model 1 out of 5
Error in model.frame.default(Terms, newdata, xlev = object$xlevels) :
factor RACE has new levels AMERICAN INDIAN OR ALASKA NATIVE
Calls: run_cvfun ... pp_data -> .pp_data -> model.frame -> model.frame.default
In addition: Warning message:
Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
With only 1 observation having RACE = "AMERICAN INDIAN OR ALASKA NATIVE", is it possible that one of the K-fold models is being fit on a training subset containing no observations with that level, but is then asked to predict on a held-out subset that does contain it? That would explain the error you're seeing.
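As an illustration (toy data, unrelated to your model), this is the generic R behaviour that produces exactly that model.frame error when predicting on a factor level unseen during fitting:

# Toy data: the level "C" never appears in the training data.
d_train <- data.frame(
  y = rbinom(20, 1, 0.5),
  race = factor(rep(c("A", "B"), each = 10), levels = c("A", "B"))
)
fit <- glm(y ~ race, family = binomial(), data = d_train)

# Held-out data containing the unseen level "C":
d_test <- data.frame(race = factor("C", levels = c("A", "B", "C")))
predict(fit, newdata = d_test)
# Error in model.frame.default(Terms, newdata, xlev = object$xlevels) :
#   factor race has new levels C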
Yes, this is the reason. Usually the predictive functions would be given the argument allow_new_levels = TRUE, but at the moment there is no easy way to pass it via run_cvfun(). You could do it by using init_refmodel() instead of get_refmodel(). Before trying that, you could test whether you get anything useful if you drop the groups with too few observations (see the sketch below). If that works, you can then spend a bit more time figuring out how to include all the data.
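An untested sketch of that dropping step; lumping the rare levels into an "OTHER" category (with an arbitrary threshold of 10 here) is just one option, and simply removing those rows is another:

# Collapse RACE levels with fewer than 10 observations into "OTHER",
# then refit the reference model on the modified data and rerun run_cvfun().
race_counts <- table(data_train$RACE)
rare_levels <- names(race_counts)[race_counts < 10]
data_train$RACE <- as.character(data_train$RACE)
data_train$RACE[data_train$RACE %in% rare_levels] <- "OTHER"
data_train$RACE <- factor(data_train$RACE)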
Please report how your experiments go, as it's not common for someone to report using projpred with that many covariates.
Thanks, Aki. I just want to add that, as far as I know, allow_new_levels only exists in brms (not in rstanarm, I think), and there it only refers to variables for group-level terms (i.e., grouping variables such as grp in (1 | grp)), not to variables for population-level terms (I guess the factor RACE has population-level effects). In lme4, the similarly named argument allow.new.levels also seems to refer to variables for group-level terms only.
For both brms and rstanarm reference models, the corresponding get_refmodel() methods (brms:::get_refmodel.brmsfit() and projpred:::get_refmodel.stanreg(), respectively) ensure that new levels in grouping variables are allowed (e.g., in brms:::get_refmodel.brmsfit(), allow_new_levels = TRUE is set at the appropriate place, and projpred:::get_refmodel.stanreg() handles this appropriately as well). So this shouldn't require any action from the user.
If there are rare levels of a factor with population-level effects, it might make sense to use loo::kfold_split_stratified() to construct the folds manually (but – as pointed out above – if there is just a single observation for some level, this won’t help either).
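A small illustration of that fold construction (the seed, K = 5, and the coverage check are just for demonstration):

library(loo)
set.seed(123)
# Stratified fold assignment, so each fold's RACE proportions roughly match the full data:
folds <- kfold_split_stratified(K = 5, x = data_train$RACE)

# Check whether any fold's *training* part is missing a RACE level
# (unavoidable for levels with a single observation):
for (k in 1:5) {
  missing_lvls <- setdiff(levels(data_train$RACE), unique(data_train$RACE[folds != k]))
  if (length(missing_lvls) > 0) {
    message("Training data for fold ", k, " is missing: ",
            paste(missing_lvls, collapse = ", "))
  }
}

How these fold indices are then passed on to the K refits depends on the projpred and rstanarm versions in use, so I'll leave that part out here.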
Thank you to everybody who has replied. This was useful: I ended up removing the levels with few observations, and that covariate turned out not to be relevant anyway.
I will report on the results when I finish the experiments.