Projpred error: "factor has new levels"

I have fitted a logistic regression with a regularised horshoe prior using rstanarm. The number of covariates is over 7000 and a few of them are categorical.

One of the categorical covariates is race and has tree levels:

> data_train$RACE |> table()

                                    WHITE 
                                      719 
                BLACK OR AFRICAN AMERICAN 
                                       95 
                                    ASIAN 
                                       17 
                                  UNKNOWN 
                                        8 
         AMERICAN INDIAN OR ALASKA NATIVE 
                                        1 
NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER 
                                        1 

The code I am running is identical to that in projpred’s vignette:

Sys.time()
refm_obj <- get_refmodel(fit1)
Sys.time()

# Refit the reference model K times:
cv_fits <- run_cvfun(
  refm_obj,
  ### Only for the sake of speed (not recommended in general):
  K = 5
  ###
)
saveRDS(cv_fits, paste0("models/", p0, "/cv_fits.rds"))

# For running projpred's CV in parallel (see cv_varsel()'s argument `parallel`):
doParallel::registerDoParallel(ncores)

# Final cv_varsel() run:
cvvs <- cv_varsel(
  refm_obj,
  cv_method = "kfold",
  cvfits = cv_fits,
  ### Only for the sake of speed (not recommended in general):
  method = "L1",
  nclusters_pred = 20,
  ###
  nterms_max = 9,
  parallel = TRUE,
  ### In interactive use, we recommend not to deactivate the verbose mode:
  verbose = FALSE
  ### 
)

However, the call to run_cvfun() fails with the following error:

Fitting model 1 out of 5
Error in model.frame.default(Terms, newdata, xlev = object$xlevels) : 
  factor RACE has new levels AMERICAN INDIAN OR ALASKA NATIVE
Calls: run_cvfun ... pp_data -> .pp_data -> model.frame -> model.frame.default
In addition: Warning message:
Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.

Any idea?
Thanks

With only 1 observation with RACE = “AMERICAN INDIAN OR ALASKA NATIVE,” is it possible that a model is being fit on a data subset that includes no observations with RACE = “AMERICAN INDIAN OR ALASKA NATIVE” but then is trying to generate predictions on a holdout data subset that does contain an observation with RACE = “AMERICAN INDIAN OR ALASKA NATIVE”? Such a situation might be causing the error you’re seeing.

1 Like

Yes, this is the reason. Usually the predictive functions would be given an argument allow_new_levels=TRUE, but at the moment there is no easy way to pass it via run_cvfun(). You could do it by using init_refmodel() instead of get_refmodel(). Before trying that, you could test whether you get anything useful if you drop out the groups with too few observations. If that works, then you can spend a bit more time to figure out including all data.

Please report how your experiments did go, as it’s not common someone reports using projpred for that many covariates.

1 Like

Thanks, Aki. I just want to add that as far as I know, allow_new_levels only exists in brms (not in rstanarm, I think) and there, it only refers to variables for group-level terms (i.e., grouping variables such as grp in (1 | grp)), not to variables for population-level terms (I guess the factor RACE will have population-level effects). In lme4, the similarly named argument allow.new.levels also seems to refer to variables for group-level terms only.

2 Likes

Yeah rstanarm doesn’t have this argument, but it does allow new levels (no argument necessary).

1 Like

Yes, sorry, I forgot to mention that:

For both, brms and rstanarm reference models, the corresponding get_refmodel() methods (brms:::get_refmodel.brmsfit() and projpred:::get_refmodel.stanreg(), respectively) will ensure that new levels in grouping variables are allowed (e.g., in brms:::get_refmodel.brmsfit(), allow_new_levels = TRUE is set at the appropriate place and in projpred:::get_refmodel.stanreg(), this is also handled appropriately). So this shouldn’t require any action from the user.

If there are rare levels of a factor with population-level effects, it might make sense to use loo::kfold_split_stratified() to construct the folds manually (but – as pointed out above – if there is just a single observation for some level, this won’t help either).

3 Likes

Thank you to everybody who has replied. This was useful - I ended up removing the levels with few covariates. That covariate turned out not to be relevant anyway.

I will report on the results when I finish the experiments

VĂ­ctor

1 Like