Kfold(): “Error: New factor levels are not allowed”

Solomon · March 31, 2019, 8:27pm

tldr

With brms 2.8.0, I now get the following error message when using kfold()

Error: New factor levels are not allowed.
Levels allowed: '1', '2', '4', '5', '7', '9', '10', '11', '12', '13', '14', '15', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48'

In the past when I used kfold() on this exact model, I didn’t get an error. What’s changed and how should I be using the function?

Here are some details

The model is the first in McElreath’s chapter 12. Here’s the code.

# load the packages and get the data
library(tidyverse)
library(rethinking)

data(reedfrogs)
d <- reedfrogs
rm(reedfrogs)

detach(package:rethinking, unload = TRUE)
library(brms)

# adjust the data
d <- 
  d %>%
  mutate(tank = 1:nrow(d))

# fit the model
b12.1 <- 
  brm(data = d, family = binomial,
      surv | trials(density) ~ 0 + factor(tank),
      prior = prior(normal(0, 5), class = b),
      iter = 2000, warmup = 500, chains = 4, cores = 4,
      seed = 12)

If you use the loo() function, you get a warning.

Found 45 observations with a pareto_k > 0.7 in model 'b12.1'. With this many problematic observations, it may be more appropriate to use 'kfold' with argument 'K = 10' to perform 10-fold cross-validation rather than LOO.

In prior versions of brms, kfold(b12.1, K = 10) worked fine. Now I get the error message from above. Since I was using 0 + factor(tank) to fit tank-specific intercepts, I thought adding a group argument might be the answer. However, when I try kfold(b12.1, group = "factor(tank)", cores = 4), I get this error.

Error: New factor levels are not allowed.
Levels allowed: '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48'
Levels found: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48'

What am I missing?

paul.buerkner · March 31, 2019, 8:36pm

I would generally recommend not using factor() inside a model formula but rather applying it to the data beforehand. However, I believe this is unrelated to your problem.

brms in general can’t handle predictions for new levels of a factor that is included as a “fixed” effect. This is because we have no way to determine the regression coefficient for a level that wasn’t present in the data used for fitting, which is exactly what happens when a fold holds out every row of a tank. The fact that this didn’t come up before was likely due to some internal brms checks failing, and I am glad we get this error message now. To allow predicting new levels, specify your factor as a varying effect via (1 | tank).
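The same limitation can be reproduced without brms. The following is a minimal base R sketch, using lm() on made-up toy data (the values and the `toy` data frame are hypothetical), of what happens when a fold removes every row of one factor level. It illustrates the general problem only; it is not how brms performs its checks internally.

```r
# Toy data: three tanks, two observations each (hypothetical values)
toy <- data.frame(
  tank = factor(rep(c("1", "2", "3"), each = 2)),
  y    = c(1.0, 1.2, 2.1, 1.9, 3.2, 2.8)
)

# Fit on tanks 2 and 3 only, as a fold that held out tank 1 would
train <- droplevels(subset(toy, tank != "1"))
fit   <- lm(y ~ 0 + tank, data = train)

# Predicting the held-out tank fails: no coefficient exists for level "1"
held_out <- subset(toy, tank == "1")
res <- tryCatch(
  predict(fit, newdata = held_out),
  error = function(e) conditionMessage(e)
)
print(res)
```

Here `res` ends up holding an error message about new factor levels rather than predictions, which is the base R analogue of the kfold() error above.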

Solomon · March 31, 2019, 8:51pm

Okay, I follow. As it turns out, the next model in the chapter does just that.

b12.2 <- 
  brm(data = d, family = binomial,
      surv | trials(density) ~ 1 + (1 | tank),
      prior = c(prior(normal(0, 1), class = Intercept),
                prior(cauchy(0, 1), class = sd)),
      iter = 4000, warmup = 1000, chains = 4, cores = 4,
      seed = 12)

And indeed kfold(b12.2, cores = 4) works just fine.

Backing up a bit, is there a way to use loo() or kfold() to compare these two models, then? Is it just the case that we conclude that because b12.2 worked with those functions and b12.1 didn’t that we’ll just prefer the well-behaving b12.2? Is there a more formal method?

paul.buerkner · March 31, 2019, 9:08pm

You could stratify by group via loo::kfold_split_stratified(...) and then pass the result to the folds argument of kfold(). This only works if each group has enough observations to appear in every fold.
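A rough sketch of that suggestion, assuming a hypothetical long-format data frame `long` with several rows per tank (the reedfrogs data as loaded above has only one row per tank, so it would not qualify). The `fit` object in the commented-out call is likewise hypothetical.

```r
library(loo)

# Hypothetical long-format data: 4 tanks with 6 rows each
long <- data.frame(tank = factor(rep(1:4, each = 6)))

set.seed(12)
K <- 3

# One fold id per row, balanced within each level of `tank`
folds <- kfold_split_stratified(K = K, x = long$tank)

# Rows = tanks, columns = folds; every cell should be non-zero
print(table(long$tank, folds))

# With a brms model `fit` estimated on `long`, the folds would be passed as:
# kfold(fit, folds = folds)
```

Because every tank contributes rows to every fold, each refit still sees all factor levels, so no held-out row carries a level the model has not estimated.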

Solomon · March 31, 2019, 9:35pm

Interesting.

In this case, the data are aggregated binomial. There are N = 48 tanks summarized by N = 48 rows in the data. The surv variable is the number of successes across trials, with the number of trials varying and specified by density.

If I follow your suggestion, that wouldn’t work in this instance. Each fold would have to contain all rows. Would it be the case, then, that when groups are specified as a fixed effect and the data are aggregated binomial, kfold() would not be applicable?

paul.buerkner · April 1, 2019, 4:59am

Yes, I would tend to say that. Unless you use unaggregated counts of course.
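As a minimal sketch of what "unaggregated counts" could look like, the base R code below expands aggregated binomial rows into one 0/1 row per trial. The small `agg` data frame is made up for illustration; the column names simply mirror tank, density, and surv from the thread.

```r
# Hypothetical aggregated data: one row per tank
agg <- data.frame(
  tank    = 1:3,
  density = c(10, 10, 25),  # number of trials
  surv    = c(9, 10, 7)     # number of successes
)

# Expand each tank into `surv` ones followed by `density - surv` zeros
long <- do.call(rbind, lapply(seq_len(nrow(agg)), function(i) {
  data.frame(
    tank = agg$tank[i],
    surv = rep(c(1L, 0L), times = c(agg$surv[i], agg$density[i] - agg$surv[i]))
  )
}))
long$tank <- factor(long$tank)

# One row per trial; per-tank sums recover the original counts
print(nrow(long))
print(tapply(long$surv, long$tank, sum))
```

With data in this shape, the fixed-effects model could presumably be refit as surv ~ 0 + tank with family = bernoulli, and each tank would then have enough rows to be spread across folds. Any model it is compared against would need to be fit to the same long-format data.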

Solomon · April 1, 2019, 11:44am

That makes sense. Thanks for the clarification, Paul.

BalthasarBickel · August 13, 2020, 1:19pm

I have a somewhat similar problem where I can’t use kfold() (or loo(), for that matter), and so I am considering the WAIC as an alternative. But I like the Bayesian bootstrap that loo_model_weights() applies when weighting PSIS-derived elpds (‘elpd_loo’) for its regularization effects (Yao et al. 2018:925), and so I was wondering whether the Bayesian bootstrap could be used when weighting WAIC elpds (‘elpd_waic’) as well. I’d be grateful for any advice on this.
