kfold(): “Error: New factor levels are not allowed”

tldr

With brms 2.8.0, I now get the following error message when using kfold():

Error: New factor levels are not allowed.
Levels allowed: '1', '2', '4', '5', '7', '9', '10', '11', '12', '13', '14', '15', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48'

In the past when I used kfold() on this exact model, I didn’t get an error. What’s changed and how should I be using the function?

Here are some details

The model is the first in McElreath’s chapter 12. Here’s the code.

# load the packages and get the data
library(tidyverse)
library(rethinking)

data(reedfrogs)
d <- reedfrogs
rm(reedfrogs)

detach(package:rethinking, unload = TRUE)
library(brms)

# adjust the data
d <- 
  d %>%
  mutate(tank = 1:nrow(d))

# fit the model
b12.1 <- 
  brm(data = d, family = binomial,
      surv | trials(density) ~ 0 + factor(tank),
      prior = prior(normal(0, 5), class = b),
      iter = 2000, warmup = 500, chains = 4, cores = 4,
      seed = 12)

If you use the loo() function, you get a warning.

Found 45 observations with a pareto_k > 0.7 in model 'b12.1'. With this many problematic observations, it may be more appropriate to use 'kfold' with argument 'K = 10' to perform 10-fold cross-validation rather than LOO.

In prior versions of brms, kfold(b12.1, K = 10) worked fine. Now I get the error message from above. Since I was using 0 + factor(tank) to fit tank-specific intercepts, I thought adding a group argument might be the answer. However, when I try kfold(b12.1, group = "factor(tank)", cores = 4), I get this error.

Error: New factor levels are not allowed.
Levels allowed: '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48'
Levels found: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48'

What am I missing?

I would generally recommend not using factor() inside a model formula but rather applying it beforehand. However, this is unrelated to your problem, I believe.

brms in general can’t handle predictions for new levels of a factor that is included as a “fixed” effect. This is because we have no way to determine the regression coefficient for a missing level if it wasn’t present in the original data. The fact that this didn’t come up before was likely due to some internal brms checks failing, and I am glad we get this error message now. To allow predicting new levels, specify your factor as a varying effect via (1 | tank).
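To make the mechanism concrete, here is a minimal sketch that uses base R's glm() rather than brms, so it is only an analogy to what happens inside kfold(). The data frame d_toy and its values are made up for illustration. Holding out a tank leaves the training fold without a coefficient for it, and prediction for that tank then fails:

```r
# Toy aggregated-binomial data: one row per tank (hypothetical values,
# same column names as reedfrogs)
d_toy <- data.frame(
  tank    = factor(1:4),
  density = c(10, 10, 25, 25),
  surv    = c(9, 7, 20, 15)
)

# Hold out tank 1, as a single cross-validation fold would
train <- d_toy[d_toy$tank != "1", ]
test  <- d_toy[d_toy$tank == "1", ]

# One "fixed" intercept per tank, fit with base R's glm() instead of brms
fit <- glm(cbind(surv, density - surv) ~ 0 + tank,
           family = binomial, data = train)

# The training fold has no coefficient for tank 1, so prediction fails;
# capture the error message instead of stopping the script
err_msg <- tryCatch(
  {
    predict(fit, newdata = test)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(err_msg)
```

With one row per tank, every fold necessarily holds out tanks the refitted model has never seen, which is why the error shows up for b12.1.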

Okay, I follow. As it turns out, the next model in the chapter does just that.

b12.2 <- 
  brm(data = d, family = binomial,
      surv | trials(density) ~ 1 + (1 | tank),
      prior = c(prior(normal(0, 1), class = Intercept),
                prior(cauchy(0, 1), class = sd)),
      iter = 4000, warmup = 1000, chains = 4, cores = 4,
      seed = 12)

And indeed kfold(b12.2, cores = 4) works just fine.

Backing up a bit, is there a way to use loo() or kfold() to compare these two models, then? Or do we simply prefer the well-behaving b12.2 because it worked with those functions and b12.1 didn’t? Is there a more formal method?

You could stratify by group via loo::kfold_split_stratified(...) and then pass the result to the folds argument of kfold() (this requires enough observations per group that each group can appear in every fold).
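A rough sketch of that idea, assuming the loo package is installed and using a made-up grouping vector with 4 tanks and 10 rows per tank (not the reedfrogs data, which has one row per tank). The final kfold() call is left as a comment because it needs a fitted brms model, here given the hypothetical name fit_long:

```r
library(loo)

# Hypothetical long-format grouping variable: 4 tanks, 10 rows each
tank <- rep(1:4, each = 10)

# Assign each row to one of K folds while spreading every tank
# across the folds
set.seed(12)
folds <- kfold_split_stratified(K = 5, x = tank)

# Rows per tank in each fold; no cell should be empty
print(table(tank, folds))

# Hand the fold assignments to brms (fit_long is a hypothetical model
# fit to these 40 rows):
# kfold(fit_long, folds = folds)
```

Because every tank appears in the training portion of every fold, each refitted model has a coefficient for every tank it is asked to predict.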

Interesting.

In this case, the data are aggregated binomial. There are N = 48 tanks summarized by N = 48 rows in the data. The surv variable is the number of successes across trials, with the number of trials varying and specified by density.

If I follow your suggestion, that wouldn’t work in this instance. Each fold would have to contain all rows. Would it be the case, then, that when groups are specified as a fixed effect and the data are aggregated binomial, kfold() would not be applicable?

Yes, I would tend to say that. Unless you use unaggregated counts of course.
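As a sketch of what unaggregating could look like, the base R snippet below expands each aggregated row into one 0/1 row per individual. The data frame d_agg and its values are made up and only mirror the tank, density, and surv columns of reedfrogs:

```r
# Hypothetical aggregated data: one row per tank
d_agg <- data.frame(
  tank    = 1:3,
  density = c(10, 10, 25),
  surv    = c(9, 7, 20)
)

# Expand each tank into `density` rows: 1 = survived, 0 = died
d_long <- do.call(rbind, lapply(seq_len(nrow(d_agg)), function(i) {
  n_surv <- d_agg$surv[i]
  n_died <- d_agg$density[i] - n_surv
  data.frame(
    tank = d_agg$tank[i],
    surv = rep(c(1L, 0L), times = c(n_surv, n_died))
  )
}))

# The long data carry the same totals as the aggregated data
print(nrow(d_long))
print(tapply(d_long$surv, d_long$tank, sum))

# A tank-as-fixed-effect model could then be fit to d_long with a
# Bernoulli likelihood, e.g. surv ~ 0 + factor(tank) with
# family = bernoulli (assumed here, not run)
```

One caveat, as far as I can tell: cross-validation on the long data scores predictions for individual tadpoles rather than whole tanks, so both models would need to be fit to the same unaggregated data for their kfold() results to be comparable.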

That makes sense. Thanks for the clarification, Paul.