Kfold.brmsfit doesn't work when model includes t2 smoothing term

benwasserman · February 4, 2023, 9:28pm

I am having trouble implementing k-fold cross-validation on a brms models with t2() smoothing terms that include 2 or 3 predictors. This does not happen for models with s() smoother terms with a single predictor inside, or models with linear terms for the 2-3 predictors. This code was working in September when brms was on version 2.17.0 and mgcv was on version 1.8-40, but now, with brms 2.18.0 and mgcv 1.8-41, the kfold will fail with models that include t2 terms. The brmsfit model itself works just fine.

I have data on the average standardized test scores across a number of schools, disaggregated by race. Each school has data for the average score of 1, 2, or 3 races. I’m interested in whether using a large number of school-level predictors improves our predictions of disparities in test scores as opposed to a simple univariate predictor (FRPL-the proportion of students eligible for free or reduced-price lunch, a common metric of economic need). I have used nonmetric multidimensional scaling (NMDS) to come up with 2 or 3 dimensional ordinations on a set of 12 predictors. I then use either FRPL, or the first 1, 2 or 3 ordination scores as predictors of test score, with race as a covariate, and a random effect of school.

library(brms)
library(loo)
library(future)

SATscores <- read.csv('SATscores.csv', stringsAsFactors = TRUE)
SATscores$School.ID <- as.factor(SATscores$School.ID)

#Fit Models
#Linear, 2 predictors
NMDS2lin <- brm(bf(SATMathAvg ~ (MDS1.Math + MDS2.Math)*Race  + (1 | School.ID)),  data = SATscores, family = gaussian(), save_pars = save_pars(all = TRUE), cores = 8)
#s smoother, 1 predictor 
NMDS1s <- brm(bf(SATMathAvg ~ Race + s(MDS1.Math, m=2) + s(MDS1.Math, by = Race, m=1, bs="fs") + (1|School.ID)), data = SATscores, family = gaussian(), save_pars = save_pars(all = TRUE), iter = 4000, cores = 8, control = list(adapt_delta = 0.99))
#t2 smoother, 2 predictors
NMDS2t2 <- brm(bf(SATMathAvg ~ Race + t2(MDS1.Math, MDS2.Math, m=2) + t2(MDS1.Math, MDS2.Math, by = Race, m=1, bs="fs") + (1|School.ID)), data = SATscores, family = gaussian(), save_pars = save_pars(all = TRUE), iter = 4000, cores = 8, control = list(adapt_delta = 0.975))


#Kfold
#Linear, 2 predictors
NMDS2linsplit <- kfold_split_grouped(K = 10, x = NMDS2lin$data$School.ID)
plan(multisession, workers = 10)
NMDS2linkfold <- kfold(NMDS2lin, folds = NMDS2linsplit)

#s, 1 predictor
NMDS1ssplit <- kfold_split_grouped(K = 10, x = NMDS1s$data$School.ID)
plan(multisession, workers = 10)
NMDS1skfold <- kfold(NMDS2lin, folds = NMDS1ssplit)


#t2, 2 predictors
NMDS2t2split <- kfold_split_grouped(K = 10, x = NMDS2t2$data$School.ID)
plan(multisession, workers = 10)
NMDS2t2kfold <- kfold(NMDS2t2, folds = NMDS2t2split)

traceback()

Here is the error message, and the results of traceback()

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 't': the dims contain missing values
In addition: Warning messages:
1: Rows containing NAs were excluded from the model. 
2: Rows containing NAs were excluded from the model. 
3: Rows containing NAs were excluded from the model. 
> traceback()
25: stop(condition)
24: signalConditions(future, exclude = getOption("future.relay.immediate", 
        "immediateCondition"), resignal = TRUE)
23: value.Future(futures[[ks]])
22: future::value(futures[[ks]])
21: .kfold(x = .x1, newdata = .x2, resp = .x3, model_name = .x4, 
        K = .x5, Ksub = .x6, folds = .x7, group = .x8, save_fits = .x9, 
        future_args = .x10)
20: eval(expr, envir, ...)
19: eval(expr, envir, ...)
18: eval2(call, envir = args, enclos = envir)
17: do_call(paste0(".", criterion), args)
16: .fun(criterion = .x1, K = .x2, Ksub = .x3, folds = .x4, group = .x5, 
        resp = .x6, save_fits = .x7, future_args = .x8, x = .x9, 
        model_name = .x10, use_stored = .x11)
15: eval(expr, envir, ...)
14: eval(expr, envir, ...)
13: eval2(call, envir = args, enclos = envir)
12: do_call(compute_loo, args)
11: .fun(models = .x1, criterion = .x2, K = .x3, Ksub = .x4, folds = .x5, 
        group = .x6, compare = .x7, resp = .x8, save_fits = .x9, 
        future_args = .x10, use_stored = .x11)
10: eval(expr, envir, ...)
9: eval(expr, envir, ...)
8: eval2(call, envir = args, enclos = envir)
7: do_call(compute_loolist, args)
6: kfold.brmsfit(NMDS2t2, folds = NMDS2t2split)
5: kfold(NMDS2t2, folds = NMDS2t2split) at Rep_Ex_SAT_Scores.R#32
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("~/Library/CloudStorage/OneDrive-UniversityofConnecticut/Kfold problem rep ex/Rep_Ex_SAT_Scores.R", 
       echo = TRUE)

SATscores.csv (42.7 KB)

My best guess is that some object name from t2 is not being properly passed to kfold.brmsfit but I’m not sure, and I certainly could have made a different mistake. Thanks for your help!

@paul.buerkner

Operating System: macOS 13.1
Interface Version: R version 4.2.2, brms_2.18.0
Compiler/Toolkit:

avehtari · February 6, 2023, 1:34pm

Although this was already tagged with brms, I also tag @paul.buerkner as this might be a bug

paul.buerkner · February 6, 2023, 6:19pm

Thanks! I will take a look.

paul.buerkner · February 7, 2023, 8:40am

I am unfortunately not sure either what is going on. Perhaps some mgcv internals(?)

I have started running your example on my machine but then stopped because it takes too long to run. Can you come up with a minimal reproducible example that takes just a few minutes to run until the error occurs?

benwasserman · February 8, 2023, 6:13pm

Thanks @paul.buerkner. Sorry, this should be a cleaner example. It took my 2019 Macbook pro about 5 minutes.

I further discovered that the problem seems to stem from the second t2 term with by = Group, bs = 'fs'. I wrote this to allow for the smoother for each group to have both shared and independent contributions after reading Hierarchical generalized additive models in ecology: an introduction with mgcv [PeerJ].

library(MASS)
library(tidyverse)
library(brms)
library(loo)
library(future)

##How many cores would you like to use for parallel processing?
setcores <- 8

##Simulate data
#Sample size
ss <- 50
#School-level predictors (NMDS)
schools <- data.frame(SchoolID = as.factor(1:ss), MDS = mvrnorm(n = ss, mu = c(0,0), Sigma = matrix(c(2,0,0,1), nrow = 2)))

#simulate scores for different groups, assume the relationship is linear, which hopefully means the brms model fits more easily than the real data
schools$GrpA <- rnorm(n = ss, mean = 500 + 10*schools$MDS.1, sd = 5)
schools$GrpB <- rnorm(n = ss, mean = 450 + 8*schools$MDS.1, sd = 5)
schools$GrpC <- rnorm(n = ss, mean = 450 + 8*schools$MDS.1, sd = 5)

schools <- schools %>% pivot_longer(cols = 4:6, names_to = "Group", values_to = "TestScore")
#ggplot(schools, aes(x = MDS.1, y = TestScore, color = Group)) + geom_point()


##Model of Interest: t2 with "random effect"   
#fit model
mod <- brm(bf(TestScore ~ Group + t2(MDS.1, MDS.2, m=2) + t2(MDS.1, MDS.2, by = Group, m=1, bs="fs") + (1|SchoolID)), data = schools, family = gaussian(), save_pars = save_pars(all = TRUE), iter = 2000, cores = setcores, control = list(adapt_delta = 0.8))
split <- kfold_split_grouped(K = 10, x = mod$data$SchoolID)
plan(multisession, workers = setcores)
modkfold <- kfold(mod, Ksub = 1, folds = split)


#You don't need to run this, but this is a similar model that does work
#Kfold can work unless the t2 term includes the parameter bs = "fs"
#mod2 <- brm(bf(TestScore ~ Group + t2(MDS.1, MDS.2, m=2) + (1|SchoolID)), data = schools, family = gaussian(), save_pars = save_pars(all = TRUE), iter = 2000, cores = setcores, control = list(adapt_delta = 0.8))
#split2 <- kfold_split_grouped(K = 10, x = mod2$data$SchoolID)
#plan(multisession, workers = setcores)
#modkfold2 <- kfold(mod2, Ksub = 1, folds = split2)

paul.buerkner · February 13, 2023, 10:44am

The problem seems to occur in mgcv::PredictMat. It may be that my preprocessing of the post-processing of mgcv splines is not correct for fs splines. The relevant preprocessinbg code is in brms::s2rPred . However, it was originally provided by Simon Wood himself and I have no idea how to edit it to make it work with fs splines in your case. Perhaps it is worth to ask Simon directly?

benwasserman · February 13, 2023, 3:07pm

Thanks. I will reach out to Simon Wood and report back.

Topic		Replies	Views
Specify grouping factor for brms kfold cross-validation brms	6	845	June 7, 2020
Cross-validation for cumulative model in brms brms	4	585	May 10, 2023
Kfold predictions on an external test set brms	2	689	October 3, 2022
Uncertainty on cross-validated prediction accuracy in a brms model Modeling specification , brms	0	171	February 12, 2024
Brms::kfold irregularly crashing with brm models using the cmdstanr backend brms	0	514	January 10, 2022

Kfold.brmsfit doesn't work when model includes t2 smoothing term

Related topics