I am having trouble implementing k-fold cross-validation on a brms models with t2() smoothing terms that include 2 or 3 predictors. This does not happen for models with s() smoother terms with a single predictor inside, or models with linear terms for the 2-3 predictors. This code was working in September when brms was on version 2.17.0 and mgcv was on version 1.8-40, but now, with brms 2.18.0 and mgcv 1.8-41, the kfold will fail with models that include t2 terms. The brmsfit model itself works just fine.
I have data on the average standardized test scores across a number of schools, disaggregated by race. Each school has data for the average score of 1, 2, or 3 races. I’m interested in whether using a large number of school-level predictors improves our predictions of disparities in test scores as opposed to a simple univariate predictor (FRPL-the proportion of students eligible for free or reduced-price lunch, a common metric of economic need). I have used nonmetric multidimensional scaling (NMDS) to come up with 2 or 3 dimensional ordinations on a set of 12 predictors. I then use either FRPL, or the first 1, 2 or 3 ordination scores as predictors of test score, with race as a covariate, and a random effect of school.
library(brms)
library(loo)
library(future)
SATscores <- read.csv('SATscores.csv', stringsAsFactors = TRUE)
SATscores$School.ID <- as.factor(SATscores$School.ID)
#Fit Models
#Linear, 2 predictors
NMDS2lin <- brm(bf(SATMathAvg ~ (MDS1.Math + MDS2.Math)*Race + (1 | School.ID)), data = SATscores, family = gaussian(), save_pars = save_pars(all = TRUE), cores = 8)
#s smoother, 1 predictor
NMDS1s <- brm(bf(SATMathAvg ~ Race + s(MDS1.Math, m=2) + s(MDS1.Math, by = Race, m=1, bs="fs") + (1|School.ID)), data = SATscores, family = gaussian(), save_pars = save_pars(all = TRUE), iter = 4000, cores = 8, control = list(adapt_delta = 0.99))
#t2 smoother, 2 predictors
NMDS2t2 <- brm(bf(SATMathAvg ~ Race + t2(MDS1.Math, MDS2.Math, m=2) + t2(MDS1.Math, MDS2.Math, by = Race, m=1, bs="fs") + (1|School.ID)), data = SATscores, family = gaussian(), save_pars = save_pars(all = TRUE), iter = 4000, cores = 8, control = list(adapt_delta = 0.975))
#Kfold
#Linear, 2 predictors
NMDS2linsplit <- kfold_split_grouped(K = 10, x = NMDS2lin$data$School.ID)
plan(multisession, workers = 10)
NMDS2linkfold <- kfold(NMDS2lin, folds = NMDS2linsplit)
#s, 1 predictor
NMDS1ssplit <- kfold_split_grouped(K = 10, x = NMDS1s$data$School.ID)
plan(multisession, workers = 10)
NMDS1skfold <- kfold(NMDS2lin, folds = NMDS1ssplit)
#t2, 2 predictors
NMDS2t2split <- kfold_split_grouped(K = 10, x = NMDS2t2$data$School.ID)
plan(multisession, workers = 10)
NMDS2t2kfold <- kfold(NMDS2t2, folds = NMDS2t2split)
traceback()
Here is the error message, and the results of traceback()
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 't': the dims contain missing values
In addition: Warning messages:
1: Rows containing NAs were excluded from the model.
2: Rows containing NAs were excluded from the model.
3: Rows containing NAs were excluded from the model.
> traceback()
25: stop(condition)
24: signalConditions(future, exclude = getOption("future.relay.immediate",
"immediateCondition"), resignal = TRUE)
23: value.Future(futures[[ks]])
22: future::value(futures[[ks]])
21: .kfold(x = .x1, newdata = .x2, resp = .x3, model_name = .x4,
K = .x5, Ksub = .x6, folds = .x7, group = .x8, save_fits = .x9,
future_args = .x10)
20: eval(expr, envir, ...)
19: eval(expr, envir, ...)
18: eval2(call, envir = args, enclos = envir)
17: do_call(paste0(".", criterion), args)
16: .fun(criterion = .x1, K = .x2, Ksub = .x3, folds = .x4, group = .x5,
resp = .x6, save_fits = .x7, future_args = .x8, x = .x9,
model_name = .x10, use_stored = .x11)
15: eval(expr, envir, ...)
14: eval(expr, envir, ...)
13: eval2(call, envir = args, enclos = envir)
12: do_call(compute_loo, args)
11: .fun(models = .x1, criterion = .x2, K = .x3, Ksub = .x4, folds = .x5,
group = .x6, compare = .x7, resp = .x8, save_fits = .x9,
future_args = .x10, use_stored = .x11)
10: eval(expr, envir, ...)
9: eval(expr, envir, ...)
8: eval2(call, envir = args, enclos = envir)
7: do_call(compute_loolist, args)
6: kfold.brmsfit(NMDS2t2, folds = NMDS2t2split)
5: kfold(NMDS2t2, folds = NMDS2t2split) at Rep_Ex_SAT_Scores.R#32
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("~/Library/CloudStorage/OneDrive-UniversityofConnecticut/Kfold problem rep ex/Rep_Ex_SAT_Scores.R",
echo = TRUE)
SATscores.csv (42.7 KB)
My best guess is that some object name from t2 is not being properly passed to kfold.brmsfit but I’m not sure, and I certainly could have made a different mistake. Thanks for your help!
Operating System: macOS 13.1
Interface Version: R version 4.2.2, brms_2.18.0
Compiler/Toolkit: