Good day,
I have been trying to run k-fold cross-validation on a brms model without success. Unfortunately, it is difficult to provide a reproducible example because my model is run on a large dataset that is private. I will try to explain the problem the best as I can and provide the code.
- The first attempt I made was to run my model and save it in an .rds file, and then run the k_fold() function from brms on a separate script. I got the exact same issue reported in the topic ["Error: Model not compiled." when using brms::kfold with cmdstanr backend].
This is the message I got:
Fitting model 1 out of 10
Fitting model 2 out of 10
Fitting model 3 out of 10
Fitting model 4 out of 10
Fitting model 5 out of 10
Fitting model 6 out of 10
Fitting model 7 out of 10
Fitting model 8 out of 10
Fitting model 9 out of 10
Fitting model 10 out of 10
Start sampling
Error: Model not compiled. Try running the compile() method first.
- I tried to circumvent problem 1 by updating brms, cmdstan, and cmdstanr to the latest versions, and then fit the model and the kfold function within the same script (same R session) as some people suggested ["Error: Model not compiled." when using brms::kfold with cmdstanr backend]. However, I now encounter a similar problem but the error message is different :
Recompiling the Stan model
Fitting model 1 out of 10
Fitting model 2 out of 10
Fitting model 3 out of 10
Fitting model 4 out of 10
Fitting model 5 out of 10
Fitting model 6 out of 10
Fitting model 7 out of 10
Fitting model 8 out of 10
Fitting model 9 out of 10
Fitting model 10 out of 10
Running MCMC with 1 chain, with 10 thread(s) per chain...
The desired updates require recompiling the model
Start sampling
Error in cmdstanr::read_cmdstan_csv(out$output_files(), variables = "", :
Assertion on 'files' failed: No file provided.
Calls: kfold ... .kfold -> <Anonymous> -> value.Future -> signalConditions
I don’t understand what is going on and what I could do to make the function work. I additionnaly encounter this warning, but I don’t know if it’s coming from kfold or maybe the future package:
In addition: Warning message:
In bins[xids] <- rep(1:K, ceiling(N/K))[1:N] :
number of items to replace is not a multiple of replacement length
Here is the code I used to run the model and the kfold function :
# ==========================================================================
# Base hunting success analysis #
# ==========================================================================
# Code to run the linear hunting success analysis
# This script was run on Calcul Canada's Cedar supercomputer
# This model quantifies the relationship between hunting behaviour and hunting success.
# Hunting success = number of prey captured.
# Detect number of cores
options(mc.cores = parallel::detectCores())
# -----------------------------------------------------------------------
# ==========================================================================
# 1. Load libraries and import the data
# ==========================================================================
# Packages -----------------------------------------------------------------
library(data.table)
library(brms)
library(cmdstanr)
# Import the data ----------------------------------------------------------
# Folder path Compute Canada
folder <- file.path("/home", "maxime11", "projects", "def-monti",
"maxime11", "phd_project", "data")
# Import the data
data <- fread(file.path(folder, "merged-data2021.csv"),
select = c("player_id", "match_id",
"hunting_success", "map_name",
"game_duration", "speed",
"space_covered_rate",
"prox_mid_PreyGuarding",
"hook_start_time"),
stringsAsFactors = TRUE)
# Add observation-level random effect
data$obs <- 1:nrow(data)
# ==========================================================================
# ==========================================================================
# ==========================================================================
# 2. Prepare variables for the model
# ==========================================================================
# Transform ----------------------------------------------------------------
data[, ":=" (prox_mid_PreyGuarding = log(prox_mid_PreyGuarding + 1),
hook_start_time = log(hook_start_time + 1),
game_duration = sqrt(game_duration))]
# Standardise the variables (Z-scores) -------------------------------------
standardize <- function (x) {(x - mean(x, na.rm = TRUE)) /
sd(x, na.rm = TRUE)}
data[, c("Zgame_duration", "Zspeed",
"Zspace_covered_rate", "Zprox_mid_PreyGuarding",
"Zhook_start_time") :=
lapply(.SD, standardize),
.SDcols = c(5:9)]
# ==========================================================================
# ==========================================================================
# ==========================================================================
# 3. Build the model
# ==========================================================================
# Set priors ---------------------------------------------------------------
priors <- c(
# priors on fixed effects
set_prior("normal(0, 2)",
class = "b"),
# priors on var. parameters (brms automatically detects half-normal)
set_prior("normal(0, 1)",
class = "sd") # applies to all variance parameters
)
# linear model formula -----------------------------------------------------
model_formula <- brmsformula(hunting_success | trials(4) ~
Zspeed +
Zspace_covered_rate +
Zprox_mid_PreyGuarding +
Zhook_start_time +
Zgame_duration +
(1 | map_name) +
(1 | player_id) +
(1 | obs))
# ==========================================================================
# ==========================================================================
# ==========================================================================
# 4. Run the model
# ==========================================================================
# Model specifications -----------------------------------------------------
base_model <- brm(formula = model_formula,
family = binomial(link = "logit"),
warmup = 1000,
iter = 3000,
thin = 8,
chains = 4,
inits = "0",
threads = threading(10),
backend = "cmdstanr",
seed = 123,
prior = priors,
control = list(adapt_delta = 0.95),
save_pars = save_pars(all = TRUE),
sample_prior = TRUE,
data = data)
# Save the model object ----------------------------------------------------
saveRDS(base_model, file = "03B_hunting_success_base-model1small.rds")
# Capture the session ------------------------------------------------------
session <- sessionInfo()
capture.output(session, file = "session-hunting_success-models_small.txt")
# ==========================================================================
# ==========================================================================
# =======================================================================
# 5. Perform the k-fold cross-validation
# =======================================================================
# Load the future package to parallelize the folds
library(future)
plan(multicore)
options(future.globals.maxSize = 4000000000)
cv_object <- kfold(base_model,
K = 10,
Ksub = NULL,
folds = "stratified",
group = "player_id",
chains = 1)
saveRDS(cv_object, file = "kfoldcv-base_model1small.rds")
# =======================================================================
# =======================================================================
Here is my session info :
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/imkl/2020.1.217/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] cmdstanr_0.4.0 brms_2.16.3 Rcpp_1.0.5 data.table_1.13.4
loaded via a namespace (and not attached):
[1] nlme_3.1-151 matrixStats_0.57.0 xts_0.12.1
[4] threejs_0.3.3 rstan_2.21.2 tensorA_0.36.2
[7] tools_4.0.2 backports_1.2.1 R6_2.5.0
[10] DT_0.16 mgcv_1.8-33 DBI_1.1.0
[13] projpred_2.0.2 colorspace_2.0-0 withr_2.3.0
[16] tidyselect_1.1.0 gridExtra_2.3 prettyunits_1.1.1
[19] processx_3.4.5 Brobdingnag_1.2-6 curl_4.3
[22] compiler_4.0.2 cli_2.2.0 shinyjs_2.0.0
[25] colourpicker_1.1.0 posterior_1.1.0 scales_1.1.1
[28] dygraphs_1.1.1.6 checkmate_2.0.0 mvtnorm_1.1-1
[31] ggridges_0.5.2 callr_3.5.1 stringr_1.4.0
[34] digest_0.6.27 StanHeaders_2.21.0-7 minqa_1.2.4
[37] base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.5.0
[40] lme4_1.1-26 fastmap_1.0.1 htmlwidgets_1.5.3
[43] rlang_0.4.9 shiny_1.5.0 farver_2.0.3
[46] generics_0.1.0 zoo_1.8-8 jsonlite_1.7.2
[49] crosstalk_1.1.0.1 gtools_3.8.2 dplyr_1.0.2
[52] distributional_0.2.2 inline_0.3.17 magrittr_2.0.1
[55] loo_2.4.1 bayesplot_1.8.0 Matrix_1.2-18
[58] munsell_0.5.0 fansi_0.4.1 abind_1.4-5
[61] lifecycle_0.2.0 stringi_1.5.3 MASS_7.3-53
[64] pkgbuild_1.2.0 plyr_1.8.6 grid_4.0.2
[67] parallel_4.0.2 promises_1.1.1 crayon_1.3.4
[70] miniUI_0.1.1.1 lattice_0.20-41 splines_4.0.2
[73] knitr_1.30 ps_1.5.0 pillar_1.4.7
[76] igraph_1.2.6 boot_1.3-25 markdown_1.1
[79] shinystan_2.5.0 reshape2_1.4.4 codetools_0.2-18
[82] stats4_4.0.2 rstantools_2.1.1 glue_1.4.2
[85] V8_3.4.0 RcppParallel_5.0.2 nloptr_1.2.2.2
[88] vctrs_0.3.5 httpuv_1.5.4 gtable_0.3.0
[91] purrr_0.3.4 assertthat_0.2.1 ggplot2_3.3.2
[94] xfun_0.19 mime_0.9 xtable_1.8-4
[97] coda_0.19-4 later_1.1.0.1 rsconnect_0.8.17
[100] tibble_3.0.4 shinythemes_1.2.0 gamm4_0.2-6
[103] statmod_1.4.35 ellipsis_0.3.1 bridgesampling_1.1-2
I hope we can find a solution to the problem and thank you very much for your help.
Maxime