Multiple attemps to run k-fold cross-validation fail with brms

Good day,

I have been trying to run k-fold cross-validation on a brms model without success. Unfortunately, it is difficult to provide a reproducible example because my model is run on a large dataset that is private. I will try to explain the problem the best as I can and provide the code.

  1. The first attempt I made was to run my model and save it in an .rds file, and then run the k_fold() function from brms on a separate script. I got the exact same issue reported in the topic ["Error: Model not compiled." when using brms::kfold with cmdstanr backend].
    This is the message I got:
Fitting model 1 out of 10
Fitting model 2 out of 10
Fitting model 3 out of 10
Fitting model 4 out of 10
Fitting model 5 out of 10
Fitting model 6 out of 10
Fitting model 7 out of 10
Fitting model 8 out of 10
Fitting model 9 out of 10
Fitting model 10 out of 10
Start sampling
Error: Model not compiled. Try running the compile() method first.
  1. I tried to circumvent problem 1 by updating brms, cmdstan, and cmdstanr to the latest versions, and then fit the model and the kfold function within the same script (same R session) as some people suggested ["Error: Model not compiled." when using brms::kfold with cmdstanr backend]. However, I now encounter a similar problem but the error message is different :
Recompiling the Stan model
Fitting model 1 out of 10
Fitting model 2 out of 10
Fitting model 3 out of 10
Fitting model 4 out of 10
Fitting model 5 out of 10
Fitting model 6 out of 10
Fitting model 7 out of 10
Fitting model 8 out of 10
Fitting model 9 out of 10
Fitting model 10 out of 10
Running MCMC with 1 chain, with 10 thread(s) per chain...

The desired updates require recompiling the model
Start sampling
Error in cmdstanr::read_cmdstan_csv(out$output_files(), variables = "",  :
  Assertion on 'files' failed: No file provided.
Calls: kfold ... .kfold -> <Anonymous> -> value.Future -> signalConditions

I don’t understand what is going on and what I could do to make the function work. I additionnaly encounter this warning, but I don’t know if it’s coming from kfold or maybe the future package:

In addition: Warning message:
In bins[xids] <- rep(1:K, ceiling(N/K))[1:N] :
  number of items to replace is not a multiple of replacement length

Here is the code I used to run the model and the kfold function :

# ==========================================================================

#                       Base hunting success analysis                      #

# ==========================================================================

# Code to run the linear hunting success analysis
# This script was run on Calcul Canada's Cedar supercomputer

# This model quantifies the relationship between hunting behaviour and hunting success. 
# Hunting success = number of prey captured.

# Detect number of cores
options(mc.cores = parallel::detectCores())
# -----------------------------------------------------------------------





# ==========================================================================
# 1. Load libraries and import the data 
# ==========================================================================


# Packages -----------------------------------------------------------------

library(data.table)
library(brms)
library(cmdstanr)



# Import the data ----------------------------------------------------------

# Folder path Compute Canada
folder <- file.path("/home", "maxime11", "projects", "def-monti", 
                    "maxime11", "phd_project", "data")

# Import the data
data <- fread(file.path(folder, "merged-data2021.csv"),
              select = c("player_id", "match_id",
                         "hunting_success", "map_name", 
                         "game_duration", "speed",
                         "space_covered_rate",
                         "prox_mid_PreyGuarding",
                         "hook_start_time"),
                         stringsAsFactors = TRUE)

# Add observation-level random effect
data$obs <- 1:nrow(data)

# ==========================================================================
# ==========================================================================





# ==========================================================================
# 2. Prepare variables for the model
# ==========================================================================


# Transform ----------------------------------------------------------------

data[, ":=" (prox_mid_PreyGuarding = log(prox_mid_PreyGuarding + 1),
             hook_start_time = log(hook_start_time + 1),
             game_duration = sqrt(game_duration))]



# Standardise the variables (Z-scores) -------------------------------------

standardize <- function (x) {(x - mean(x, na.rm = TRUE)) / 
                              sd(x, na.rm = TRUE)}

data[, c("Zgame_duration", "Zspeed",
         "Zspace_covered_rate", "Zprox_mid_PreyGuarding",
         "Zhook_start_time") :=
                lapply(.SD, standardize), 
                .SDcols = c(5:9)]

# ==========================================================================
# ==========================================================================





# ==========================================================================
# 3. Build the model
# ==========================================================================


# Set priors ---------------------------------------------------------------

priors <- c(
  # priors on fixed effects
  set_prior("normal(0, 2)",
            class = "b"),
  # priors on var. parameters (brms automatically detects half-normal)
  set_prior("normal(0, 1)",
            class = "sd") # applies to all variance parameters
            )



# linear model formula -----------------------------------------------------

model_formula <- brmsformula(hunting_success | trials(4) ~
                                        Zspeed +
                                        Zspace_covered_rate +
                                        Zprox_mid_PreyGuarding +
                                        Zhook_start_time +
                                        Zgame_duration +
                                        (1 | map_name) +
                                        (1 | player_id) +
                                        (1 | obs))

# ==========================================================================
# ==========================================================================





# ==========================================================================
# 4. Run the model
# ==========================================================================


# Model specifications -----------------------------------------------------

base_model <- brm(formula = model_formula,
                  family = binomial(link = "logit"),
                  warmup = 1000, 
                  iter = 3000,
                  thin = 8,
                  chains = 4, 
                  inits = "0", 
                  threads = threading(10),
                  backend = "cmdstanr",
                  seed = 123,
                  prior = priors,
                  control = list(adapt_delta = 0.95),
                  save_pars = save_pars(all = TRUE),
                  sample_prior = TRUE,
                  data = data)



# Save the model object ----------------------------------------------------

saveRDS(base_model, file = "03B_hunting_success_base-model1small.rds")



# Capture the session ------------------------------------------------------

session <- sessionInfo()
capture.output(session, file = "session-hunting_success-models_small.txt")

# ==========================================================================
# ==========================================================================





# =======================================================================
# 5. Perform the k-fold cross-validation
# =======================================================================

# Load the future package to parallelize the folds
library(future)
plan(multicore)

options(future.globals.maxSize = 4000000000)

cv_object <- kfold(base_model,
                   K = 10,
                   Ksub = NULL,
                   folds = "stratified",
                   group = "player_id",
                   chains = 1)

saveRDS(cv_object, file = "kfoldcv-base_model1small.rds")

# =======================================================================
# =======================================================================

Here is my session info :

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/imkl/2020.1.217/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] cmdstanr_0.4.0    brms_2.16.3       Rcpp_1.0.5        data.table_1.13.4

loaded via a namespace (and not attached):
  [1] nlme_3.1-151         matrixStats_0.57.0   xts_0.12.1
  [4] threejs_0.3.3        rstan_2.21.2         tensorA_0.36.2
  [7] tools_4.0.2          backports_1.2.1      R6_2.5.0
 [10] DT_0.16              mgcv_1.8-33          DBI_1.1.0
 [13] projpred_2.0.2       colorspace_2.0-0     withr_2.3.0
 [16] tidyselect_1.1.0     gridExtra_2.3        prettyunits_1.1.1
 [19] processx_3.4.5       Brobdingnag_1.2-6    curl_4.3
 [22] compiler_4.0.2       cli_2.2.0            shinyjs_2.0.0
 [25] colourpicker_1.1.0   posterior_1.1.0      scales_1.1.1
 [28] dygraphs_1.1.1.6     checkmate_2.0.0      mvtnorm_1.1-1
 [31] ggridges_0.5.2       callr_3.5.1          stringr_1.4.0
 [34] digest_0.6.27        StanHeaders_2.21.0-7 minqa_1.2.4
 [37] base64enc_0.1-3      pkgconfig_2.0.3      htmltools_0.5.0
 [40] lme4_1.1-26          fastmap_1.0.1        htmlwidgets_1.5.3
 [43] rlang_0.4.9          shiny_1.5.0          farver_2.0.3
 [46] generics_0.1.0       zoo_1.8-8            jsonlite_1.7.2
 [49] crosstalk_1.1.0.1    gtools_3.8.2         dplyr_1.0.2
 [52] distributional_0.2.2 inline_0.3.17        magrittr_2.0.1
 [55] loo_2.4.1            bayesplot_1.8.0      Matrix_1.2-18
 [58] munsell_0.5.0        fansi_0.4.1          abind_1.4-5
 [61] lifecycle_0.2.0      stringi_1.5.3        MASS_7.3-53
 [64] pkgbuild_1.2.0       plyr_1.8.6           grid_4.0.2
 [67] parallel_4.0.2       promises_1.1.1       crayon_1.3.4
 [70] miniUI_0.1.1.1       lattice_0.20-41      splines_4.0.2
 [73] knitr_1.30           ps_1.5.0             pillar_1.4.7
 [76] igraph_1.2.6         boot_1.3-25          markdown_1.1
 [79] shinystan_2.5.0      reshape2_1.4.4       codetools_0.2-18
 [82] stats4_4.0.2         rstantools_2.1.1     glue_1.4.2
 [85] V8_3.4.0             RcppParallel_5.0.2   nloptr_1.2.2.2
 [88] vctrs_0.3.5          httpuv_1.5.4         gtable_0.3.0
 [91] purrr_0.3.4          assertthat_0.2.1     ggplot2_3.3.2
 [94] xfun_0.19            mime_0.9             xtable_1.8-4
 [97] coda_0.19-4          later_1.1.0.1        rsconnect_0.8.17
[100] tibble_3.0.4         shinythemes_1.2.0    gamm4_0.2-6
[103] statmod_1.4.35       ellipsis_0.3.1       bridgesampling_1.1-2

I hope we can find a solution to the problem and thank you very much for your help.

Maxime

1 Like