brms::kfold crashes intermittently with brm models using the cmdstanr backend

Hi all,

When I run K-fold cross-validation (brms::kfold) on a brms::brm model that uses the cmdstanr backend, approximately 50% of attempts (i.e., kfold(fit) calls) fail with the following error (or a very similar one that differs only in the arg number, e.g., arg 2, arg 7, etc.):

Error in .fun(.x1, .x2, .x3, .x4, .x5, .x6, .x7, .x8, .x9, .x10) : 
  number of rows of matrices must match (see arg 3)

I cannot see any pattern in which attempts succeed (i.e., kfold(fit) calls that complete the CV) and which fail (i.e., kfold(fit) calls that begin to run but then crash with the above error). I ran kfold(fit) on the same model (fit) 20 times - 10 times on my local Mac and 10 times on a Linux-based server - and saw both successes and failures on both machines (success:failure ratios of 4:6 and 6:4). So the error does not seem to be system-dependent.

Please, can anyone advise on what could be causing the error and how to avoid (or address) it? Perhaps @paul.buerkner could advise on what is going on? Thanks very much! I need to perform K-fold CV for several relatively large models that are likely to run for a significant time and would like to minimise the risk of the CV process crashing due to this issue (the crash occurs near the end of the process).

I am attaching a simple, reproducible example that gives me the error. I am also copy-pasting the output from Console showing the error further below.

Many thanks for your help.

Best wishes,

Tom

R and package versions:
The Mac uses R version 4.1.0; macOS Catalina 10.15.7. Package versions: cmdstan (2.28.2); cmdstanr (0.4.0); brms (2.16.3); loo (2.4.1); future (1.23.0).
The server uses R version 4.1.2; Platform: x86_64-pc-linux-gnu (64-bit). Package versions: cmdstan (2.27.0); cmdstanr (0.4.0); brms (2.16.1); loo (2.4.1); future (1.23.0).

Reproducible example:

library(brms)
library(cmdstanr)
library(future)

# example from the brms vignette + cmdstanr backend
fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient) + (1|obs),
            data = epilepsy, family = poisson(),
            backend = "cmdstanr")

# use the future package for parallelization
#plan(multiprocess) # gives the same error
plan(multisession)  # gives the same error
kfold(fit1)
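
The repeated-attempts check described earlier (running kfold several times and counting successes and failures) could be scripted roughly as follows. This is only a minimal sketch: tally_outcomes is a hypothetical helper name, not part of brms, and it simply wraps each call in base R's tryCatch so that one failed attempt does not stop the loop.

```r
# Hypothetical helper (not part of brms): call `f()` `n` times and
# count how many calls succeed and how many raise an error.
tally_outcomes <- function(f, n = 10) {
  outcomes <- vapply(seq_len(n), function(i) {
    tryCatch({
      f()
      "success"
    }, error = function(e) {
      # Report the error message of the failed attempt and carry on
      message("Attempt ", i, " failed: ", conditionMessage(e))
      "failure"
    })
  }, character(1))
  # Fixed factor levels so both categories always appear in the table
  table(factor(outcomes, levels = c("success", "failure")))
}

# Usage with the model above (slow: every attempt refits 10 models):
# tally_outcomes(function() kfold(fit1), n = 10)
```

The function to call is passed in as an argument so that the same loop can be reused with a different plan() or a different model without editing the helper.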

Console output (here, the second call to kfold(fit1) crashed - see the bottom of the output; however, please note that sometimes it is the very first call that crashes and gives the error):

> kfold(fit1)
Fitting model 1 out of 10
Fitting model 2 out of 10
Fitting model 3 out of 10
Fitting model 4 out of 10
Fitting model 5 out of 10
Fitting model 6 out of 10
Fitting model 7 out of 10
Fitting model 8 out of 10
Fitting model 9 out of 10
Fitting model 10 out of 10
Running MCMC with 4 sequential chains...

Chain 1 finished in 5.5 seconds.
Chain 2 finished in 5.5 seconds.
Chain 3 finished in 5.8 seconds.
Chain 4 finished in 6.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 5.7 seconds.
Total execution time: 23.3 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 5.9 seconds.
Chain 2 finished in 5.7 seconds.
Chain 3 finished in 5.8 seconds.
Chain 4 finished in 5.9 seconds.

All 4 chains finished successfully.
Mean chain execution time: 5.8 seconds.
Total execution time: 23.5 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 5.8 seconds.
Chain 2 finished in 5.5 seconds.
Chain 3 finished in 5.8 seconds.
Chain 4 finished in 6.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 5.8 seconds.
Total execution time: 23.7 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 6.3 seconds.
Chain 2 finished in 6.0 seconds.
Chain 3 finished in 6.1 seconds.
Chain 4 finished in 6.4 seconds.

All 4 chains finished successfully.
Mean chain execution time: 6.2 seconds.
Total execution time: 25.4 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 6.7 seconds.
Chain 2 finished in 7.7 seconds.
Chain 3 finished in 7.9 seconds.
Chain 4 finished in 6.7 seconds.

All 4 chains finished successfully.
Mean chain execution time: 7.2 seconds.
Total execution time: 29.3 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 6.3 seconds.
Chain 2 finished in 6.6 seconds.
Chain 3 finished in 6.5 seconds.
Chain 4 finished in 6.5 seconds.

All 4 chains finished successfully.
Mean chain execution time: 6.5 seconds.
Total execution time: 26.4 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 5.4 seconds.
Chain 2 finished in 5.2 seconds.
Chain 3 finished in 5.4 seconds.
Chain 4 finished in 5.6 seconds.

All 4 chains finished successfully.
Mean chain execution time: 5.4 seconds.
Total execution time: 22.0 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 6.7 seconds.
Chain 2 finished in 7.4 seconds.
Chain 3 finished in 6.0 seconds.
Chain 4 finished in 5.9 seconds.

All 4 chains finished successfully.
Mean chain execution time: 6.5 seconds.
Total execution time: 26.3 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 5.8 seconds.
Chain 2 finished in 5.4 seconds.
Chain 3 finished in 5.5 seconds.
Chain 4 finished in 5.5 seconds.

All 4 chains finished successfully.
Mean chain execution time: 5.5 seconds.
Total execution time: 22.7 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 5.7 seconds.
Chain 2 finished in 5.8 seconds.
Chain 3 finished in 5.2 seconds.
Chain 4 finished in 5.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 5.5 seconds.
Total execution time: 22.3 seconds.
Start sampling

Based on 10-fold cross-validation

           Estimate   SE
elpd_kfold   -616.5 17.2
p_kfold       129.3 10.6
kfoldic      1233.0 34.5

> kfold(fit1)
Fitting model 1 out of 10
Fitting model 2 out of 10
Fitting model 3 out of 10
Fitting model 4 out of 10
Fitting model 5 out of 10
Fitting model 6 out of 10
Fitting model 7 out of 10
Fitting model 8 out of 10
Fitting model 9 out of 10
Fitting model 10 out of 10
Running MCMC with 4 sequential chains...

Chain 1 finished in 6.8 seconds.
Chain 2 finished in 6.4 seconds.
Chain 3 finished in 6.4 seconds.
The remaining chains had a mean execution time of 25.9 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 6.5 seconds.
Chain 2 finished in 6.2 seconds.
Chain 3 finished in 6.9 seconds.
Chain 4 finished in 6.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 6.5 seconds.
Total execution time: 26.3 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 6.0 seconds.
Chain 2 finished in 5.9 seconds.
Chain 3 finished in 6.3 seconds.
The remaining chains had a mean execution time of 24.5 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 5.6 seconds.
Chain 2 finished in 6.0 seconds.
Chain 3 finished in 5.8 seconds.
Chain 4 finished in 5.9 seconds.

All 4 chains finished successfully.
Mean chain execution time: 5.8 seconds.
Total execution time: 23.7 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 6.8 seconds.
Chain 2 finished in 5.9 seconds.
Chain 3 finished in 7.2 seconds.
Chain 4 finished in 6.6 seconds.

All 4 chains finished successfully.
Mean chain execution time: 6.6 seconds.
Total execution time: 27.0 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 8.4 seconds.
Chain 2 finished in 8.3 seconds.
Chain 3 finished in 6.2 seconds.
Chain 4 finished in 6.5 seconds.

All 4 chains finished successfully.
Mean chain execution time: 7.3 seconds.
Total execution time: 29.7 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 6.2 seconds.
Chain 2 finished in 5.7 seconds.
Chain 3 finished in 7.8 seconds.
Chain 4 finished in 6.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 6.5 seconds.
Total execution time: 26.3 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 5.6 seconds.
Chain 2 finished in 5.6 seconds.
Chain 3 finished in 5.5 seconds.
Chain 4 finished in 5.4 seconds.

All 4 chains finished successfully.
Mean chain execution time: 5.5 seconds.
Total execution time: 22.6 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 7.5 seconds.
Chain 2 finished in 6.6 seconds.
Chain 3 finished in 6.9 seconds.
Chain 4 finished in 7.8 seconds.

All 4 chains finished successfully.
Mean chain execution time: 7.2 seconds.
Total execution time: 29.2 seconds.
Start sampling
Running MCMC with 4 sequential chains...

Chain 1 finished in 6.2 seconds.
Chain 2 finished in 6.4 seconds.
Chain 3 finished in 6.2 seconds.
Chain 4 finished in 6.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 6.3 seconds.
Total execution time: 25.3 seconds.
Start sampling
Error in .fun(.x1, .x2, .x3, .x4, .x5, .x6, .x7, .x8, .x9, .x10) : 
  number of rows of matrices must match (see arg 2)