Backend Errors when fitting many models in parallel

scholz · March 17, 2022, 10:41am

I am currently working on a simulation study that involves fitting millions of models. I initially ran into the problem that too many tmp files were being created, so I switched to building a list of precompiled models and just updating them. Given the scope of the study, I also try to run it on as many cores as possible.
During my first full test run, I started getting errors that point towards problems with the tmp files again:

When using rstan, the error message is: task 1 failed - "Failed to initialize module pointer: Error in FUN(X[[i]], ...): no such symbol _rcpp_module_boot_stan_fit4model300d93a0267b0__mod in package /tmp/RtmptFj6uW/file300d953583179.so
When using cmdstanr, I only get a system command 'stanc' failed, sterr empty
I am using brms as my frontend.
Below is a summary of the workflow of the code; the full code would be too long to paste here. The repository is open though.

I suspect that all the processes write their tmp files into the same folder and trigger an early cleanup, or something along those lines, as I wasn’t able to reproduce the error when running the simulation on a single process.

Any help in regards to getting the multiprocessing to work would be greatly appreciated.

# Cluster Setup
cluster <- parallel::makeCluster(ncores, type = "PSOCK")
doParallel::registerDoParallel(cluster)
on.exit({ 
  try({
    doParallel::stopImplicitCluster()
    parallel::stopCluster(cluster)
  })
})
parallel::clusterEvalQ(cl = cluster, {
  library(brms)
  library(bayesim)
})

...

# Run a process for each seed that will be used to generate a dataset
# for the given configurations
`%dopar%` <- foreach::`%dopar%`
results <- foreach::foreach(
  parallel_seed = seed_list
) %dopar% {
  dataset_sim(
    data_geneneration_configuration = data_geneneration_configuration,
    fit_configurations = fit_configurations,
    prefits = prefits,
    numeric_metrics,
    predictive_metrics,
    seed = parallel_seed
  )
}

...
# Generate the dataset, and loop over all fit configurations to be fitted and
# get metrics calculated. Use the fitting prefit object to prevent recompilation
...

  fit <- stats::update(prefit,
    newdata = dataset,
    formula. = brms::brmsformula(fit_configuration$formula),
    refresh = 0,
    silent = 2,
    warmup = 500,
    iter = 2500,
    chains = 2,
    backend = "cmdstanr",
    seed = seed,
    init = 0.1
  )

sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libopenblasp-r0.3.20.so
LAPACK: /usr/lib/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] profvis_0.3.7       dplyr_1.0.8         gridExtra_2.3       brms_2.16.9         Rcpp_1.0.8          ggdag_0.2.4         bayesim_0.21.1.9000

loaded via a namespace (and not attached):
  [1] TH.data_1.1-0        minqa_1.2.4          colorspace_2.0-3     ellipsis_0.3.2       ggridges_0.5.3       estimability_1.3     markdown_1.1         base64enc_0.1-3     
  [9] farver_2.1.0         rstan_2.26.6         DT_0.21              fansi_1.0.2          mvtnorm_1.1-3        bridgesampling_1.1-2 codetools_0.2-18     splines_4.1.3       
 [17] doParallel_1.0.17    knitr_1.37           shinythemes_1.2.0    bayesplot_1.8.1      projpred_2.0.2       jsonlite_1.8.0       nloptr_2.0.0         shiny_1.7.1         
 [25] compiler_4.1.3       emmeans_1.7.2        backports_1.4.1      assertthat_0.2.1     Matrix_1.4-0         fastmap_1.1.0        cli_3.2.0            later_1.3.0         
 [33] htmltools_0.5.2      prettyunits_1.1.1    tools_4.1.3          igraph_1.2.11        coda_0.19-4          gtable_0.3.0         glue_1.6.2           reshape2_1.4.4      
 [41] posterior_1.2.0      V8_4.1.0             vctrs_0.3.8          nlme_3.1-155         iterators_1.0.14     crosstalk_1.2.0      tensorA_0.36.2       xfun_0.29           
 [49] stringr_1.4.0        ps_1.6.0             lme4_1.1-28          mime_0.12            miniUI_0.1.1.1       lifecycle_1.0.1      gtools_3.9.2         MASS_7.3-55         
 [57] zoo_1.8-9            scales_1.1.1         tidygraph_1.2.0      colourpicker_1.1.1   promises_1.2.0.1     Brobdingnag_1.2-7    parallel_4.1.3       sandwich_3.0-1      
 [65] inline_0.3.19        shinystan_2.6.0      gamm4_0.2-6          yaml_2.3.5           curl_4.3.2           ggplot2_3.3.5        loo_2.4.1            StanHeaders_2.26.6  
 [73] stringi_1.7.6        dygraphs_1.1.1.6     foreach_1.5.2        checkmate_2.0.0      boot_1.3-28          pkgbuild_1.3.1       cmdstanr_0.4.0       rlang_1.0.1         
 [81] pkgconfig_2.0.3      matrixStats_0.61.0   distributional_0.3.0 evaluate_0.15        lattice_0.20-45      purrr_0.3.4          rstantools_2.1.1     htmlwidgets_1.5.4   
 [89] processx_3.5.2       tidyselect_1.1.2     plyr_1.8.6           magrittr_2.0.2       bookdown_0.24        R6_2.5.1             generics_0.1.2       multcomp_1.4-18     
 [97] DBI_1.1.2            pillar_1.7.0         mgcv_1.8-39          xts_0.12.1           survival_3.2-13      abind_1.4-5          tibble_3.1.6         crayon_1.5.0        
[105] utf8_1.2.2           rmarkdown_2.12       grid_4.1.3           callr_3.7.0          threejs_0.3.3        digest_0.6.29        xtable_1.8-4         tidyr_1.2.0         
[113] httpuv_1.6.5         RcppParallel_5.1.5   stats4_4.1.3         munsell_0.5.0        shinyjs_2.1.0

scholz · March 18, 2022, 9:19am

Small update:
I tried separating the compilation of the prefits from the rest of the simulation, which led me to open and close one cluster for the prefits and another for the rest. I noticed that the tmp folder holding the Stan files of the prefit objects was emptied when the prefit cluster was closed.
In the original code, the cluster was never closed between compiling the prefits and running the rest of the simulation. I therefore suspect that, for some reason, the tmp folder is emptied when the cluster is already open before the prefits are compiled, even if that compilation runs in the main R session.

I was able to solve the problem (for now, and without understanding the exact mechanism at work) by compiling the prefits before opening a cluster and only afterwards starting the whole multiprocessing part.
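
One mechanism that would be consistent with the folder being emptied when the prefit cluster was closed (an assumption, not something confirmed in this thread): every R session, including each PSOCK worker, has its own session temp directory, and R removes that directory when the session exits normally. The sketch below uses only base R and the parallel package to illustrate this; the `.so` marker file is a hypothetical stand-in for a compiled model and nothing here touches brms or Stan.

```r
library(parallel)

cl <- makeCluster(2, type = "PSOCK")

# Each PSOCK worker is a separate R process with its own session temp directory.
main_tmp <- tempdir()
worker_tmp <- unlist(clusterEvalQ(cl, tempdir()))

# Create a marker file in each worker's temp directory
# (hypothetical stand-in for a compiled model file).
worker_files <- unlist(clusterEvalQ(cl, {
  f <- tempfile(fileext = ".so")
  file.create(f)
  f
}))
exists_while_open <- file.exists(worker_files)

stopCluster(cl)

# Workers shut down asynchronously, so poll for a few seconds
# before checking whether their temp directories are gone.
deadline <- Sys.time() + 10
while (any(dir.exists(worker_tmp)) && Sys.time() < deadline) {
  Sys.sleep(0.1)
}
exists_after_close <- file.exists(worker_files)

print(data.frame(worker_tmp, exists_while_open, exists_after_close))
```

If anything compiled on a worker lives only in that worker's temp directory, it would disappear in the same way once the worker exits, which is why compiling in the main session (or into a permanent folder) avoids the issue.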

mitzimorris · March 21, 2022, 5:36pm

Why don’t you compile the model once and pass the exe around everywhere?
I did this for a slightly different use case using CmdStanPy and Slurm.

scholz · March 21, 2022, 6:07pm

That is what I was trying to do (if I understand correctly). I precompile all models and then use the fit objects to update them with the respective data and formulas I need.
But I run into the problem that the exe files are lost at some point because something clears the tmp folder. So yes, I can put the exe files somewhere permanent, but it still feels unexpected that the tmp folder is cleared for no reason apparent to me.

rok_cesnovar · March 21, 2022, 6:56pm

I am not sure why the temporary folder is cleared. Nothing in cmdstanr or rstan does that.

In this case, I think the problem is that brms writes its model to the temporary folder instead of some non-tmp folder. You can use cmdstanr’s options to change that, as it is not exposed in brms.
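
A minimal sketch of what that could look like, assuming the cmdstanr global option `cmdstanr_write_stan_file_dir` is available in the installed version (worth checking against `?cmdstanr_global_options`) and that brms writes its Stan file through cmdstanr so the option is picked up. The directory name `stan_models` is hypothetical.

```r
# Hypothetical directory name; any location outside tempdir() should work.
model_dir <- file.path(getwd(), "stan_models")
dir.create(model_dir, showWarnings = FALSE, recursive = TRUE)

# Assumption: cmdstanr consults this option when writing a Stan file
# without an explicit directory, so the compiled exe ends up next to it.
options(cmdstanr_write_stan_file_dir = model_dir)

print(getOption("cmdstanr_write_stan_file_dir"))
```

The option would need to be set before the prefits are compiled, and in every R process that compiles a model (for example inside the `parallel::clusterEvalQ()` call that loads the packages on the workers).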
