Parallelizing in Windows 11 is impossibly slow

Hello to everyone!

I used to run all my RStan code on a cluster to allow within-chain parallelization, since my machine had only two physical cores. I just got a new Lenovo machine, which should allow much better parallelization since it has 14 cores. Unfortunately, the code turns out to be hideously slow, and I don’t understand why. Also, looking at core usage, many of the cores seem to be used only very lightly.

Is there any explanation and/or solution for this? I have read somewhere that Windows can be slower for parallelization, but this is not even comparable: we are talking many times slower. I wonder whether it is even faster than just running on 1 core. It would seem stupid to have multiple cores on a Windows machine if this were the situation with parallelization in Windows.

Best,
Luca

Enabling threading on Windows currently relies on a library called libwinpthreads, which has some very well-documented performance issues, particularly with thread-local storage, which Stan relies on for autodiff in multithreaded environments.

We have often observed a model sampling roughly twice as slowly on Windows when threading is enabled.

You can still use the increased resources available to you by running each chain in its own isolated process (this is probably easier with cmdstan than rstan?). Also, just to check, when you say within-chain parallelization, are you using the explicit features in Stan for this (e.g. reduce_sum, map_rect)? Not much of Stan supports threading without these opt-in features.
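To make the "one process per chain" suggestion concrete, here is a minimal sketch, assuming the cmdstanr package and a working CmdStan installation. The helper names `chains_in_parallel` and `fit_process_per_chain`, and the `stan_file` and `data` arguments, are hypothetical and only for illustration. CmdStan runs each chain as its own process, so this route does not depend on within-chain threading.

```r
# Hypothetical helper: how many chains to run at the same time,
# capped by the number of physical cores on the machine.
chains_in_parallel <- function(n_chains,
                               n_cores = parallel::detectCores(logical = FALSE)) {
  # detectCores() can return NA on some platforms; fall back to 1.
  if (is.na(n_cores)) n_cores <- 1L
  max(1L, min(as.integer(n_chains), as.integer(n_cores)))
}

# Hypothetical wrapper: compile a model and run each chain in its own
# process. Nothing is compiled or sampled until this function is called.
fit_process_per_chain <- function(stan_file, data, n_chains = 4L) {
  mod <- cmdstanr::cmdstan_model(stan_file)
  mod$sample(
    data = data,
    chains = n_chains,
    parallel_chains = chains_in_parallel(n_chains)
  )
}
```

Note that this speeds up the total run across several chains; it does not make any single chain faster.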

Hi!

First of all, thanks for the detailed answer; yes, I am currently using reduce_sum to parallelize within chain. If I understand correctly, you are suggesting that I use my machine to run multiple chains, each on its own processor. In that case I would not get a true computational boost in performance, correct?

Reading around, it seems that another suggestion would be to run RStudio Server on WSL. Do you know if this actually gives a computational boost? Due to some features of the model under consideration, I would prefer a single long chain to avoid some tedious post-processing.

Luca

Yes, last time I checked, WSL avoids these issues with multithreaded performance. If that is an option for you, I would recommend giving it a try.
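For the single-long-chain case, a rough sketch of what a threaded run could look like from R inside WSL, assuming cmdstanr and CmdStan are installed there and the model already uses reduce_sum. The function name `fit_single_threaded_chain` and its arguments `stan_file`, `data`, and `n_threads` are hypothetical.

```r
# Hypothetical wrapper: one chain, with within-chain threading enabled.
# Nothing is compiled or sampled until this function is called.
fit_single_threaded_chain <- function(stan_file, data, n_threads) {
  # Threading has to be switched on at compile time.
  mod <- cmdstanr::cmdstan_model(
    stan_file,
    cpp_options = list(stan_threads = TRUE)
  )
  # A single chain whose reduce_sum work is spread over n_threads threads.
  mod$sample(
    data = data,
    chains = 1,
    threads_per_chain = n_threads
  )
}
```

Whether this beats a single-threaded chain depends on the model and on the grain size passed to reduce_sum, so it is worth timing both on a short run first.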

Hi @luchinoprince,

Not sure if this will help, but I wanted to share: have you set options(mc.cores = parallel::detectCores())?

I noticed Stan runs much faster without it on my Windows machine. It is helpful on my Linux machine, but not on Windows. See the example below. I fit a simple model twice on my Windows machine without using my 16 cores, then twice again with mc.cores set. The first two runs (0.56 and 0.84 sec) were much faster than the last two (15.94 and 16.89 sec).

library(rstanarm)
#> Loading required package: Rcpp
#> This is rstanarm version 2.26.1
#> - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
#> - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#>   options(mc.cores = parallel::detectCores())
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tictoc::tic()
stan_glm(mpg ~ wt, data = mtcars, refresh = 0)
#> stan_glm
#>  family:       gaussian [identity]
#>  formula:      mpg ~ wt
#>  observations: 32
#>  predictors:   2
#> ------
#>             Median MAD_SD
#> (Intercept) 37.3    2.0  
#> wt          -5.3    0.6  
#> 
#> Auxiliary parameter(s):
#>       Median MAD_SD
#> sigma 3.1    0.4   
#> 
#> ------
#> * For help interpreting the printed output see ?print.stanreg
#> * For info on the priors used see ?prior_summary.stanreg
tictoc::toc()
#> 0.56 sec elapsed

tictoc::tic()
stan_glm(mpg ~ wt, data = mtcars, refresh = 0)
#> stan_glm
#>  family:       gaussian [identity]
#>  formula:      mpg ~ wt
#>  observations: 32
#>  predictors:   2
#> ------
#>             Median MAD_SD
#> (Intercept) 37.3    2.0  
#> wt          -5.4    0.6  
#> 
#> Auxiliary parameter(s):
#>       Median MAD_SD
#> sigma 3.1    0.4   
#> 
#> ------
#> * For help interpreting the printed output see ?print.stanreg
#> * For info on the priors used see ?prior_summary.stanreg
tictoc::toc()
#> 0.84 sec elapsed

options(mc.cores = parallel::detectCores())

tictoc::tic()
stan_glm(mpg ~ wt, data = mtcars, refresh = 0)
#> stan_glm
#>  family:       gaussian [identity]
#>  formula:      mpg ~ wt
#>  observations: 32
#>  predictors:   2
#> ------
#>             Median MAD_SD
#> (Intercept) 37.3    2.0  
#> wt          -5.3    0.6  
#> 
#> Auxiliary parameter(s):
#>       Median MAD_SD
#> sigma 3.1    0.4   
#> 
#> ------
#> * For help interpreting the printed output see ?print.stanreg
#> * For info on the priors used see ?prior_summary.stanreg
tictoc::toc()
#> 15.94 sec elapsed

tictoc::tic()
stan_glm(mpg ~ wt, data = mtcars, refresh = 0)
#> stan_glm
#>  family:       gaussian [identity]
#>  formula:      mpg ~ wt
#>  observations: 32
#>  predictors:   2
#> ------
#>             Median MAD_SD
#> (Intercept) 37.2    2.0  
#> wt          -5.3    0.6  
#> 
#> Auxiliary parameter(s):
#>       Median MAD_SD
#> sigma 3.1    0.4   
#> 
#> ------
#> * For help interpreting the printed output see ?print.stanreg
#> * For info on the priors used see ?prior_summary.stanreg
tictoc::toc()
#> 16.89 sec elapsed

parallel::detectCores()
#> [1] 16

sessionInfo()
#> R version 4.2.3 (2023-03-15 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.2     rstanarm_2.26.1 Rcpp_1.0.11    
#> 
#> loaded via a namespace (and not attached):
#>   [1] nlme_3.1-162         matrixStats_1.0.0    fs_1.6.3            
#>   [4] xts_0.13.1           threejs_0.3.3        rstan_2.26.23       
#>   [7] tensorA_0.36.2       R.cache_0.16.0       backports_1.4.1     
#>  [10] tools_4.2.3          utf8_1.2.3           R6_2.5.1            
#>  [13] DT_0.29              colorspace_2.1-0     withr_2.5.0         
#>  [16] tictoc_1.2           tidyselect_1.2.0     gridExtra_2.3       
#>  [19] prettyunits_1.1.1    processx_3.8.2       compiler_4.2.3      
#>  [22] cli_3.6.1            shinyjs_2.1.0        posterior_1.4.1     
#>  [25] colourpicker_1.3.0   checkmate_2.2.0      scales_1.2.1        
#>  [28] dygraphs_1.1.1.6     callr_3.7.3          QuickJSR_1.0.6      
#>  [31] stringr_1.5.0        digest_0.6.33        StanHeaders_2.26.27 
#>  [34] minqa_1.2.5          rmarkdown_2.24       R.utils_2.12.2      
#>  [37] base64enc_0.1-3      pkgconfig_2.0.3      htmltools_0.5.6     
#>  [40] lme4_1.1-34          styler_1.10.2        fastmap_1.1.1       
#>  [43] htmlwidgets_1.6.2    rlang_1.1.1          rstudioapi_0.15.0   
#>  [46] shiny_1.7.5          farver_2.1.1         generics_0.1.3      
#>  [49] zoo_1.8-12           jsonlite_1.8.7       crosstalk_1.2.0     
#>  [52] gtools_3.9.4         distributional_0.3.2 R.oo_1.25.0         
#>  [55] inline_0.3.19        magrittr_2.0.3       loo_2.6.0           
#>  [58] bayesplot_1.10.0     Matrix_1.5-3         munsell_0.5.0       
#>  [61] fansi_1.0.4          abind_1.4-5          lifecycle_1.0.3     
#>  [64] R.methodsS3_1.8.2    stringi_1.7.12       yaml_2.3.7          
#>  [67] MASS_7.3-58.2        pkgbuild_1.4.2       plyr_1.8.8          
#>  [70] grid_4.2.3           parallel_4.2.3       promises_1.2.1      
#>  [73] crayon_1.5.2         miniUI_0.1.1.1       lattice_0.20-45     
#>  [76] splines_4.2.3        knitr_1.43           ps_1.7.5            
#>  [79] pillar_1.9.0         igraph_1.5.1         boot_1.3-28.1       
#>  [82] markdown_1.8         shinystan_2.6.0      reshape2_1.4.4      
#>  [85] codetools_0.2-19     stats4_4.2.3         rstantools_2.3.1.1  
#>  [88] reprex_2.0.2         glue_1.6.2           evaluate_0.21       
#>  [91] RcppParallel_5.1.7   nloptr_2.0.3         vctrs_0.6.3         
#>  [94] httpuv_1.6.11        gtable_0.3.4         purrr_1.0.2         
#>  [97] ggplot2_3.4.3        xfun_0.40            mime_0.12           
#> [100] xtable_1.8-4         later_1.3.1          survival_3.5-3      
#> [103] tibble_3.2.1         shinythemes_1.2.0    ellipsis_0.3.2

Created on 2024-02-06 with reprex v2.0.2
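A possible reading of the timings above, offered as an assumption rather than a diagnosis: mc.cores controls how many chains run in parallel, not within-chain threading, and as far as I know running chains in parallel from R on Windows means starting separate R worker processes. For a model that fits in under a second, that fixed startup cost can dominate. The sketch below uses only the base parallel package (no Stan) to compare worker startup time with the time for a trivial job; the variable names are illustrative.

```r
library(parallel)

# Time how long it takes just to start two worker R processes.
# makeCluster() creates a PSOCK (socket) cluster by default.
startup_sec <- system.time(cl <- makeCluster(2))[["elapsed"]]

# Time a trivial job on the workers that are already running.
work_sec <- system.time(
  squares <- parLapply(cl, 1:2, function(i) i^2)
)[["elapsed"]]

stopCluster(cl)

cat("worker startup:", startup_sec, "sec\n")
cat("trivial job:   ", work_sec, "sec\n")
```

If startup is much larger than the job itself, parallel chains only pay off for models that take considerably longer to sample than the workers take to start.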