CmdStanR with Windows Subsystem for Linux compiles models much slower than CmdStanR on native Windows?

I found that compiling a model with CmdStanR (0.5.3) through Windows Subsystem for Linux (WSL) is about 5 times slower than with CmdStanR on native Windows, at least in my environment (roughly 54 s versus 11 s for the bernoulli example below). Is this reproducible on other Windows machines, or is it specific to my setup? Is there any way to speed up compilation with CmdStanR under WSL? Any ideas are appreciated.

Test case

My R environment

> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Japanese_Japan.utf8  LC_CTYPE=Japanese_Japan.utf8
[3] LC_MONETARY=Japanese_Japan.utf8 LC_NUMERIC=C
[5] LC_TIME=Japanese_Japan.utf8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] cmdstanr_0.5.3

loaded via a namespace (and not attached):
 [1] pillar_1.8.0         compiler_4.2.0       tools_4.2.0
 [4] jsonlite_1.8.0       lifecycle_1.0.1      tibble_3.1.8
 [7] gtable_0.3.0         checkmate_2.1.0      pkgconfig_2.0.3
[10] rlang_1.0.3          cli_3.3.0            DBI_1.1.3
[13] xfun_0.31            withr_2.5.0          dplyr_1.0.9
[16] knitr_1.39           generics_0.1.3       vctrs_0.4.1
[19] tictoc_1.0.1         grid_4.2.0           tidyselect_1.1.2
[22] data.table_1.14.2    glue_1.6.2           R6_2.5.1
[25] processx_3.7.0       fansi_1.0.3          distributional_0.3.0
[28] tensorA_0.36.2       ggplot2_3.3.6        farver_2.1.1
[31] purrr_0.3.4          posterior_1.2.2      magrittr_2.0.3
[34] ps_1.7.1             backports_1.4.1      scales_1.2.0
[37] abind_1.4-5          assertthat_0.2.1     colorspace_2.0-3
[40] renv_0.15.5          utf8_1.2.2           munsell_0.5.0 

My WSL environment

  • Ubuntu 20.04 LTS
  • I installed the following BLAS packages in WSL, following @avehtari's post:
sudo apt-get install liblapacke-dev
sudo apt-get install liblapacke
sudo apt-get install libopenblas-dev

sudo apt-get install libopenblas-serial-dev
sudo apt-get install libopenblas0
sudo apt-get install libopenblas0-serial

Reproducible code

  1. Open R from Windows (NOT from WSL; I do not have R on my WSL)
  2. Install CmdStan, both WSL version and native Windows version, with the following cpp_options respectively:
    • for WSL version, cpp_options <- list( "CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE", "LDLIBS += -lblas -llapack -llapacke" )
    • for native Windows version, cpp_options <- list( "CXXFLAGS += -Wno-nonnull", "TBB_CXXFLAGS= -U__MSVCRT_VERSION__ -D__MSVCRT_VERSION__=0x0E00" )
  3. Run the code as follows (the code originally came from Getting started with CmdStanR):
library(cmdstanr)

file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
data_list <- list(N = 10, y = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1))

## 2.30.1 (Non-WSL)
set_cmdstan_path("C:/Users/MY_USER_NAME/Documents/.cmdstan/cmdstan-2.30.1")
cmdstan_path()
cmdstan_version()

tictoc::tic()
mod_2.30.1_non_wsl <- cmdstan_model(
  file,
  force_recompile = TRUE ## since the same model is run multiple times
)
tictoc::toc() # 11.89 sec, 10.94 sec, 11.19 sec elapsed

tictoc::tic()
fit_2.30.1_non_wsl <- mod_2.30.1_non_wsl$sample(
  data = data_list,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 500 # print update every 500 iters
)
tictoc::toc() # 3.4 sec, 3.37 sec, 3.34 sec elapsed

## 2.30.1 (WSL)
set_cmdstan_path("C:/Users/MY_USER_NAME/Documents/.cmdstan/wsl-cmdstan-2.30.1")
cmdstan_path()
cmdstan_version()

tictoc::tic()
mod_2.30.1_wsl <- cmdstan_model(
  file,
  force_recompile = TRUE ## since the same model is run multiple times
)
tictoc::toc() # 54.27 sec, 51.04 sec, 57.39 sec elapsed

tictoc::tic()
fit_2.30.1_wsl <- mod_2.30.1_wsl$sample(
  data = data_list,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 500 # print update every 500 iters
)
tictoc::toc() # 4.94 sec, 4.64 sec, 4.67 sec elapsed

## 2.30.1 (WSL) with openblas setting
Sys.setenv(OPENBLAS_NUM_THREADS = "1")
Sys.getenv("OPENBLAS_NUM_THREADS")

set_cmdstan_path("C:/Users/MY_USER_NAME/Documents/.cmdstan/wsl-cmdstan-2.30.1")
cmdstan_path()
cmdstan_version()

tictoc::tic()
mod_2.30.1_wsl <- cmdstan_model(
  file,
  force_recompile = TRUE ## since the same model is run multiple times
)
tictoc::toc() # 54.19 sec, 60.53 sec, 51.86 sec elapsed

tictoc::tic()
fit_2.30.1_wsl <- mod_2.30.1_wsl$sample(
  data = data_list,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 500 # print update every 500 iters
)
tictoc::toc() # 4.27 sec, 4.83 sec, 4.71 sec elapsed

The cpp_options you set for the WSL build increase the compilation time a lot. As you didn’t use them for native Windows, you are comparing different things. Compare first without these options.

WSL has known performance issues when accessing the native Windows file system, which could be at play here. Make sure that your model and the output directory are inside the WSL world, not somewhere in /mnt/c.

@avehtari

Thank you for your comment.

Could you explain why the options "CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE" and "LDLIBS += -lblas -llapack -llapacke" might slow down compilation? Since "CXXFLAGS += -march=native -mtune=native" is used for GCC optimisation (according to this page), I thought these settings should not slow down compilation. Could you tell me how setting -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE and "LDLIBS += -lblas -llapack -llapacke" affects the compilation time?

I installed a second copy of wsl-cmdstan-2.30.1, this time without setting any cpp_options. I ran the bernoulli example three times each with the two installations (with and without cpp_options). However, the compilation time was always close to 1 min regardless of whether cpp_options were set, so cpp_options did not noticeably affect the compilation time in my environment. Did I miss anything in the test?

library(cmdstanr)
file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
data_list <- list(N = 10, y = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1))

###################################
## 2.30.1 (WSL) with cpp options ##
###################################

set_cmdstan_path("C:/Users/MY_USER_NAME/Documents/.cmdstan/wsl-cmdstan-2.30.1")
cmdstan_path()
cmdstan_version()
cmdstan_make_local()
# [1] "CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE"
# [2] "LDLIBS += -lblas -llapack -llapacke"

tictoc::tic()
mod_2.30.1_wsl <- cmdstan_model(
  file,
  force_recompile = TRUE ## since the same model is run multiple times
)
tictoc::toc()
# 69.18 sec, 58.63 sec, 53.92 sec elapsed

tictoc::tic()
fit_2.30.1_wsl <- mod_2.30.1_wsl$sample(
  data = data_list,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 500 # print update every 500 iters
)
tictoc::toc()
# 4.88 sec, 5.05 sec, 4.72 sec elapsed

######################################
## 2.30.1 (WSL) without cpp option ##
######################################

set_cmdstan_path("C:/Users/MY_USER_NAME/Documents/.cmdstan/no-cpp-option/wsl-cmdstan-2.30.1")
cmdstan_path()
cmdstan_version()
cmdstan_make_local() # returns `character(0)`

tictoc::tic()
mod_2.30.1_wsl_no_cpp_option <- cmdstan_model(
  file,
  force_recompile = TRUE ## since the same model is run multiple times
)
tictoc::toc() # 55.7 sec, 48.33 sec, 50.39 sec elapsed

tictoc::tic()
fit_2.30.1_wsl_no_cpp_option <- mod_2.30.1_wsl_no_cpp_option$sample(
  data = data_list,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 500 # print update every 500 iters
)
tictoc::toc() # 3.97 sec, 4 sec, 3.87 sec elapsed
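
To put the two WSL runs above side by side, the compile times can be summarised in a few lines of base R. This is only a sketch: the numbers are the elapsed times copied from the toc() comments above, and the object names (compile_sec, compile_summary, rel_diff) are made up for illustration.

```r
# Elapsed compile times in seconds, copied from the toc() comments above.
compile_sec <- list(
  wsl_with_cpp_options    = c(69.18, 58.63, 53.92),
  wsl_without_cpp_options = c(55.70, 48.33, 50.39)
)

# Mean and range of the three runs for each configuration.
compile_summary <- data.frame(
  config   = names(compile_sec),
  mean_sec = vapply(compile_sec, mean, numeric(1)),
  min_sec  = vapply(compile_sec, min, numeric(1)),
  max_sec  = vapply(compile_sec, max, numeric(1)),
  row.names = NULL
)
print(compile_summary)

# Relative difference in mean compile time (with vs. without cpp_options).
rel_diff <- compile_summary$mean_sec[1] / compile_summary$mean_sec[2] - 1
cat(sprintf("With cpp_options: %+.1f%% relative to without\n", 100 * rel_diff))
```

The ranges of the two configurations overlap, so with only three runs each the difference is hard to separate from run-to-run noise.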

@WardBrian

Thank you for the information. Since I am using WSL 2, not WSL 1, I would indeed expect to hit the file-access performance issues described in Microsoft’s documentation.

I copied bernoulli.stan into the WSL world and gave it all permissions (read, write, and execute). However, the copy of bernoulli.stan in the WSL world is not accessible from R. What am I missing?

file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")

file.copy(
  from = file,
  to = "\\\\wsl$\\Ubuntu-20.04\\home\\my_user_name"
)

file_in_wsl <- "\\\\wsl$\\Ubuntu-20.04\\home\\my_user_name\\bernoulli.stan"

file.access(file_in_wsl, mode = 0) # 0 = The file exists
file.access(file_in_wsl, mode = 1) # -1 = not executable
file.access(file_in_wsl, mode = 2) # -1 = not writable
file.access(file_in_wsl, mode = 4) # -1 = not readable

## The following command does not change permission
## I changed the permissions of file from WSL 2,
## by `chmod 777 bernoulli.stan`
Sys.chmod(file_in_wsl, "777")

## Still not accessible from R
file.access(file_in_wsl, mode = 0) # 0 = The file exists
file.access(file_in_wsl, mode = 1) # -1 = not executable
file.access(file_in_wsl, mode = 2) # -1 = not writable
file.access(file_in_wsl, mode = 4) # -1 = not readable

The GCC optimization is not free, and I usually see a 40%-50% increase in compilation time when using those options (sometimes more). The bernoulli.stan example seems to be simple enough that there is not much to optimize, and I don’t see any speed difference.

The options -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE and "LDLIBS += -lblas -llapack -llapacke" change which code is included and link to additional libraries, but the increase in compilation/linking time is much smaller than I remembered.

It seems the slowdown is then due to the file access performance issue. Is cmdstan also in the WSL world filesystem? I see set_cmdstan_path("C:/Users/MY_USER_NAME/Documents/.cmdstan/wsl-cmdstan-2.30.1"), but I don’t know whether this is in WSL or not.
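
A quick way to answer this kind of question from the R side is to look at the form of the path. The sketch below assumes only the two path styles that appear in this thread: drive-letter paths such as C:/... (Windows filesystem) and \\wsl$\... UNC paths (WSL filesystem). The function name path_location() is hypothetical and is not part of cmdstanr.

```r
# Hypothetical helper: classify a path by its prefix.
# Assumption: only drive-letter paths and \\wsl$\ UNC paths are recognised.
path_location <- function(path) {
  # Normalise backslashes to forward slashes so one pattern covers both styles.
  p <- gsub("\\\\", "/", path)
  ifelse(
    grepl("^//wsl\\$/", p, ignore.case = TRUE), "wsl",
    ifelse(grepl("^[A-Za-z]:/", p), "windows", "unknown")
  )
}

# The CmdStan path used in this thread is a drive-letter path.
path_location("C:/Users/MY_USER_NAME/Documents/.cmdstan/wsl-cmdstan-2.30.1")

# The copied model file lives behind the \\wsl$ share.
path_location("\\\\wsl$\\Ubuntu-20.04\\home\\my_user_name\\bernoulli.stan")
```

A path classified as "windows" is on the Windows side even if the CmdStan installation inside it was built with the WSL toolchain.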

@avehtari

Thank you for your reply.

Now I understand that these GCC optimisation settings, which are meant to reduce sampling time, increase compilation time.

The path is in the Windows world, not the WSL world: I installed wsl-cmdstan-<version number> into the Windows world using R running on Windows. I did not see any instruction in NEWS.md of cmdstanr to install wsl-cmdstan-<version number> in any special way (e.g. installing it from R running in the WSL world rather than the Windows world, setting an additional PATH, etc.). Also, @andrjohns's pull request says ‘[t]he end-user simply needs to add wsl=TRUE to the install_cmdstan() call, and all model compilation and execution for that installation will subsequently be run through WSL.’ Therefore, I installed the WSL version of CmdStan the same way I installed the Windows version.

I am not sure exactly what would be preventing you from accessing that file. I think to really compare, you would also want to install R inside WSL to avoid anything crossing the WSL/Windows barrier.

@WardBrian

Thank you for your reply.

Indeed, running R inside WSL is important when comparing the running speed of the two versions. However, I think cmdstanr::install_cmdstan(wsl = TRUE, ...) is intended to give CmdStan a speed boost when it is driven from R running on Windows (please tell me if that is not the case). Therefore, I did not try installing R inside WSL; I want to see whether the WSL version of CmdStan, operated from Windows, outperforms the native Windows version.

I am still struggling to get better compilation performance from the WSL version, as I posted in this thread, but I did enjoy its notably faster sampling when I ran my own models, which are not reported here and are more complicated than the bernoulli example I used here.

I would defer to someone like @rok_cesnovar for what the WSL flag does for cmdstanr. If it is crossing file system barriers, it would probably take some severe penalties in terms of speed. If a model were compiled using the WSL toolchain but run entirely in the Windows space it might be fine, but you’ll still have things like libc which live on the WSL side.

Since @andrjohns says here that ‘all model compilation and execution for that installation will subsequently be run through WSL’, the model compilation is presumably done on the WSL side, which may be the cause of the speed issue I report in this thread.

OK, I will figure out what I am missing in WSL.

Thank you for your helpful comments!

Perhaps @andrjohns did something to avoid this, but my basic reading of that PR is that it uses “wsl.exe” to run the commands while leaving the files in /mnt/. This will be quite slow for all IO operations, especially for simple models, which are IO bound.
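
To make the /mnt/ point concrete: by default WSL exposes Windows drives under /mnt/<drive letter>, so a Windows path is seen from inside WSL as a /mnt/ path. The sketch below shows that mapping for the CmdStan path used in this thread. It assumes the default mount root and drive-letter paths only; to_wsl_mount_path() is a hypothetical helper for illustration, not what cmdstanr actually does internally.

```r
# Hypothetical helper: map a drive-letter Windows path to the location
# where WSL sees it, assuming the default /mnt/<drive> mount root.
to_wsl_mount_path <- function(win_path) {
  # Normalise backslashes to forward slashes.
  p <- gsub("\\\\", "/", win_path)
  # Only drive-letter paths such as C:/... are handled in this sketch.
  stopifnot(grepl("^[A-Za-z]:/", p))
  drive <- tolower(substr(p, 1, 1))
  # Drop the "C:" prefix and prepend the mount point.
  paste0("/mnt/", drive, substring(p, 3))
}

to_wsl_mount_path("C:/Users/MY_USER_NAME/Documents/.cmdstan/wsl-cmdstan-2.30.1")
```

Any file whose WSL-side path starts with /mnt/ is still stored on the Windows filesystem, so reads and writes from WSL 2 cross the filesystem boundary.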

Oh, I had no idea about the IO issues, thanks for the heads up! I’ll change the cmdstanr handling to use the WSL filesystem for storing cmdstan.

It’s a very unfortunate regression on Windows’ part. You can find more information (and a lot of very upset users) here: [wsl2] filesystem performance is much slower than wsl1 in /mnt · Issue #4197 · microsoft/WSL · GitHub