Poor performance for compiled Stan models

I purchased a new computer with a powerful multicore CPU to run my Stan models faster. However I am having some difficulty configuring my compiler (at least I think that’s the culprit.) Any advice on speeding things up would be greatly appreciated.

When I run an example model with rstanarm on my new computer (Windows 10) and compare it to my old setup (OSX), the new machine samples about twice as fast. Great! I would have expected about this much of a speedup from a significantly faster computer.

However, compiled programs (rstanarm models are precompiled I think?) are actually running about 20%ish slower on my new computer, compared to my older slow machine. I am new to Windows and am wondering whether it has something to do with the compiler. (I have tried installing Ubuntu but alas have not succeeded yet.)

I tried illustrating this with an example below. Interestingly, there is also a large performance difference between rstanarm and brms but I am not sure if that is related to my question.

First, I set everything up following the RStan Getting Started guide:

dotR <- file.path(Sys.getenv("HOME"), ".R")
if (!file.exists(dotR)) dir.create(dotR)
M <- file.path(dotR, ifelse(.Platform$OS.type == "windows", "Makevars.win", "Makevars"))
if (!file.exists(M)) file.create(M)
# cat("\nCXX14FLAGS=-O3 -march=native -mtune=native",
#     if( grepl("^darwin", R.version$os)) "CXX14FLAGS += -arch x86_64 -ftemplate-depth-256" else
#       if (.Platform$OS.type == "windows") "CXX11FLAGS=-O3 -march=corei7 -mtune=corei7" else
#         "CXX14FLAGS += -fPIC",
#     file = M, sep = "\n", append = TRUE)
# file.edit(M)
## [1] ""                                          
## [2] "CXX14FLAGS=-O3 -march=native -mtune=native"
## [3] "CXX11FLAGS=-O3 -march=corei7 -mtune=corei7"
system.time(
  stan_lmer(  # stan_lmer() assumed here (rstanarm model 'continuous')
    Reaction ~ Days + (Days | Subject),
    data = sleepstudy,
    iter = 10000,
    chains = 1
  )
)
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1:  Elapsed Time: 13.658 seconds (Warm-up)
## Chain 1:                9.733 seconds (Sampling)
## Chain 1:                23.391 seconds (Total)
##    user  system elapsed 
##   24.31    0.01   24.32

I ran the same code with my old OSX computer, which resulted in ~50s runtime. I am very happy for this speedup. However, with compiled models things are different, below. I first compile the model (so I can time just the sampling, afterwards):

brms_sampler <- brm(
  Reaction ~ Days + (Days | Subject), 
  data = sleepstudy, 
  chains = 0
)

And then sample

system.time(
  update(brms_sampler, iter = 10000, chains = 1)
)
## SAMPLING FOR MODEL '49d09615cc885efe2c01f11482fd096d' NOW (CHAIN 1).
## Chain 1:  Elapsed Time: 8.693 seconds (Warm-up)
## Chain 1:                7.492 seconds (Sampling)
## Chain 1:                16.185 seconds (Total)
##    user  system elapsed 
##   16.61    0.00   16.61

(I don’t know why brms would be this much faster than rstanarm!) I ran the same code with my old slower OSX machine, and it ran in ~12 seconds. So I wonder what is hurting the performance of my newer machine such that compiled models run slower than equivalent ones on a slower machine? I also wonder whether this has something to do with the CmdStan installation error I posted earlier (Error installing CmdStan (Windows 10; Command 'mingw32-make.exe' not found @win/processx.c:983)).

Session and computer details below:

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## Matrix products: default
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] brms_2.13.3     rstanarm_2.19.3 Rcpp_1.0.4.6    lme4_1.1-23    
## [5] Matrix_1.2-18  
## [1] "AuthenticAMD"
## [1] "AMD Ryzen 9 3900X 12-Core Processor"
## [1] 24

Thanks very much for your time

Can you try running things under WSL? This is the Windows Subsystem for Linux. People have reported that things run a lot faster with it due to better compilers.

Thanks @wds15. I actually have tried WSL but have not yet been able to install all the required packages (there are myriad errors that I have not yet had time to parse). I also would not be happy with WSL as a long-term solution because I’d be stuck with the command line (although Linux GUI apps are coming soon, apparently, so I could use RStudio when that happens).

I’ll report back if/when I get R & Stan working in WSL.

In general, performance on Windows is slower than on Linux/macOS.

The performance degradation can be ~50% or more. So anything you gained from the new computer's hardware could have been lost to the fact that you are using Windows. If you want to use Windows and want maximum performance with Stan, then I suggest you use WSL, as @wds15 suggested.

This difference is due to Stan using the MinGW compilers and libraries rather than native MSVC or Clang.
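To see which compiler and flags your R installation will actually hand to Stan, something like the following should work. This is only a sketch: it assumes `R CMD config` runs on your setup (on Windows that needs Rtools), and `r_config` is a helper name made up for this example.

```r
# Hypothetical helper: ask `R CMD config` for one build variable.
# Returns NA_character_ if the query fails (e.g. no build tools installed).
r_config <- function(var) {
  r_bin <- file.path(R.home("bin"), "R")
  out <- tryCatch(
    suppressWarnings(system2(r_bin, c("CMD", "config", var),
                             stdout = TRUE, stderr = FALSE)),
    error = function(e) character(0)
  )
  if (length(out) == 0) NA_character_ else paste(out, collapse = " ")
}

cxx14       <- r_config("CXX14")       # compiler command R uses for C++14 code
cxx14_flags <- r_config("CXX14FLAGS")  # optimisation flags for C++14 code
cat("CXX14:     ", cxx14, "\nCXX14FLAGS:", cxx14_flags, "\n")
```

On a native Windows R the first value should point at a MinGW g++, while inside WSL it should be the distribution's own g++ or clang++. The flags should include whatever was put in `~/.R/Makevars` or `Makevars.win`.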

Thanks @rok_cesnovar! Does that also explain the performance difference between rstanarm and brms?

Hopefully WSL provides GUI application support soon (for those of us who find R without RStudio difficult…)

I don’t have enough knowledge here to give an answer, sorry.
Is there no difference between the two on your old machine?

GUI WSL is unlikely to happen anytime soon. What is more likely, and would have the same effect, is RStudio on Windows running with R in WSL. Your other option is running RStudio Server in WSL (https://medium.com/lead-and-paper/how-to-use-rstudio-server-for-ubuntu-on-windows-10-a7aeee661a5d).

This thread is relevant to this discussion: Large Cmdstan performance differences Windows vs. Linux

Well actually, WSL GUI might happen: https://devblogs.microsoft.com/commandline/the-windows-subsystem-for-linux-build-2020-summary/#wsl-gui

Sorry for the wrong info before.

Also found this post that claims they have RStudio working in WSL2 directly: https://github.com/rstudio/rstudio/issues/6760#issuecomment-650425933

But do that at your own risk :)

Thanks both @rok_cesnovar and @wds15, I managed to get everything working in WSL. The brms model now samples in 5 seconds. Brilliant!

Regarding the difference between rstanarm and brms: brms is a lot faster on both the old OSX machine and the new Windows machine.

I’ll take a look at the RStudio Server in WSL option. It looks really good.
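For reference, here is how the timings quoted in this thread compare. It is a quick base-R sketch: the old-machine figures are the approximate ~50 s and ~12 s values mentioned above, and the WSL figure is the rounded 5 s, so the ratios are rough.

```r
# Elapsed times in seconds as reported in this thread
# (old-machine and WSL values are approximate).
timings <- data.frame(
  model   = c("rstanarm", "brms", "brms"),
  setup   = c("Windows (native)", "Windows (native)", "Windows (WSL)"),
  old_osx = c(50, 12, 12),
  new     = c(24.32, 16.61, 5)
)

# Speedup > 1 means the new machine is faster; < 1 means it is slower.
timings$speedup <- round(timings$old_osx / timings$new, 2)
print(timings)
```

So rstanarm is roughly 2.1x faster on the new machine, the compiled brms model on native Windows runs at about 0.7x the old machine's speed, and the same model under WSL is about 2.4x faster than on the old machine.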


Wow, thanks! I’ll try that right away.

After using windows for about a week now I can safely advise anyone reading to stay away from it…


Hey @matti, the speed difference between rstanarm and brms will depend on the model and shouldn’t really be related to changes in hardware. For many models there shouldn’t be much difference between the two packages, but in some cases there will be. The key speed issue is not the pre-compilation but the fact that all rstanarm Stan code is pre-written and has to anticipate all possible user choices in a single Stan program (or minimal Stan programs). For certain types of models (like with varying intercepts and varying slopes) the way rstanarm has to code the design matrices makes it slower than what brms can do by writing the Stan code on the fly based on the user’s choices (brms can code around doing an inefficient matrix multiplication).

So pre-compiled models are super convenient (and have the big advantage of not requiring the end user to deal with C++ toolchain issues) but there are particular cases where it is harder to code the model in the most efficient way possible.
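To make the matrix-multiplication point concrete, here is a toy illustration in plain R. It is not the Stan code that either package actually generates; the data and the names `b0`, `b1` and `Z` are invented for the example. It only shows that a generic "design matrix times coefficients" product and a model-specific index lookup give the same varying-intercept, varying-slope contribution, with the second doing far less work.

```r
# Toy data shaped like sleepstudy: 18 subjects, 10 days each (values are made up).
set.seed(1)
n_subj <- 18
days <- rep(0:9, times = n_subj)
subj <- rep(seq_len(n_subj), each = 10)

# Hypothetical per-subject intercept and slope deviations.
b0 <- rnorm(n_subj)
b1 <- rnorm(n_subj)

# Generic approach: build a full group-level design matrix Z (180 x 36),
# which is mostly zeros, and multiply it by the stacked coefficient vector.
f <- factor(subj)
ind <- model.matrix(~ 0 + f)   # 180 x 18 subject indicator columns
Z <- cbind(ind, ind * days)    # intercept block, then slope block
eta_matrix <- as.vector(Z %*% c(b0, b1))

# Model-specific approach: look up each row's two coefficients by subject index.
eta_index <- b0[subj] + b1[subj] * days

# Both routes give the same result.
print(all.equal(eta_matrix, eta_index))
```

The dense product touches all 36 columns for every one of the 180 rows, while the indexed version needs only two lookups and one multiply-add per row. A program written for one specific model can use the second form directly.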


Great, thanks for the explanation @jonah!