Large CmdStan performance differences: Windows vs. Linux

I’m running the same model on a dual-boot system (Windows 10 & Kubuntu 19.04) and finding that Windows takes over three times longer to fit the same model. I’m not sure where the performance difference is coming from.

On both systems I’m running a clean install of cmdstan (tested both the 2.21 release and develop). I’ll attach the model and data to the end of this post as well.

2.21

devtools::install_github("stan-dev/cmdstanr")
library(cmdstanr)
install_cmdstan()
source("test_data.R")
mod = cmdstan_model("F1_Base.stan")
samp = mod$sample(data = test_data, num_chains = 4, num_cores = 4)

Linux

Chain 4 finished in 208.3 seconds.
Chain 2 finished in 210.1 seconds.
Chain 3 finished in 210.2 seconds.
Chain 1 finished in 211.8 seconds.

All 4 chains finished successfully.
Mean chain execution time: 210.1 seconds.
Total execution time: 211.8 seconds.

Windows

Chain 4 finished in 711.0 seconds.
Chain 2 finished in 712.5 seconds.
Chain 3 finished in 734.3 seconds.
Chain 1 finished in 908.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 766.5 seconds.
Total execution time: 908.4 seconds.

Develop

library(cmdstanr)
install_cmdstan(repo_clone = TRUE, repo_branch = "develop", overwrite = TRUE)
source("test_data.R")
mod = cmdstan_model("F1_Base.stan")
samp = mod$sample(data = test_data, num_chains = 4, num_cores = 4)

Linux

Chain 3 finished in 199.4 seconds.
Chain 4 finished in 200.2 seconds.
Chain 1 finished in 201.3 seconds.
Chain 2 finished in 210.9 seconds.

All 4 chains finished successfully.
Mean chain execution time: 202.9 seconds.
Total execution time: 211.0 seconds.

Windows

Chain 2 finished in 683.5 seconds.
Chain 1 finished in 687.2 seconds.
Chain 4 finished in 689.0 seconds.
Chain 3 finished in 903.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 740.8 seconds.
Total execution time: 903.4 seconds.

F1_Base.stan (2.3 KB) test_data.R (84.1 KB)

I was seeing this when doing performance tests for some papers a while ago, and when comparing execution times of some models with the other guys from our group, who are all on Windows. It’s not great.

Are you using RTools 3.5? I am hoping RTools 4.0 will bring us closer to Linux performance, as that will switch from g++ 4.9.3 to g++ 8 (or 7). If it’s not that, then it’s probably the MinGW libs that are used?

In that case we would need to support the Microsoft C++ compiler to get there (I think), or get clang++ support for Windows (which is more or less the same thing, as clang is trying to be MSVC compatible).

@ahartikainen, you are on Windows, right? Have you switched to MSYS2, which is part of RTools 4.0? And if so, does it make a difference?

Ah, that makes sense. Looking forward to RTools 4 even more now!

I can test this.

Do you think this is something we can see on Github Actions? Let me set up a repo for this.
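A minimal cross-OS timing workflow for GitHub Actions could look roughly like this. The workflow name, the `run_model.py` script, and the artifact names are placeholders I made up, not what the actual repo uses:

```yaml
# Hypothetical sketch: run the same model on each OS and save timings.
name: stan-timing
on: [push]
jobs:
  time-model:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2
      - name: Run model and record timing
        run: python run_model.py --csv timings-${{ matrix.os }}.csv
      - uses: actions/upload-artifact@v2
        with:
          name: timings-${{ matrix.os }}
          path: timings-${{ matrix.os }}.csv
```

Note that hosted CI runners use different hardware per OS, so absolute times are only comparable within an OS across commits, not directly between OSes.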

Do you have some model that is a problem?

We can probably just try with the model @andrjohns posted above.

Testing on Github actions or Appveyor with Windows will probably be easiest to test this without breaking our own environments :)

I decided to give it a shot on my system with RTools 4.0 (based on MSYS2) and saw 15-20% faster times, but not much more than that. So this needs a deeper look.

There is a guide on setting up msys2 by @ChrisChiasson on the Math wiki https://github.com/stan-dev/math/wiki/Windows-Development-Notes

I added a small ‘skeleton’ repo for the comparisons.

It uses GitHub Actions, but I’m not sure whether you can really compare results.

See the ‘Actions’ tab for the CI results. It should save the timing (and summary) as a CSV --> many CSVs for post-processing.

Apparently macOS is slower too.

I have some weird performance issues on Windows.

PyStan complains a lot during initialization.

Yeah, the nested arrays of ordered vectors (ordered[T] thresholds_raw[G,J]) don’t initialise very well, but the sampler sorts itself out well enough.

I don’t understand the purpose of the test model and data - https://github.com/ahartikainen/stan_performance_testing/blob/master/Stan_models/F1_Base.stan, https://github.com/ahartikainen/stan_performance_testing/blob/master/Stan_models/F1_Base.data.R

it appears that the sampler is having a very hard time during initialization, as would be expected if you’re trying to fit a model to data that doesn’t come from that model. is this what’s going on? do you think this is tickling a bug in the sampler under some compiler?

what aspects of the PyStan interface is this testing?

the existing Stan performance tests check that changes to the algorithms and math library don’t degrade either performance or accuracy using a set of known models and data:

maybe you should use these for testing instead of the ad hoc model? see https://github.com/stan-dev/performance-tests-cmdstan

addendum: or else we should add some version of the ad hoc model to the performance tests.

sorry if this is a naive question, but should the sampler sort this out?

same question as above: is there a problem with Stan or is this a library/compiler issue?

It is more of a performance thing. Why is Windows (and macOS) so much slower than Linux?

Also, I can’t work out why CmdStanPy neither fails nor samples on Windows. It just gets “stuck”. Or maybe I need to inject some cout calls to see the progress in the background.

Yes, I’m planning to do that in the future; this was just a first pass to get things running.

not good. what does CmdStan do? you should be able to get the exact command and run it by hand.
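For reference, the interface wrappers ultimately shell out to the compiled CmdStan binary, so something along these lines can be run by hand in the model’s directory. The paths and argument values here are illustrative; substitute whatever CmdStanPy reports as its command:

```shell
# Illustrative CmdStan invocation (requires the compiled F1_Base binary);
# adjust the seed, iteration counts, and file paths to match your run.
./F1_Base sample num_warmup=1000 num_samples=1000 \
    random seed=1234 \
    data file=F1_Base.data.R \
    output file=output_1.csv
```

Running it directly in a terminal makes the console output (or any hang) visible without the interface layer in between.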

Which versions are you referring to here? I think I recall that macOS did catch up in speed once I introduced the TBB malloc library under macOS. That gave macOS a serious speed bump. Linux did not benefit from this which suggests that the memory management under Linux is better from the start and the TBB made the difference on macOS.

The TBB was introduced in 2.21.

I’m testing against 2.19 and 2.22.1

Results are here (CSV files; I have not yet gathered them into a plot). I plan to add another job that gathers these into a plot served on the repo.

e.g.
pystan results
cmdstanpy results
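The gathering step could be a small script along these lines. The file layout and column names (`os`, `seconds`) are assumptions for illustration, not the repo’s actual artifact schema:

```python
# Sketch: collect per-chain timing CSVs and summarise mean time per OS.
# Assumes each CSV has columns "os" and "seconds" (hypothetical layout).
import csv
import glob
from collections import defaultdict

def summarise_timings(pattern="results/*.csv"):
    """Return {os_name: mean chain time} across all matching CSV files."""
    times = defaultdict(list)
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                times[row["os"]].append(float(row["seconds"]))
    return {name: sum(vals) / len(vals) for name, vals in times.items()}
```

The resulting dict can then be fed straight into a bar plot per OS.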

And when these links get too old, you can access artifacts from here

edit: Also, testing against a single seed is not really a good idea.

Maybe we can use these results to discourage people from using Windows and eventually deprecate the platform altogether. (One has to keep one’s dreams alive!)

We want to keep the conditions the same. Given that numerics are not identical across OS/CPU/compiler/settings, results can diverge after some number of iterations, even with the same seed. What we really want to test is basic iteration time for a fixed number of iterations. So HMC is probably better than NUTS, as it controls the number of log density and gradient evals per iteration. It doesn’t test everything, though, and it could be the MCMC algorithm slowing things down.
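One way to pin the per-iteration cost in CmdStan is static-engine HMC with a fixed integration time. The argument values here are illustrative:

```shell
# Static-engine HMC fixes the integration time, and hence the number of
# leapfrog steps per iteration, unlike NUTS' adaptive trajectory lengths.
# Requires the compiled F1_Base binary; paths are illustrative.
./F1_Base sample algorithm=hmc engine=static int_time=6.28 \
    data file=F1_Base.data.R
```

With the trajectory length fixed, wall-clock differences between OSes reflect the cost of log density and gradient evaluations rather than differing tree sizes.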

Actually, I procrastinated a bit and found the solution (well, at least it looks promising).

WSL (Windows Subsystem for Linux): https://docs.microsoft.com/en-us/windows/wsl/install-win10
It’s available in the Microsoft Store. Easy to install; took me 5 minutes, no hiccups.

Had to install make and g++ with
sudo apt install make g++

No mingw32-make, no TBB path stuff, just works! Sweet. You can edit all files normally. The C: drive is mounted under /mnt/c.
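Assuming CmdStan is unpacked somewhere on the Windows drive, the usual Linux build steps apply inside WSL. The directory names below are illustrative:

```shell
# Build CmdStan and a model from inside WSL; the Windows C: drive is
# visible at /mnt/c, so existing files can be used in place.
sudo apt install make g++
cd /mnt/c/cmdstan-2.22.1
make build -j4
make /mnt/c/models/F1_Base    # compiles F1_Base.stan next to the .stan file
```

After that the compiled model runs exactly as on a native Linux install.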

And the times for your model @andrjohns :
Native Windows + RTools 4.0: 400s
WSL: 140s !!

I would encourage anyone on Windows who uses cmdstan or cmdstanpy, or who runs rstan or cmdstanr via the console, to try it out. Easy to try & it doesn’t break anything.

RStudio doesn’t work with WSL directly yet, but you can run RStudio Server in WSL and use it in the browser, though that sounds a bit too meta for my blood: https://medium.com/lead-and-paper/how-to-use-rstudio-server-for-ubuntu-on-windows-10-a7aeee661a5d

There is also WSL 2, which needs the latest build of Windows 10 and also supports GPUs and all.

Cool! Super weird.

sounds great!