Cmdstanr sampling very slow compared to rstan

Hello,

Yesterday I installed cmdstanr on my MacBook Pro M1 2021, but I noticed that the times to fit the same model are much longer than with rstan (about three times as long).

Here is the detail of the times with cmdstanr:

All four chains finished successfully.
Mean chain execution time: 1067.0 seconds.
Total execution time: 1162.9 seconds.

While these are the times with rstan:

Chain 2:  Elapsed Time: 236,488 seconds (Warm-up)
Chain 2:                114.685 seconds (Sampling)
Chain 2:                351.173 seconds (Total)
Chain 4:
Chain 4:  Elapsed Time: 229,562 seconds (Warm-up)
Chain 4:                126.206 seconds (Sampling)
Chain 4:                355,768 seconds (Total)
Chain 1:
Chain 1:  Elapsed Time: 234,668 seconds (Warm-up)
Chain 1:                126.342 seconds (Sampling)
Chain 1:                361.01 seconds (Total)
Chain 3:
Chain 3:  Elapsed Time: 220.462 seconds (Warm-up)
Chain 3:                159.392 seconds (Sampling)
Chain 3:                379,854 seconds (Total)

These are the commands to run the model via cmdstanr:

comp_model <- cmdstan_model("MS_Normal_Gumbel_K2_test.stan")
out2 <- comp_model$sample(data = stan_dat,
                          init = init_par,
                          seed = 123,
                          chains = 4,
                          parallel_chains = 4,
                          iter_sampling = 1000,
                          iter_warmup = 1000,
                          adapt_delta = 0.99,
                          step_size = 0.1,
                          max_treedepth = 10)

This is the command to run the same model via rstan:

out2 <- stan(file = "MS_Normal_Gumbel_K2_test.stan",
             data = stan_dat, iter = 2000,
             cores = 4, init = init_par, include = TRUE,
             control = list(adapt_delta = 0.99, stepsize = 0.1,
                            adapt_gamma = 0.75, adapt_t0 = 10,
                            max_treedepth = 10),
             pars = c("betaL1_new", "betaL2_new", "betaS1_new", "betaS2_new",
                      "ltauL1", "betaTp01_new", "betaTp10_new", "pit",
                      "log_lik", "lp__"))

This behavior seems very strange to me, since we noticed a performance improvement when running the same model on Windows. I tried to set the variables following this topic: https://discourse.mc-stan.org/t/stan-compile-with-apple-accelerate-blas-lapack/32164/5 but without results.

I wanted to ask if you think this behavior is plausible or if you have any suggestions on improving cmdstanr performance.

Some info that might be useful:

  • MacBook Pro M1 2021
  • Sequoia 15.1.1
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.1.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
  • CmdStan Version: 2.36
  • rstan Version: 2.32.6
  • Compiler/Toolkit:
clang++ --version

Apple clang version 16.0.0 (clang-1600.0.26.4)

Target: arm64-apple-darwin24.1.0

Thread model: posix

InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

This program built for i386-apple-darwin11.3.0

Thank you!

Hi @gioiadc! One thing to try is setting some other flags in the make/local file. For example:

cmdstanr::cmdstan_make_local(
  cpp_options = list("CXXFLAGS += -O3 -march=native -mtune=native -Wno-deprecated-declarations"),
  append = FALSE
)
cmdstanr::rebuild_cmdstan(cores = 4)

You can set append = TRUE if you don't want to erase what you already have in make/local, but it might be good to start fresh and see if -O3 -march=native -mtune=native helps (at least on Mac; I think there may be a problem with the march and/or mtune options on Windows).
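
To check what's currently in make/local before and after rebuilding, you can call cmdstan_make_local() with no cpp_options, which just returns the existing contents:

# Returns the current make/local contents as a character vector,
# one element per line (empty if the file doesn't exist yet).
cmdstanr::cmdstan_make_local()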

@stevebronder @WardBrian any other ideas why CmdStan would be so much slower than RStan in this case?

1 Like

Hi, @gioiadc, and welcome to the Stan forums. And thanks for the careful report. For some reason, the cmdstanr call is missing the adapt_gamma and adapt_t0 arguments, but I've never seen anyone try to set those and don't even recall what they do.

Was rstan running the four chains in parallel? The M1 should be fast at parallelizing, so this is probably not the whole reason.

Is there a lot of sampling data to read in when it's done (how many parameters and how many draws)? I think either cmdstanr or rstan, or both, read the data back into memory proactively, and that can take a lot of time for large numbers of draws. @jonah should know how much work cmdstanr and rstan do proactively.
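
One way to check where the time goes, reusing the fit object name from the cmdstanr call above (a sketch; I'm not sure exactly when cmdstanr triggers the read):

# Time the first access to the draws; if cmdstanr defers reading the
# CSV output until now, a slow result here points at I/O rather than
# the sampler itself.
system.time(draws <- out2$draws())
dim(draws)  # iterations x chains x variables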


This is off topic for the original query, but it looks like either (a) the output was edited, or (b) Stan is inconsistent in its use of internationalization for decimal points. We should check our code, and we should also limit the precision of this output to two or three significant digits, just rounding to an integer when the number is greater than 100.

1 Like

Dear Jonah and Bob, thanks for the fast reply!

@jonah: I have cleaned the make/local file and run your commands, but unfortunately the sampling time with cmdstanr did not improve.

All 4 chains finished successfully.
Mean chain execution time: 1047.6 seconds.
Total execution time: 1173.8 seconds.

@Bob_Carpenter: when we started to work on our model with rstan, we played a bit with the NUTS parameters, but in the end we left them at their defaults, except for adapt_delta (increased to 0.99 to solve some divergences). In cmdstanr, we wanted to recreate the same settings, attributing the time difference to a difference in the NUTS settings, but there are no arguments for adapt_gamma and adapt_t0.

We ran the four chains in parallel in both rstan and cmdstanr; each chain has 2000 draws (1000 warm-up and 1000 sampling).

Our model has 19 parameters, 13339 transformed parameters, and 4438 generated quantities (in total, 17796). The dataset is of size 4437x7.

With rstan, we save 13327 parameters, and the file is understandably very large (almost 1 GB), while with cmdstanr, without parameter selection, the whole file is surprisingly small (less than 1 MB!!).
Our main concern is that we don't see the same great improvement switching to cmdstanr that we saw when running the same model on Windows.

My bad about the decimal points! I copied the output into an editor to write the question and then pasted it here, which (randomly) changed some decimals.

Thanks
Gioia

Are you able to watch the processes run in something like htop or Activity Monitor to confirm they are running in parallel like you're hoping?

My suspicion is that the chains are being serialized for some reason, because the slow-down is roughly proportional to the number of chains. I’m not sure if other cmdstanr output would make this obvious or not. If you run one chain, what time does that take?
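
For example, something like this (a sketch reusing the object names from your original call) would give a single-chain baseline:

# Single chain, otherwise identical settings; if four parallel chains
# take roughly four times this, they are effectively running serially.
out1 <- comp_model$sample(data = stan_dat,
                          init = init_par,
                          seed = 123,
                          chains = 1,
                          iter_sampling = 1000,
                          iter_warmup = 1000,
                          adapt_delta = 0.99,
                          step_size = 0.1,
                          max_treedepth = 10)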

1 Like

Hi Brian,
This is a screenshot of Activity Monitor while the model with four chains was running, confirming that four processes are working in parallel.

I ran only one chain, and I obtained:
Chain 1 finished in 1103.1 seconds.

Thanks,
Gioia

Just a quick idea: what does it say about how long 1000 leapfrog steps (or whatever the number is, I forgot) would take, in comparison? IIRC that's output at the start of sampling.

1 Like

tl;dr: Nesterov dual averaging parameters - no one should ever touch them.

This was discussed when we were developing cmdstanpy and cmdstanr. Cf. add logic to `sample` function to handle dense HMC metric · Issue #28 · stan-dev/cmdstanpy · GitHub

Cf. this comment from Betancourt: Sampler parameters which the typical user might need to set (hmc_nuts_diag_e_adapt only) - #2 by betanalpha

2 Likes

Here it is for rstan, but I don't know how to obtain it for cmdstanr:

Chain 1: 
Chain 1: Gradient evaluation took 0.006691 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 66.91 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: 
Chain 2: Gradient evaluation took 0.007145 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 71.45 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 3: 
Chain 3: Gradient evaluation took 0.005847 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 58.47 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: 
Chain 4: Gradient evaluation took 0.00551 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 55.1 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
...
Chain 1:  Elapsed Time: 195.37 seconds (Warm-up)
Chain 1:                104.088 seconds (Sampling)
Chain 1:                299.458 seconds (Total)
Chain 1: 
Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 198.456 seconds (Warm-up)
Chain 2:                119.315 seconds (Sampling)
Chain 2:                317.771 seconds (Total)
Chain 2: 
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 208.745 seconds (Warm-up)
Chain 4:                112.478 seconds (Sampling)
Chain 4:                321.223 seconds (Total)
Chain 4: 
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 222.822 seconds (Warm-up)
Chain 3:                113.564 seconds (Sampling)
Chain 3:                336.386 seconds (Total)
Chain 3: 

Oh that's right, it's suppressed by default in CmdStanR. If you have a fitted model object fit, you should be able to find that info by calling fit$output(), which returns a list with one element per chain, where each element is a character vector of all the output lines. So you could do something like this:

fit <- cmdstanr_example()
output <- fit$output() # this is a list with one element per chain 
cbind(Chain = 1:4, Timing = sapply(output, function(x) grep("Gradient evaluation", x, value = TRUE)))

Which gives me:

     Chain Timing                                    
[1,] "1"   "Gradient evaluation took 2.2e-05 seconds"
[2,] "2"   "Gradient evaluation took 2.1e-05 seconds"
[3,] "3"   "Gradient evaluation took 2.4e-05 seconds"
[4,] "4"   "Gradient evaluation took 2.5e-05 seconds"

Or you can replace grepping for “Gradient evaluation” with something about the leapfrog steps if you prefer that info.

Thanks Jonah! Here they are:

cbind(Chain = 1:4, sapply(output, function(x) grep("leapfrog", x, value = TRUE)))
     Chain                                                                                    
[1,] "1"   "1000 transitions using 10 leapfrog steps per transition would take 52.09 seconds."
[2,] "2"   "1000 transitions using 10 leapfrog steps per transition would take 51.54 seconds."
[3,] "3"   "1000 transitions using 10 leapfrog steps per transition would take 61.46 seconds."
[4,] "4"   "1000 transitions using 10 leapfrog steps per transition would take 55.13 seconds."

Seems they are very similar to the rstan ones:

Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 66.91 seconds.
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 71.45 seconds.
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 58.47 seconds.
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 55.1 seconds.

And the gradient evaluation with cmdstanr

     Chain                                            
[1,] "1"   "Gradient evaluation took 0.005209 seconds"
[2,] "2"   "Gradient evaluation took 0.005154 seconds"
[3,] "3"   "Gradient evaluation took 0.006146 seconds"
[4,] "4"   "Gradient evaluation took 0.005513 seconds"

vs the rstan one

Chain 1: Gradient evaluation took 0.006691 seconds
Chain 2: Gradient evaluation took 0.007145 seconds
Chain 3: Gradient evaluation took 0.005847 seconds
Chain 4: Gradient evaluation took 0.00551 seconds

This might be a long shot, but is your hard drive very close to 100% full, or is there some other reason why writing to your drive might be slow?

1 Like

Hi, we took some time to investigate the cmdstanr performance on my laptop. We noticed that the algorithm settings we were using in rstan are not the defaults: gamma should be 0.05 and the step size 1, while we had set 0.75 and 0.1, respectively. It is not currently possible to set gamma in cmdstanr, so the settings we were comparing were different.
Using the default settings in both cmdstanr and rstan leads to negligible time differences, and the disk space taken when saving the CSV files to retrieve the model in cmdstanr is also similar.
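
For reference, this is roughly the call that matched (a sketch reusing the object names from my original post, with everything left at the defaults except adapt_delta):

# step_size (default 1) and adapt_gamma (default 0.05) are left untouched;
# only adapt_delta is raised, as in the final comparison.
out2 <- comp_model$sample(data = stan_dat,
                          init = init_par,
                          seed = 123,
                          chains = 4,
                          parallel_chains = 4,
                          iter_sampling = 1000,
                          iter_warmup = 1000,
                          adapt_delta = 0.99)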
Thanks for your help!

3 Likes

Thanks much for reporting back on what was going on, @gioiadc.

Does your model work a lot better with those settings? If so, it might convince our devs to add them to cmdstanr and cmdstanpy.

1 Like