Disappointing execution time in cmdstan 2.27.0 vs. 2.25.0

Michael_Peck · August 4, 2021, 5:13pm

I recently got Ubuntu 20.04 running under WSL2 on one of my Windows 10 PCs and have been doing some performance comparisons. It appears that CmdStan 2.27.0 takes anywhere from 25% to 100% or more longer than 2.25.0 to run the same model with the same data. Here is a sample Stan model:

functions {
  real partial_sum(int[] ind, int start, int end, vector y, matrix X,
                   real a, vector b, real sigma) {
    return normal_id_glm_lpdf(y[start:end] | X[start:end,], a, b, sigma);
  }
}
data {
  int<lower=0> N;
  int<lower=0> M;
  matrix[N, M] X;
  vector[N] y;
}
transformed data {
  int grainsize = 1;
  int ind[N] = rep_array(1, N);
}
parameters {
  real a;
  real<lower=0> sigma;
  vector[M] b;
}
model { 
  a ~ normal(0, 5);
  sigma ~ normal(0, 10);
  b ~ std_normal();
  
  target += reduce_sum(partial_sum, ind, grainsize, y, X, a, b, sigma);
}

I create fake data for it with the following R function:

fake_multi <- function(N=100000, M=50, alpha=1, sigma=0.25) {
  X <- matrix(rnorm(N*M), N, M)
  beta <- rnorm(M)
  y <- as.vector(alpha + X %*% beta + rnorm(N, sd=sigma))
  data_list <- list(N=N, M=M, X=X, y=y)
  list(data_list=data_list, beta=beta)
}
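
For anyone reproducing this, here is a minimal sketch of how the generator might be called to build the data list handed to `$sample()`. The seed, the reduced `N` and `M`, and the name `reg_dat` (chosen to match the sampling call in this post) are assumptions; the timings reported here presumably used the function's defaults.

```r
# Generator copied from the post so this snippet runs on its own.
fake_multi <- function(N=100000, M=50, alpha=1, sigma=0.25) {
  X <- matrix(rnorm(N*M), N, M)
  beta <- rnorm(M)
  y <- as.vector(alpha + X %*% beta + rnorm(N, sd=sigma))
  data_list <- list(N=N, M=M, X=X, y=y)
  list(data_list=data_list, beta=beta)
}

set.seed(1)                      # assumed seed, only for reproducibility
sim <- fake_multi(N=1000, M=5)   # assumed small sizes for a quick check
reg_dat <- sim$data_list         # matches the Stan data block: N, M, X, y
str(reg_dat)
```

Keeping `sim$beta` around makes it easy to compare the posterior means of `b` against the values that generated the data.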

I have the release source distributions of cmdstan-2.25.0 and cmdstan-2.27.0 and both were built with the same compiler flags, specifically STAN_THREADS=true and STAN_CPP_OPTIMS=true.

I ran the models using cmdstanr with, for example

time_t_25 <- system.time(reg_t_25 <- multireg2t$sample(reg_dat, chains=4, parallel_chains=4, threads_per_chain=4, seed=987654L))

then deleted the executable and recompiled with cmdstan-2.27.0. Here are some time comparisons:

> reg_t_25$time()
$total
[1] 79.65344

$chains
  chain_id warmup sampling  total
1        1 33.480   25.456 58.936
2        2 43.502   23.439 66.941
3        3 33.337   25.780 59.117
4        4 59.370   18.864 78.234

> reg_t_27$time()
$total
[1] 100.9266

$chains
  chain_id warmup sampling  total
1        1 42.651   32.131 74.782
2        2 53.790   30.842 84.632
3        3 43.951   33.104 77.055
4        4 72.114   27.417 99.531
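
To put a number on the gap, the reported totals can be turned into a relative slowdown. This is only arithmetic on the figures printed above; the variable names are made up for the illustration. It works out to about 27% overall, and roughly 26% to 30% per chain.

```r
# Wall-clock totals reported by $time()$total (seconds).
total_25 <- 79.65344
total_27 <- 100.9266

# Per-chain totals from the same output, chains 1 to 4.
chains_25 <- c(58.936, 66.941, 59.117, 78.234)
chains_27 <- c(74.782, 84.632, 77.055, 99.531)

# Percent extra time taken by 2.27.0 relative to 2.25.0.
slowdown_pct <- 100 * (total_27 / total_25 - 1)
chain_slowdown_pct <- 100 * (chains_27 / chains_25 - 1)

round(slowdown_pct, 1)
round(chain_slowdown_pct, 1)
```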

The difference is even larger for the models I’m actually running at the moment. Here are some timing comparisons running in both WSL and Windows on the same PC – the Win side has Rtools40 installed. I have gcc 9.3 in WSL. These are also run with cmdstanr and then converted to stanfit objects:

> get_elapsed_time(sfit.146$stanfit)
         warmup  sample
chain:1 2408.44 1791.58
chain:2 2340.84 1795.19
chain:3 2405.28 1810.34
chain:4 2469.86 1814.97
> get_elapsed_time(sfit.146_25$stanfit)
         warmup  sample
chain:1 1143.73 898.871
chain:2 1098.66 900.943
chain:3 1238.32 878.939
chain:4 1218.81 871.352
> get_elapsed_time(sfit.146_win$stanfit)
         warmup  sample
chain:1 4590.94 3262.52
chain:2 3844.77 5569.48
chain:3 3885.73 2995.80
chain:4 3846.82 5108.44
> get_elapsed_time(sfit.146_win_25$stanfit)
         warmup  sample
chain:1 1369.74 1093.47
chain:2 1475.25 1065.62
chain:3 1377.00 1671.71
chain:4 1401.22 1062.80
>
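
The same comparison for these longer runs can be sketched as below. Two assumptions are baked in: that the `_25` suffix marks the 2.25.0 builds and `_win` the Windows runs (the object names suggest this, but it is not spelled out), and that the chains ran in parallel, so the slowest chain stands in for wall time. On those assumptions 2.27.0 comes out roughly 2x slower in WSL and roughly 3x slower on Windows.

```r
# Per-chain warmup and sampling times (seconds) copied from the output above.
# Labels are assumed from the object names, not stated explicitly.
times <- list(
  wsl_27 = cbind(warmup = c(2408.44, 2340.84, 2405.28, 2469.86),
                 sample = c(1791.58, 1795.19, 1810.34, 1814.97)),
  wsl_25 = cbind(warmup = c(1143.73, 1098.66, 1238.32, 1218.81),
                 sample = c(898.871, 900.943, 878.939, 871.352)),
  win_27 = cbind(warmup = c(4590.94, 3844.77, 3885.73, 3846.82),
                 sample = c(3262.52, 5569.48, 2995.80, 5108.44)),
  win_25 = cbind(warmup = c(1369.74, 1475.25, 1377.00, 1401.22),
                 sample = c(1093.47, 1065.62, 1671.71, 1062.80))
)

# Slowest chain (warmup + sampling) per run, as a proxy for wall time.
slowest <- sapply(times, function(m) max(rowSums(m)))

ratio_wsl <- slowest[["wsl_27"]] / slowest[["wsl_25"]]
ratio_win <- slowest[["win_27"]] / slowest[["win_25"]]
round(c(wsl = ratio_wsl, windows = ratio_win), 2)
```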

sessionInfo() for the fake data runs:

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] cmdstanr_0.4.0.9000 zernike_3.7.1

loaded via a namespace (and not attached):
 [1] rstan_2.21.2         tidyselect_1.1.1     xfun_0.24
 [4] purrr_0.3.4          V8_3.4.2             colorspace_2.0-2
 [7] vctrs_0.3.8          generics_0.1.0       stats4_4.1.0
[10] loo_2.4.1            utf8_1.2.2           rlang_0.4.11
[13] pkgbuild_1.2.0       pillar_1.6.1         glue_1.4.2
[16] withr_2.4.2          distributional_0.2.2 matrixStats_0.60.0
[19] lifecycle_1.0.0      posterior_1.0.1      munsell_0.5.0
[22] gtable_0.3.0         codetools_0.2-18     inline_0.3.19
[25] knitr_1.33           callr_3.7.0          ps_1.6.0
[28] curl_4.3.2           parallel_4.1.0       fansi_0.5.0
[31] Rcpp_1.0.7           scales_1.1.1         backports_1.2.1
[34] checkmate_2.0.0      RcppParallel_5.1.4   StanHeaders_2.21.0-7
[37] jsonlite_1.7.2       abind_1.4-5          farver_2.1.0
[40] gridExtra_2.3        tensorA_0.36.2       ggplot2_3.3.5
[43] processx_3.5.2       dplyr_1.0.7          grid_4.1.0
[46] cli_3.0.1            tools_4.1.0          magrittr_2.0.1
[49] tibble_3.1.3         crayon_1.4.1         pkgconfig_2.0.3
[52] ellipsis_0.3.2       data.table_1.14.0    prettyunits_1.1.1
[55] R6_2.5.0             compiler_4.1.0

I have to say I was a little skeptical that R and Stan would perform better in WSL than in Windows on the same hardware, but I’m happy to have proved myself wrong. Now that I also have an X server working I have no reason to go back to R on Windows.

wds15 · August 4, 2021, 6:11pm

Index checking is on by default in 2.27. You can turn it off; please dig into the release notes…

stevebronder · August 4, 2021, 6:38pm

@Michael_Peck yes, I would bet that @wds15 is right. Check the PR here for more info:

github.com/stan-dev/stanc3

Adding back in rvalue checks on 1d indexes

stan-dev:master ← stan-dev:revert-revert-521

opened 01:01PM - 10 Mar 21 UTC

bbbales2

+3928 -1222

## Release notes

This is a revert of #656 to put the fix in #521 back in to fixes #489, fixes #776, fixes #847, fixes #731.

The revert in #656 was cause there was a substantial performance hit (like 30%) of including this fix and we were in code freeze and the simplest thing was revert.

I think it would be good to put this fix back in even without an immediate fix to the performance thing. We can re-evaluate the performance thing (lots of stuff has changed between error checks and indexing) and fix that later, but at least then the segfaults are out of the way.

## Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)

mike-lawrence · August 4, 2021, 9:29pm

So once a model is debugged, what set of flags achieve max performance? I’ve seen STAN_NO_RANGE_CHECKS, STAN_CPP_OPTIMS, and @stevebronder mentioned elsewhere that -march=native -mtune=native -O3 -g0 should also be used. Anything else?

Michael_Peck · August 4, 2021, 9:42pm

OK, thanks for the quick response. I rebuilt 2.27.0 with STAN_NO_RANGE_CHECKS=true along with STAN_THREADS=true and STAN_CPP_OPTIMS=true in make/local. When I ran the same fake data linear regression model the newly compiled executable did not run threaded even though I included cpp_options=list(stan_threads=TRUE) in the call to cmdstan_model(). It did run the 4 chains in parallel as requested, but the CPU was only using 4 cores. I have a 16 physical core CPU that benefits hugely from threading when available.

Reverting back to the previous cmdstan build restores threading, but with the same execution time penalty. Thanks for the link to the PR. This one seems relevant too. I will keep investigating.
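
For reference, one way the flags discussed in this thread might be passed per model from cmdstanr, rather than through make/local, is sketched below. Passing `STAN_CPP_OPTIMS` and `STAN_NO_RANGE_CHECKS` through `cpp_options` is an assumption here, the file name is hypothetical, and nothing in this thread establishes whether doing it this way avoids the loss of threading described above.

```r
# Flags named in the thread, written as a cmdstanr cpp_options list.
# Only stan_threads = TRUE is confirmed by the posts; the other two entries
# are assumed to be forwarded to make in the same way.
cpp_opts <- list(
  stan_threads = TRUE,
  STAN_CPP_OPTIMS = TRUE,
  STAN_NO_RANGE_CHECKS = TRUE
)

stan_file <- "multireg2t.stan"   # hypothetical file name

# Compile only if cmdstanr is installed and the model file is present.
if (requireNamespace("cmdstanr", quietly = TRUE) && file.exists(stan_file)) {
  mod <- cmdstanr::cmdstan_model(stan_file,
                                 cpp_options = cpp_opts,
                                 force_recompile = TRUE)
  print(mod$exe_file())          # path of the freshly built executable
}
```

Forcing a recompile matters when switching flags, since an existing executable built with different options would otherwise be reused.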
