Disappointing execution time in cmdstan 2.27.0 vs. 2.25.0

I recently got Ubuntu 20.04 running under WSL2 on one of my Windows 10 PCs and have been doing some performance comparisons. It appears that CmdStan 2.27.0’s execution time takes from 25% to 100% or more longer than 2.25.0 on the same model with the same data. Here is a sample Stan model:

functions {
  real partial_sum(int[] ind, int start, int end, vector y, matrix X,
                   real a, vector b, real sigma) {
  return normal_id_glm_lpdf(y[start:end] | X[start:end,], a, b, sigma);
  }
}
data {
  int<lower=0> N;
  int<lower=0> M;
  matrix[N, M] X;
  vector[N] y;
}
transformed data {
  int grainsize = 1;
  int ind[N] = rep_array(1, N);
}
parameters {
  real a;
  real<lower=0> sigma;
  vector[M] b;
}
model { 
  a ~ normal(0, 5);
  sigma ~ normal(0, 10);
  b ~ std_normal();
  
  target += reduce_sum(partial_sum, ind, grainsize, y, X, a, b, sigma);
}

I create fake data for it with the following R function:

fake_multi <- function(N=100000, M=50, alpha=1, sigma=0.25) {
  X <- matrix(rnorm(N*M), N, M)
  beta <- rnorm(M)
  y <- as.vector(alpha + X %*% beta + rnorm(N, sd=sigma))
  data_list <- list(N=N, M=M, X=X, y=y)
  list(data_list=data_list, beta=beta)
}

I have the release source distributions of cmdstan-2.25.0 and cmdstan-2.27.0 and both were built with the same compiler flags, specifically STAN_THREADS=true and STAN_CPP_OPTIMS=true.

I ran the models using cmdstanr with, for example

time_t_25 <- system.time(reg_t_25 <- multireg2t$sample(reg_dat, chains=4, parallel_chains=4, threads_per_chain=4, seed
=987654L))

then deleted the executable and recompiled with cmdstan-2.27.0. Here are some time comparisons:

> reg_t_25$time()
$total
[1] 79.65344

$chains
  chain_id warmup sampling  total
1        1 33.480   25.456 58.936
2        2 43.502   23.439 66.941
3        3 33.337   25.780 59.117
4        4 59.370   18.864 78.234

> reg_t_27$time()
$total
[1] 100.9266

$chains
  chain_id warmup sampling  total
1        1 42.651   32.131 74.782
2        2 53.790   30.842 84.632
3        3 43.951   33.104 77.055
4        4 72.114   27.417 99.531

The difference is even larger for the models I’m actually running at the moment. Here are some timing comparisons running in both WSL and Windows on the same PC – the Win side has Rtools40 installed. I have gcc 9.3 in WSL. These are also run with cmdstanr and then converted to stanfit objects:

> get_elapsed_time(sfit.146$stanfit)
         warmup  sample
chain:1 2408.44 1791.58
chain:2 2340.84 1795.19
chain:3 2405.28 1810.34
chain:4 2469.86 1814.97
> get_elapsed_time(sfit.146_25$stanfit)
         warmup  sample
chain:1 1143.73 898.871
chain:2 1098.66 900.943
chain:3 1238.32 878.939
chain:4 1218.81 871.352
> get_elapsed_time(sfit.146_win$stanfit)
         warmup  sample
chain:1 4590.94 3262.52
chain:2 3844.77 5569.48
chain:3 3885.73 2995.80
chain:4 3846.82 5108.44
> get_elapsed_time(sfit.146_win_25$stanfit)
         warmup  sample
chain:1 1369.74 1093.47
chain:2 1475.25 1065.62
chain:3 1377.00 1671.71
chain:4 1401.22 1062.80
>

sessionInfo() for the fake data runs:

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] cmdstanr_0.4.0.9000 zernike_3.7.1

loaded via a namespace (and not attached):
 [1] rstan_2.21.2         tidyselect_1.1.1     xfun_0.24
 [4] purrr_0.3.4          V8_3.4.2             colorspace_2.0-2
 [7] vctrs_0.3.8          generics_0.1.0       stats4_4.1.0
[10] loo_2.4.1            utf8_1.2.2           rlang_0.4.11
[13] pkgbuild_1.2.0       pillar_1.6.1         glue_1.4.2
[16] withr_2.4.2          distributional_0.2.2 matrixStats_0.60.0
[19] lifecycle_1.0.0      posterior_1.0.1      munsell_0.5.0
[22] gtable_0.3.0         codetools_0.2-18     inline_0.3.19
[25] knitr_1.33           callr_3.7.0          ps_1.6.0
[28] curl_4.3.2           parallel_4.1.0       fansi_0.5.0
[31] Rcpp_1.0.7           scales_1.1.1         backports_1.2.1
[34] checkmate_2.0.0      RcppParallel_5.1.4   StanHeaders_2.21.0-7
[37] jsonlite_1.7.2       abind_1.4-5          farver_2.1.0
[40] gridExtra_2.3        tensorA_0.36.2       ggplot2_3.3.5
[43] processx_3.5.2       dplyr_1.0.7          grid_4.1.0
[46] cli_3.0.1            tools_4.1.0          magrittr_2.0.1
[49] tibble_3.1.3         crayon_1.4.1         pkgconfig_2.0.3
[52] ellipsis_0.3.2       data.table_1.14.0    prettyunits_1.1.1
[55] R6_2.5.0             compiler_4.1.0

I have to say I was a little skeptical that R and Stan would perform better in WSL than in Windows on the same hardware, but I’m happy to have proved myself wrong. Now that I also have an X server working I have no reason to go back to R on Windows.

1 Like

Index checking is on by default in 2.27…you can turn it off, please dig in the release notes…

@Michael_Peck yes I would bet that @wds15 is right, check the PR here for more info

So once a model is debugged, what set of flags achieve max performance? I’ve seen STAN_NO_RANGE_CHECKS, STAN_CPP_OPTIMS, and @stevebronder mentioned elsewhere that -march=native -mtune=native -O3 -g0 should also be used. Anything else?

OK, thanks for the quick response. I rebuilt 2.27.0 with STAN_NO_RANGE_CHECKS=true along with STAN_THREADS=true and STAN_CPP_OPTIMS=true in make/local. When I ran the same fake data linear regression model the newly compiled executable did not run threaded even though I included cpp_options=list(stan_threads=TRUE) in the call to cmdstan_model(). It did run the 4 chains in parallel as requested, but the CPU was only using 4 cores. I have a 16 physical core CPU that benefits hugely from threading when available.

Reverting back to the previous cmdstan build restores threading, but with the same execution time penalty. Thanks for the link to the PR. This one seems relevant too. I will keep investigating.