Slower sampling after rstan update

I just upgraded rstan from 2.26.23 to 2.32.5. I have a model that I run with a custom init function for initial values and a fixed seed in brm(). After the update, the same model on the same data samples about 20% slower. I re-ran a few other models for which I had scripts with seeds set up, and sampling is about 20% slower for all of them. Is this normal? Or is it a fluke, in the sense that setting the seed may not reproduce sampling behavior across versions?
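
For context, the checks are just refits of the same scripts with the elapsed times compared; a rough sketch with placeholder data (a data frame dat with a single column y) looks like this:

library(brms)

# rough sketch with placeholder data: refit the same model a few times and
# compare elapsed times, to separate a real slowdown from run-to-run noise
fit <- brm(y ~ 1, data = dat, seed = 123, refresh = 0)
times <- replicate(3, {
  refit <- update(fit, recompile = FALSE, seed = 123, refresh = 0)
  sum(rstan::get_elapsed_time(refit$fit))  # total warmup + sampling seconds
})
print(times)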

This is my sessionInfo() after the update (no other packages were updated besides rstan):

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.32.5       StanHeaders_2.32.5 brms_2.20.4        Rcpp_1.0.11       

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0     dplyr_1.1.3          farver_2.1.1         loo_2.6.0            fastmap_1.1.1        tensorA_0.36.2       shinystan_2.6.0      promises_1.2.1       shinyjs_2.1.0       
[10] digest_0.6.33        timechange_0.2.0     mime_0.12            lifecycle_1.0.3      ellipsis_0.3.2       processx_3.8.2       magrittr_2.0.3       posterior_1.4.1      compiler_4.3.1      
[19] rlang_1.1.1          tools_4.3.1          igraph_1.5.1         utf8_1.2.3           prettyunits_1.2.0    bridgesampling_1.1-2 htmlwidgets_1.6.2    pkgbuild_1.4.2       plyr_1.8.9          
[28] dygraphs_1.1.1.6     abind_1.4-5          miniUI_0.1.1.1       grid_4.3.1           stats4_4.3.1         fansi_1.0.5          xts_0.13.1           xtable_1.8-4         colorspace_2.1-0    
[37] inline_0.3.19        ggplot2_3.4.3        scales_1.2.1         gtools_3.9.4         cli_3.6.1            mvtnorm_1.2-3        crayon_1.5.2         generics_0.1.3       RcppParallel_5.1.7  
[46] rstudioapi_0.15.0    reshape2_1.4.4       stringr_1.5.0        shinythemes_1.2.0    bayesplot_1.10.0     parallel_4.3.1       matrixStats_1.0.0    base64enc_0.1-3      vctrs_0.6.3         
[55] Matrix_1.6-1.1       jsonlite_1.8.7       callr_3.7.3          crosstalk_1.2.0      glue_1.6.2           codetools_0.2-19     ps_1.7.5             DT_0.30              distributional_0.3.2
[64] lubridate_1.9.3      stringi_1.7.12       gtable_0.3.4         later_1.3.1          QuickJSR_1.0.6       munsell_0.5.0        tibble_3.2.1         colourpicker_1.3.0   pillar_1.9.0        
[73] htmltools_0.5.6.1    Brobdingnag_1.2-9    R6_2.5.1             shiny_1.7.5          lattice_0.21-9       markdown_1.10        backports_1.4.1      threejs_0.3.3        httpuv_1.6.11       
[82] rstantools_2.3.1.1   coda_0.19-4          gridExtra_2.3        nlme_3.1-163         checkmate_2.2.0      zoo_1.8-12           pkgconfig_2.0.3     

OK, this is actually becoming quite a substantial problem. I have a custom model that samples in about 90 minutes with cmdstanr, while it needs 7 hours with rstan. The posterior estimates are the same, and I have set the same seeds, priors, inits, etc.; it's just the sampling speed that is drastically different. This happens both on my personal Windows machine and on our Ubuntu computing server. Even something as simple as:

library(brms)
library(cmdstanr)
library(rstan)
options(mc.cores =  parallel::detectCores())

set.seed(512325)

y <- rnorm(100000,3,1.4)
dat <- data.frame(y=y)


mlcmd <- brm(bf(y ~ 1, sigma ~ 1),
             data=dat, init=0,
             backend = 'cmdstanr', seed = 6254)

mlrstan <- brm(bf(y ~ 1, sigma ~ 1),
               data=dat, seed = 6254, init=0,
               backend = 'rstan')

print(rstan::get_elapsed_time(mlcmd$fit))
print(rstan::get_elapsed_time(mlrstan$fit))

This gives, first for the cmdstanr backend (mlcmd):

        warmup sample
chain:1 52.064 45.881
chain:2 51.975 51.339
chain:3 50.823 52.220
chain:4 53.228 54.601

and then for the rstan backend (mlrstan):

        warmup sample
chain:1 86.080 78.944
chain:2 81.721 69.989
chain:3 79.837 74.472
chain:4 78.578 72.175

Here is the session info for my windows machine:

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.32.5       StanHeaders_2.32.5 cmdstanr_0.7.1     brms_2.20.4        Rcpp_1.0.10       

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0     dplyr_1.1.2          farver_2.1.1         loo_2.6.0.9000       fastmap_1.1.1        tensorA_0.36.2       shinystan_2.6.0      promises_1.2.0.1     shinyjs_2.1.0       
[10] digest_0.6.31        mime_0.12            lifecycle_1.0.3      ellipsis_0.3.2       processx_3.8.1       magrittr_2.0.3       posterior_1.4.1      compiler_4.3.1       rlang_1.1.1         
[19] tools_4.3.1          igraph_1.5.0         utf8_1.2.3           data.table_1.14.8    knitr_1.43           labeling_0.4.2       prettyunits_1.1.1    bridgesampling_1.1-2 htmlwidgets_1.6.2   
[28] curl_5.0.1           pkgbuild_1.4.1       plyr_1.8.8           dygraphs_1.1.1.6     abind_1.4-5          miniUI_0.1.1.1       withr_2.5.0          grid_4.3.1           stats4_4.3.1        
[37] fansi_1.0.4          xts_0.13.1           xtable_1.8-4         colorspace_2.1-0     inline_0.3.19        ggplot2_3.4.2        scales_1.2.1         gtools_3.9.4         cli_3.6.1           
[46] mvtnorm_1.2-2        crayon_1.5.2         generics_0.1.3       RcppParallel_5.1.7   rstudioapi_0.14      reshape2_1.4.4       stringr_1.5.0        shinythemes_1.2.0    bayesplot_1.10.0    
[55] parallel_4.3.1       matrixStats_1.0.0    base64enc_0.1-3      vctrs_0.6.3          V8_4.3.0             Matrix_1.5-4.1       jsonlite_1.8.5       callr_3.7.3          crosstalk_1.2.0     
[64] glue_1.6.2           codetools_0.2-19     ps_1.7.5             DT_0.28              distributional_0.3.2 stringi_1.7.12       gtable_0.3.3         later_1.3.1          QuickJSR_1.0.6      
[73] munsell_0.5.0        tibble_3.2.1         colourpicker_1.2.0   pillar_1.9.0         htmltools_0.5.5      Brobdingnag_1.2-9    R6_2.5.1             shiny_1.7.4          lattice_0.21-8      
[82] markdown_1.7         backports_1.4.1      threejs_0.3.3        httpuv_1.6.11        rstantools_2.3.1     coda_0.19-4          gridExtra_2.3        nlme_3.1-162         checkmate_2.2.0     
[91] xfun_0.39            zoo_1.8-12           pkgconfig_2.0.3     

And for the ubuntu machine:

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.32.5       StanHeaders_2.32.5 cmdstanr_0.7.1     brms_2.20.4        Rcpp_1.0.11       

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0     dplyr_1.1.3          farver_2.1.1         loo_2.6.0            fastmap_1.1.1        tensorA_0.36.2       shinystan_2.6.0      promises_1.2.1       shinyjs_2.1.0        digest_0.6.33       
[11] timechange_0.2.0     mime_0.12            lifecycle_1.0.3      ellipsis_0.3.2       processx_3.8.2       magrittr_2.0.3       posterior_1.4.1      compiler_4.3.1       rlang_1.1.1          tools_4.3.1         
[21] igraph_1.5.1         utf8_1.2.3           knitr_1.44           prettyunits_1.2.0    bridgesampling_1.1-2 htmlwidgets_1.6.2    pkgbuild_1.4.2       plyr_1.8.9           dygraphs_1.1.1.6     abind_1.4-5         
[31] miniUI_0.1.1.1       grid_4.3.1           stats4_4.3.1         fansi_1.0.5          xts_0.13.1           xtable_1.8-4         colorspace_2.1-0     inline_0.3.19        ggplot2_3.4.3        scales_1.2.1        
[41] gtools_3.9.4         cli_3.6.1            mvtnorm_1.2-3        crayon_1.5.2         generics_0.1.3       RcppParallel_5.1.7   rstudioapi_0.15.0    reshape2_1.4.4       stringr_1.5.0        shinythemes_1.2.0   
[51] bayesplot_1.10.0     parallel_4.3.1       matrixStats_1.0.0    base64enc_0.1-3      vctrs_0.6.3          Matrix_1.6-1.1       jsonlite_1.8.7       callr_3.7.3          crosstalk_1.2.0      glue_1.6.2          
[61] codetools_0.2-19     ps_1.7.5             DT_0.30              distributional_0.3.2 lubridate_1.9.3      stringi_1.7.12       gtable_0.3.4         later_1.3.1          QuickJSR_1.0.6       munsell_0.5.0       
[71] tibble_3.2.1         colourpicker_1.3.0   pillar_1.9.0         htmltools_0.5.6.1    Brobdingnag_1.2-9    R6_2.5.1             shiny_1.7.5          lattice_0.21-9       markdown_1.10        backports_1.4.1     
[81] threejs_0.3.3        httpuv_1.6.11        rstantools_2.3.1.1   coda_0.19-4          gridExtra_2.3        nlme_3.1-163         checkmate_2.2.0      xfun_0.40            zoo_1.8-12           pkgconfig_2.0.3   

@andrjohns @bgoodri @hsbadr

I just tested it on a brand-new server instance, installing everything from scratch and first updating to R 4.3.2 (it was 4.3.1). Same thing:

rstan sampling time: 157.217 seconds
cmdstanr sampling time: 97.031 seconds

cmdstanr testing code:

library(brms)
library(cmdstanr)
library(rstan)
options(mc.cores =  parallel::detectCores())

set.seed(512325)

y <- rnorm(100000,3,1.4)
dat <- data.frame(y=y)


stan_code <- make_stancode(bf(y ~ 1, sigma ~ 1), data=dat)
stan_data <- make_standata(bf(y ~ 1, sigma ~ 1), data=dat)

model <- cmdstanr::cmdstan_model(write_stan_file(stan_code))
cmd_fit <- model$sample(data = stan_data, 
                        init = 0,
                        chains = 1, 
                        iter_warmup = 1000,
                        iter_sampling = 1000,
                        refresh = 100, 
                        seed=346)

print(cmd_fit$time())

rstan testing code:

library(brms)
library(cmdstanr)
library(rstan)
options(mc.cores =  parallel::detectCores())
rstan_options(auto_write = TRUE)

set.seed(512325)

y <- rnorm(100000,3,1.4)
dat <- data.frame(y=y)


stan_code <- make_stancode(bf(y ~ 1, sigma ~ 1), data=dat)
stan_data <- make_standata(bf(y ~ 1, sigma ~ 1), data=dat)

model <- stan_model(model_code=stan_code)
rstan_fit <- sampling(model, data = stan_data, 
                      init = 0,
                      chains = 1, 
                      iter = 2000,
                      seed=346)
print(get_elapsed_time(rstan_fit))

So this is interesting: I checked how the O1 optimization affects both backends. I set allow_optimizations = TRUE in the rstan compile options, and stanc_options = list("O1") in the cmdstanr options.
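
For reference, a minimal sketch of how these flags can be passed when compiling the brms-generated code directly (a sketch only; the argument names are the ones mentioned above, so check them against your rstan/cmdstanr versions):

# sketch: compile the same brms-generated code with O1 on both backends
# (stan_code as created by make_stancode() in the test scripts above)
model_cmd_o1 <- cmdstanr::cmdstan_model(cmdstanr::write_stan_file(stan_code),
                                        stanc_options = list("O1"))
model_rstan_o1 <- rstan::stan_model(model_code = stan_code,
                                    allow_optimizations = TRUE)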

The improvement is dramatic in both cases, but much larger for rstan, and now rstan samples faster!

Without optimization:

  • rstan: 157 seconds
  • cmdstanr: 97 seconds

With the O1 optimization flag:

  • rstan: 27 seconds
  • cmdstanr: 34 seconds

Unfortunately, for my custom model adding the O1 optimization does nothing for the rstan sampling, and slows cmdstanr down several times (!) to roughly the rstan time… I'm very confused.

How big are the data going in and coming out?

Without seeing your model code, it’s difficult to say why optimizations would behave as they do. There are a few known instances where the optimizations actually make things slower, which is why we always encourage timing them.


Sorry, I was incorrect about the effect of optimizations on my model; that was based on my memory of a previous model version. I just refit it: O1 does nothing when fitting my model with cmdstanr, but for rstan it slows things down even further. Timings of the custom model:

cmdstanr: 240.2 s
cmdstanr, O1: 241.3 s
rstan: 538.1 s
rstan, O1: 6264.9 s (!!!)

I can post a code sample, but in any case my point is that this is not specific to my model; rstan runs much slower even for the simple example above.

UPDATE:

While O1 did not have an effect on cmdstanr in the above example, running the same code on a different server instance with the just-released CmdStan 2.34, O1 slowed down cmdstanr several times as well.

UPDATE2:

OK, figured it out: in the first instance O1 had no effect because the model was not recompiled. So, in summary, O1 slows down both cmdstanr and rstan (for my model, by more than 10 times!). I will look into the link you posted about known cases.

I wonder if "Rstan 2.26 - threads_per_chain always set to max even with rstan_options( threads_per_chain = 1 )" (#6, by stemangiola) is related? Do you see any interesting CPU usage differences between rstan and cmdstanr?
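
If you want to rule threading out, a rough sketch (untested on your setup) would be to pin both backends to a single thread and then watch CPU usage while re-running the timings:

# rough sketch: pin everything to one thread before re-running the timings,
# so any CPU-usage difference is not due to threading defaults
Sys.setenv(STAN_NUM_THREADS = "1")             # thread count read by Stan's threaded code
RcppParallel::setThreadOptions(numThreads = 1) # TBB threads used by the rstan backend
rstan::rstan_options(threads_per_chain = 1)    # the option from the linked thread

# then re-run the sampling() / $sample() calls above and watch CPU usage
# (e.g. in htop); note that cmdstanr models are single-threaded per chain
# unless compiled with cpp_options = list(stan_threads = TRUE)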

I will check. In the meantime, I am seeing some weird things while trying to figure out why the O1 optimization hurts speed in my model. Since my model is complex, I am running tests on a simplified setup: a hierarchical linear regression with a custom function for the normal_lpdf.

Here is the Stan model. The code was generated with brms, and I replaced normal_lpdf with my_normal_lpdf:

  // generated with brms 2.20.4
  functions {
    real my_normal_lpdf(vector y, vector mu, vector sigma) {
      real out = sum(-(y-mu)^2 ./ sigma^2 / 2 - log(sigma) - 0.5*log(2*pi()));
      return(out);
    }
  }
  data {
    int<lower=1> N;  // total number of observations
    vector[N] Y;  // response variable
    // data for group-level effects of ID 1
    int<lower=1> N_1;  // number of grouping levels
    int<lower=1> M_1;  // number of coefficients per level
    array[N] int<lower=1> J_1;  // grouping indicator per observation
    // group-level predictor values
    vector[N] Z_1_1;
    int prior_only;  // should the likelihood be ignored?
  }
  transformed data {
  }
  parameters {
    real Intercept;  // temporary intercept for centered predictors
    real Intercept_sigma;  // temporary intercept for centered predictors
    vector<lower=0>[M_1] sd_1;  // group-level standard deviations
    array[M_1] vector[N_1] z_1;  // standardized group-level effects
  }
  transformed parameters {
    vector[N_1] r_1_1;  // actual group-level effects
    real lprior = 0;  // prior contributions to the log posterior
    r_1_1 = (sd_1[1] * (z_1[1]));
    lprior += student_t_lpdf(Intercept | 3, 9.6, 3.4);
    lprior += student_t_lpdf(Intercept_sigma | 3, 0, 2.5);
    lprior += student_t_lpdf(sd_1 | 3, 0, 3.4)
    - 1 * student_t_lccdf(0 | 3, 0, 3.4);
  }
  model {
    // likelihood including constants
    if (!prior_only) {
      // initialize linear predictor term
      vector[N] mu = rep_vector(0.0, N);
      // initialize linear predictor term
      vector[N] sigma = rep_vector(0.0, N);
      mu += Intercept;
      sigma += Intercept_sigma;
      for (n in 1:N) {
        // add more terms to the linear predictor
        mu[n] += r_1_1[J_1[n]] * Z_1_1[n];
      }
      sigma = exp(sigma);
      target += my_normal_lpdf(Y | mu, sigma);
    }
    // priors including constants
    target += lprior;
    target += std_normal_lpdf(z_1[1]);
  }
  generated quantities {
    // actual population-level intercept
    real b_Intercept = Intercept;
    // actual population-level intercept
    real b_sigma_Intercept = Intercept_sigma;
  }

For this model I get the following running times with cmdstanr:
Without optimization: 113 sec
With O1 optimization: 86 sec

So far so good!

But then, if we change the likelihood function minimally, e.g. from

      real out = sum(-(y-mu)^2 ./ sigma^2 / 2 - log(sigma) - 0.5*log(2*pi()));

to

      real out = sum(-(y-mu)^2 ./ sigma^2 / 2 - 0.5*log(2*pi()) - log(sigma));

(just switching the order of operations; yes, I know I can take the constant term out of the sum entirely)

I get running times of:
Without optimization: 94 sec
With O1 optimization: 145 sec

This is very strange. In summary, we get a speed-up with O1 for the first model, a slowdown for the second, and only a minor difference between the two without optimization (total_time in seconds):

            model optim total_time
1 sigma_before_pi  NULL  113.17707
2 sigma_before_pi    O1   86.81143
3  sigma_after_pi  NULL   97.73978
4  sigma_after_pi    O1  145.26210

Attached is the full R code to generate this. I will now try to see if there is something similar in my own custom model. But this behavior is puzzling: such a small change to the Stan code has a dramatic effect on whether the optimization helps or hurts.

test_cmdstanr9.R (6.5 KB)
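
In outline, the timing loop looks roughly like this (a simplified sketch rather than the attached script verbatim; code_before / code_after are placeholders for the two Stan programs above, and stan_data is the brms-generated data list):

library(cmdstanr)

# simplified sketch of the timing loop; code_before / code_after hold the two
# Stan programs, stan_data the data list from make_standata()
variants <- list(sigma_before_pi = code_before, sigma_after_pi = code_after)
timings <- data.frame()
for (m in names(variants)) {
  for (opt in c("NULL", "O1")) {
    stanc_opts <- if (opt == "O1") list("O1") else list()
    mod <- cmdstan_model(write_stan_file(variants[[m]]),
                         stanc_options = stanc_opts,
                         force_recompile = TRUE)  # avoid the cached-model issue above
    fit <- mod$sample(data = stan_data, chains = 1, seed = 346,
                      iter_warmup = 1000, iter_sampling = 1000, refresh = 0)
    timings <- rbind(timings, data.frame(model = m, optim = opt,
                                         total_time = fit$time()$total))
  }
}
print(timings)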

That sounds like an old bug, but doesn’t seem to be the same from what I can tell locally. The primary optimization is what we call the “matrix memory pattern optimization” and it has the same outcome in both of those models for me. In fact, the only difference in the generated C++ between the two is the swapped order of those two functions.

I ran the timings myself (in seconds), with slightly less dramatic results:

    sigma_before_pi   sigma_after_pi
O0  66                64
O1  78                65

I'm not really willing to draw much of a conclusion from the difference I observed on my machine.

The fact that (without optimizations) your timings differed by 16 seconds between the two models suggests there are a lot of other processes running on your machine, or perhaps your hardware has a fast-core/slow-core distinction. Inconsistent timings could account for the difference in rstan as well, but it seems pretty dramatic and repeatable, so I doubt that is the explanation in this case.

This doesn’t directly help, but have you considered moving to cmdstanr?

Also, the reduce_sum function can do wonders by allowing within-chain parallelization.
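
With brms this only needs the threading() argument and a backend that supports it; a minimal sketch based on the toy example above (the 4 threads per chain is an arbitrary choice for illustration):

# sketch: let brms generate reduce_sum code for within-chain parallelization
fit_threaded <- brm(bf(y ~ 1, sigma ~ 1),
                    data = dat, init = 0, seed = 6254,
                    backend = "cmdstanr",
                    threads = threading(4))  # 4 threads per chain (illustrative)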

I think you are right for this example; I couldn't replicate the results today either. But for my model the effect is reliable: I tested each version several times today on a separate but identical server instance. I'll try to put together a minimal reproducible example. Thanks a lot for looking into this!

Personally, I did switch after this. But the model will be deployed in a package that uses brms to generate the linear-model syntax for the code, and I'd prefer not to force users to switch.
