Rstan 2.19.2 slower than 2.18.1

I have recently noticed that my tests take longer to run and it appears to be due to the latest version of rstan (2.19.2). To verify this, I ran some of our Stan code with versions 2.18.1 and 2.19.2 for 10 different seeds, and observed that the latest version was consistently running around 10% slower than the previous version with 2 chains and 200 iterations. To minimise measurement noise I ran the calculations from safe mode in Windows. My R code is below along with two CSV attachments containing data, which should provide a runnable example. Has anyone else noticed a difference in performance?

data <- list(C = 3L, R = 380L, RS = 3040L, V = 13,
             prior_mean = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
             prior_sd = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5))

data$X <- array(c(as.matrix(read.csv("xdata.csv"))), dim = c(3040, 3, 13))
data$Y <- read.csv("ydata.csv")[[1]]
data$S <- rep(8, 380)

options(mc.cores=parallel::detectCores())

library(rstan)
stan.model <- stan_model("hb.stan")
result <- sampling(stan.model, data = data,
                      chains = 2, iter = 200,
                      control = list(max_treedepth = 10,
                                     adapt_delta = 0.8),
                      init = function () list(L_omega = diag(13)))

xdata.csv (328.5 KB) ydata.csv (8.9 KB)hb.stan (1014 Bytes)

I assume this is something that would concern Stan developers and users who care about performance, as the difference in run time could be hours with larger examples?

I’m sorry to confirm this report using CmdStan 2.18.1, 2.19.1, and 2.20.0

for CmdStan, I generated two Rdump files - hb.data.R and hb.init.R - I hope that the latter corresponds to the RStan init function.

I ran this on my machine a few times for each release - one chain only, same seed, and got consistent results:

time ./hb random seed=123456 data file=hb.data.R init=hb.init.R sample num_samples=200 num_warmup=200 adapt delta=0.8

2.18.1:

 Elapsed Time: 177.458 seconds (Warm-up)
               56.1166 seconds (Sampling)
               233.575 seconds (Total)

2.20.0

 Elapsed Time: 190.415 seconds (Warm-up)
               60.1722 seconds (Sampling)
               250.588 seconds (Total)

I compared the C++ code for the model - all the calls to the math library are the same - attached is an inventory of calls to mathlib in the model.
has anything changes w/r/t these calculations? cholesky factor code? std_normal?
hb-hpp-stan-math-calls.txt (4.1 KB) hb.data.R (461.4 KB) hb.init.R (559 Bytes)

@seantalts and @syclik - any thoughts on what’s slower?

There were recent changes to the beta function and lgamma was switched from boost:: to std::, the latter one known to cause a performance regression for the threaded case (there is a PR up https://github.com/stan-dev/math/pull/1255). I think it should be between these two.

From my notebook:
on develop:
Elapsed Time: 164.59 seconds (Warm-up)
36.4511 seconds (Sampling)
201.042 seconds (Total)

Stan Math on the bugfix/issue-1250-lgamma branch from @wds15:

Elapsed Time: 140.701 seconds (Warm-up)
38.5668 seconds (Sampling)
179.268 seconds (Total)

I am fairly sure that is it Its more of hunch, but someone should double check, my timing results seem quite noisy after I repeated timing a few times. Anyhow that PR should get merged ASAP anyways to at least remove the threading issue on develop, as Sebastian mentioned in the meeting today.

I had the same thought and did as well benchmarks which show (I use only 100 warmup + 100 iterations):

2.18.1
 Elapsed Time: 104.734 seconds (Warm-up)
               85.1479 seconds (Sampling)
               189.882 seconds (Total)

develop:
 Elapsed Time: 107.946 seconds (Warm-up)
               94.6883 seconds (Sampling)
               202.634 seconds (Total)

bugfix/issue-1250-lgamma where I just merged in develop:
 Elapsed Time: 93.0542 seconds (Warm-up)
               83.8694 seconds (Sampling)
               176.924 seconds (Total)

The lgamma PR brings a faster log gamma function and also speeds up the digamma function (almost 2x). This model seems to benefit from these changes; so merging the PR sounds great.

Thanks everyone for looking into this! I’m glad that the issue could be replicated and a fix has been found. Could someone provide an indication of when this fix will be released in rstan?

The next release is October 18th 2019 and historically RStan trails Stan releases by about 3 months, so I’d guess in about 6 months. To get access to the fix sooner, once that PR is merged you could use CmdStan or CmdStanPy with the develop branch. We’re hoping to develop a CmdStanR version as well that will enable similar functionality on R.

It might also be worth to try installing StanHeaders from GitHub, after the bug fix is in math/develop. This has worked previously for me.

Does lchoose internally call lgamma and hence would be affected by this?

Yes.

lchoose(a, b)
  = lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1);

@bgoodri do you have pointers on how to get that bug fix for a user running Rstan?

Or is there a way to go back to 2.18.1 till the bug is fixed?

@Raghu_Naik, you can use devtools to install older versions from CRAN or GitHub.

Do you have more precise measurements of how much faster the new PR is? I have some code which depends heavily on lchoose and hence am quite excited to get this fix. Fingers crossed the PR gets merged soon! Thanks!