RStan is slower than Pystan


I had to do stan_rdump from Python because the R version was silently failing to create a data file, but in the end the CmdStan version reports the same timings as PyStan.


Thanks a lot. Not good.

We will need to look into this. If you can share your example (with any data) that would be great; otherwise it takes a moment longer to dive in.


I’m trying to. The first roadblock I ran into is that the original script doesn’t run. When I source it, the result is:

> source('slow.R')
Loading required package: lattice
Loading required package: ggplot2
Need help? Try the ggplot2 mailing list:
Loading required package: StanHeaders
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
Error in `[<-`(`*tmp*`, i, ((i - 1) * length(h3) + 1):(i * length(h3)),  : 
  subscript out of bounds


Summary: I ran it on Mac OS X with RStan 2.17.3 and Apple LLVM version 7.0.2 (clang-700.1.81), and it ran at roughly the same speed as reported here for PyStan.

I couldn’t understand your data simulation script, so I wrote one directly in Stan, implementing the half-normal with fabs.

data {
  int K;
  int N;
  int P;
  real a;
  real b;
  real alpha;
  real lambda;
}
generated quantities {
  real<lower=0> sigma;
  simplex[K] W[N];
  vector<lower=0>[K] H[P];
  real y[N, P];

  sigma = inv_gamma_rng(a, b);
  for (n in 1:N)
    W[n] = dirichlet_rng(rep_vector(alpha, K));
  for (p in 1:P)
    for (k in 1:K)
      H[p, k] = exponential_rng(lambda);
  for (n in 1:N)
    for (p in 1:P)
      y[n, p] = fabs(normal_rng(dot_product(W[n], H[p]), sigma));
}

I then simulated and fit using the OP’s model posted above with this:


consts_hyperparams <- list(K = 3, N = 100, P = 10,
                           a = 1, b = 1, alpha = 1, lambda = 0.1)
sim <- stan("sim.stan", data = consts_hyperparams,
            algorithm = "Fixed_param",
            chains = 1, warmup = 0, iter = 1)
y_sim <- extract(sim)$y[1, , ]


data_sim <- list(K = 3, N = 100, P = 10,
                 a = 1, b = 1, dir_alpha = 1, exp_lambda = 0.1,
                 y = abs(y_sim))

fit <- stan("slow.stan", data = data_sim, chains = 1, iter = 2000)

The result is in line with what you reported for PyStan:

 Elapsed Time: 17.2228 seconds (Warm-up)
               11.1658 seconds (Sampling)
               28.3886 seconds (Total)

This is on my now-ancient MacBook Pro (Retina, mid-2012, 2.3 GHz i7, 6 GB 1600 MHz DDR3). I’m running Xcode tools that are probably old, because I’m still on the previous OS:

~/temp2/slow$ clang++ --version
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin14.5.0
Thread model: posix

And here’s my session information:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.17.3       StanHeaders_2.17.2 ggplot2_2.2.1     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15     codetools_0.2-15 grid_3.3.2       plyr_1.8.4       gtable_0.2.0     stats4_3.3.2    
 [7] scales_0.4.1     rlang_0.1.4      lazyeval_0.2.0   tools_3.3.2      munsell_0.4.3    inline_0.3.14   
[13] colorspace_1.3-2 gridExtra_2.2.1  tibble_1.3.4    

It also roughly recovered sigma, though not quite within the 95% posterior interval:

> extract(sim)$sigma
[1] 4.07

> print(fit, "sigma", probs=c(0.025, 0.5, 0.975))
      mean se_mean  sd 2.5%  50% 97.5% n_eff Rhat
sigma 3.76       0 0.1 3.57 3.75  3.95  1000    1

Nothing I’d worry about with this kind of model.


As an added bonus, here’s how to code the model to run more than twice as fast:

functions {
  matrix to_matrix(vector[] a) {
    int M = size(a);
    int N = rows(a[1]);
    matrix[M, N] a_mat;
    for (m in 1:M)
      for (n in 1:N)
        a_mat[m, n] = a[m, n];
    return a_mat;
  }
}
data {
  int<lower=0> N;
  int<lower=0> P;
  int<lower=0> K;
  matrix<lower=0>[N, P] y;
  real<lower=0> exp_lambda;
  real<lower=0> dir_alpha;
  real<lower=0> a;
  real<lower=0> b;
}
transformed data {
  vector<lower=0>[K] alpha = rep_vector(dir_alpha, K);
}
parameters {
  simplex[K] W[N];
  matrix<lower=0>[K, P] H;
  real<lower=0> sigma;
}
model {
  sigma ~ inv_gamma(a, b);
  to_vector(H) ~ exponential(exp_lambda);
  to_vector(y) ~ normal(to_vector(to_matrix(W) * H), sigma);
}

The runtime is:

 Elapsed Time: 6.49639 seconds (Warm-up)
               3.8764 seconds (Sampling)
               10.3728 seconds (Total)

and the result’s the same; the previous code took 28.3886 seconds (Total).

The trick is converting W * H to a single matrix multiplication, which is much faster than N * P separate dot products because there are really good matrix multiplication algorithms (the savings grow much bigger for larger matrices). I also dropped the Dirichlet prior because with alpha = 1 it’s uniform, which is the default. And I vectorized the other distributions. The copies involved in to_vector and friends are cheap compared to the savings.
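The same effect is easy to demonstrate outside of Stan. Here’s a quick NumPy sketch (using the same N = 100, K = 3, P = 10 dimensions as the example; the repetition count is arbitrary) comparing one matrix multiply against the equivalent nest of dot products:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
N, K, P = 100, 3, 10
W = rng.random((N, K))  # stand-in for the N simplex rows
H = rng.random((K, P))

# One call into an optimized matrix-multiply routine.
t0 = time.perf_counter()
for _ in range(1000):
    WH = W @ H
t_matmul = time.perf_counter() - t0

# N * P separate dot products, as in the original model.
t0 = time.perf_counter()
for _ in range(1000):
    WH_loop = np.empty((N, P))
    for n in range(N):
        for p in range(P):
            WH_loop[n, p] = W[n] @ H[:, p]
t_loops = time.perf_counter() - t0

assert np.allclose(WH, WH_loop)  # identical results either way
print(f"matmul: {t_matmul:.4f}s  dot-product loops: {t_loops:.4f}s")
```

Even at these small sizes the single multiply wins easily, and the gap widens quickly as the matrices grow.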

It’s clunky to do these data-type conversions. We need to add the to_matrix function that I implemented here to Stan itself as a built-in.

Sorry, I don’t know what’s wrong with your install of RStan. This isn’t a particularly intensive model at this data size.


Dear all, thank you for your help and the good discussion. I tried the same calculation on another computer and the PyStan and RStan timings became close enough, so I thought there was something wrong with my original computer. However, when I tried again on that computer, this time I deleted the .rds file left over from the previous compilation of the Stan file and recompiled. Now the speed of RStan is normal again.

So I think the previous .rds file was from unoptimized code and was loaded automatically when I started RStan, even though I had already changed my Stan file. PyStan doesn’t have an auto_write option, so it recompiled the Stan file after I changed it. Or maybe my .rds file was just corrupted.


Here is an example that exhibits the same behavior for me.

model.stan (710 Bytes)
sample_data.txt (43.0 KB)

sample_data.txt should be renamed to .json – current discourse settings do not let you upload json files.

R code:

> library("rstan")
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
> library("rjson")
> data <- fromJSON(file="sample_data.json")
> fit <- stan("model.stan", chains=1, iter=1000, warmup=400, data = data)


Gradient evaluation took 0.006025 seconds
1000 transitions using 10 leapfrog steps per transition would take 60.25 seconds.
Adjust your expectations accordingly!


Python code:

>>> import pystan
>>> import json
>>> data = json.load(open("sample_data.json"))
>>> model = pystan.StanModel(file="model.stan")
>>> fit = model.sampling(data=data, iter=1000, warmup=400, chains=1)

Gradient evaluation took 0.000489 seconds
1000 transitions using 10 leapfrog steps per transition would take 4.89 seconds.
Adjust your expectations accordingly!
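The two startup messages quantify the gap before any sampling happens: Stan’s projection is just gradient time × 10 leapfrog steps × 1000 transitions. Reproducing the arithmetic:

```python
# Stan's projected total: gradient time x leapfrog steps x transitions.
rstan_grad = 0.006025    # seconds per gradient, from the RStan run above
pystan_grad = 0.000489   # seconds per gradient, from the PyStan run above

def projected_total(grad_seconds, leapfrog=10, transitions=1000):
    return grad_seconds * leapfrog * transitions

print(projected_total(rstan_grad))   # ~60.25, matching the RStan message
print(projected_total(pystan_grad))  # ~4.89, matching the PyStan message
print(rstan_grad / pystan_grad)      # ~12.3x slower per gradient evaluation
```

Since both interfaces run the same C++ gradient code, a 12x per-gradient difference points at how the C++ was compiled, not at the samplers.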

Given what was written above, I will try on a different machine as soon as I have a chance.


Thanks, Bob, for following up on this!


As an update: on a different (Linux) box my example ran in the same time with both RStan and PyStan, so it looks to be some system-specific problem on my Mac laptop.


Mind sharing the configuration? It would help us, the developers, out if we understood how this behavior is happening.

That’s essentially whatever information you have about:

  • operating system version
  • C++ compiler, including version
  • whether you have a second C++ compiler installed
  • R version
  • RStan version
  • R makefiles
  • Python version
  • PyStan version
  • any specific build instructions for PyStan

I think if we had that we could try to reproduce it.

The simplest reasons these could differ:

  • different compilers being used
  • different optimization levels used to compile the C++ (it should be -O3 without the debug flag)

After that, it’s a guessing game.


I ran this on my Mac and did not observe any difference in performance between the interfaces.

Operating system: OS 10.13.3
gcc --version : Apple LLVM version 9.0.0 (clang-900.0.39.2)
R version: 3.4.3
RStan version: 2.17.3
Python version: 2.7.14
PyStan version:
~/.R/Makevars contents:

CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function  -Wno-macro-redefined
CXX=clang++ -arch x86_64 -ftemplate-depth-256


I think the lesson learned here is to ask people for compiler settings AND to wipe out any old stuff.

Caching models is nice, but obviously an obstacle at times.


I found the culprit: the Makevars file was

CXXFLAGS=-O3 -mtune=native -march=native
CXXFLAGS= -Wno-unused-variable -Wno-unused-function  -Wno-macro-redefined

CXX=clang++ -arch x86_64 -ftemplate-depth-256

Instead of

CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function  -Wno-macro-redefined

CXX=clang++ -arch x86_64 -ftemplate-depth-256

The second CXXFLAGS line, I guess, was overwriting the first one and dropping the optimization flags.

I don’t remember how that file was created; I probably copied code from different places. Following the RStan installation manual produces the correct file shown above.


Another lesson is that we need some end-to-end integration/performance
tests making sure that the different interfaces produce exactly the same
results in roughly the same time.


By the way, if fit is a stanfit object, you can display how its dynamic shared object was compiled; the output looks like this:

S4 class cxxdso: dso_saved = TRUE, dso_filename = file204230ed44b7, size = 23.8 Mb.
And dso_last_path = '/tmp/Rtmpvm6BLt/'.
Created on: 'x86_64, linux-gnu' with 'CXXFLAGS = -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g $(LTO)'.
Loaded now: YES.
The signatures is/are as follows:


I think having those end-to-end tests would be great. That wouldn’t have solved this problem, but I think we should have them nonetheless.

Is there some way we can test for user compiler flags going awry? I don’t think we can insist on -O3 optimization, but then I don’t know a lot about builds.
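One lightweight check along those lines would be to scan ~/.R/Makevars for repeated plain assignments to the same variable, since a later `VAR =` line silently replaces an earlier one, which is exactly the failure mode above. A hedged sketch (not an existing RStan feature; it only understands simple `=` assignments, not `+=` or conditionals):

```python
def makevars_warnings(text):
    """Flag variables assigned more than once with plain '=' (later wins in make)."""
    seen = {}
    warnings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        name, _, _value = stripped.partition("=")
        name = name.strip()
        if name.endswith(("+", ":", "?")):  # skip +=, :=, ?= assignments
            continue
        if name in seen:
            warnings.append(
                f"line {lineno}: {name} reassigned, line {seen[name]} is discarded"
            )
        seen[name] = lineno
    return warnings

makevars = """\
CXXFLAGS=-O3 -mtune=native -march=native
CXXFLAGS= -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined
CXX=clang++ -arch x86_64 -ftemplate-depth-256
"""
print(makevars_warnings(makevars))
# flags the second CXXFLAGS line: the -O3 assignment on line 1 is discarded
```

It wouldn’t enforce -O3, but it would have surfaced the duplicate-assignment problem in the Makevars file from this thread.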

Bit-level reproducibility requires locking down everything: compiler, compiler flags, platform, OS, and hardware.