RStan is slower than PyStan


#22

Thanks a lot. Not good.

We will need to look into this. If you can share your example (with some data), that would be great; otherwise it will take a bit longer to dive in.


#23

I’m trying to. The first roadblock I ran into is that the original script doesn’t run. When I source it, the result is:

> source('slow.R')
Loading required package: lattice
Loading required package: ggplot2
Need help? Try the ggplot2 mailing list: http://groups.google.com/group/ggplot2.
Loading required package: StanHeaders
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
Error in `[<-`(`*tmp*`, i, ((i - 1) * length(h3) + 1):(i * length(h3)),  : 
  subscript out of bounds
> 

#24

Summary: I ran it on Mac OS X with RStan 2.17.3 and Apple LLVM version 7.0.2 (clang-700.1.81), and it ran at roughly the same speed as reported here for PyStan.

I couldn’t understand your data simulation script, so I wrote one directly in Stan, implementing the half-normal with fabs.

data {
  int K;
  int N;
  int P;
  real a;
  real b;
  real alpha;
  real lambda;
}
generated quantities {
  real<lower=0> sigma;
  simplex[K] W[N];
  vector<lower=0>[K] H[P];
  real y[N, P];

  sigma = inv_gamma_rng(a, b);
  for (n in 1:N)
    W[n] = dirichlet_rng(rep_vector(alpha, K));
  for (p in 1:P)
    for (k in 1:K)
      H[p, k] = exponential_rng(lambda);
  for (n in 1:N)
    for (p in 1:P)
      y[n, p] = fabs(normal_rng(dot_product(W[n], H[p]), sigma));
}

I then simulated and fit using the OP’s model posted above with this:

print("SIMULATING DATA FROM PRIORS", quote = FALSE)

consts_hyperparams <- list(K = 3, N = 100, P = 10,
                           a = 1, b = 1, alpha = 1, lambda = 0.1)
sim <- stan("sim.stan", data = consts_hyperparams,
            algorithm = "Fixed_param",
            chains = 1, warmup = 0, iter = 1)
y_sim <- extract(sim)$y[1, , ]

print("FITTING SIMULATED DATA", quote = FALSE)


data_sim <- list(K = 3, N = 100, P = 10,
                 a = 1, b = 1, dir_alpha = 1, exp_lambda = 0.1,
                 y = abs(y_sim))

fit <- stan("slow.stan", data = data_sim, chains = 1, iter = 2000)

The result is in line with what you reported for PyStan:

 Elapsed Time: 17.2228 seconds (Warm-up)
               11.1658 seconds (Sampling)
               28.3886 seconds (Total)

This is on my now-ancient MacBook Pro (Retina, mid-2012, 2.3 GHz i7, 6 GB 1600 MHz DDR3). I’m running Xcode (probably an old version, because I’m still on an older OS):

~/temp2/slow$ clang++ --version
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin14.5.0
Thread model: posix

And here’s my session information:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.17.3       StanHeaders_2.17.2 ggplot2_2.2.1     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15     codetools_0.2-15 grid_3.3.2       plyr_1.8.4       gtable_0.2.0     stats4_3.3.2    
 [7] scales_0.4.1     rlang_0.1.4      lazyeval_0.2.0   tools_3.3.2      munsell_0.4.3    inline_0.3.14   
[13] colorspace_1.3-2 gridExtra_2.2.1  tibble_1.3.4    
> 

It also roughly recovered sigma, though not quite within the 95% posterior interval:

> extract(sim)$sigma
[1] 4.07

> print(fit, "sigma", probs=c(0.025, 0.5, 0.975))
      mean se_mean  sd 2.5%  50% 97.5% n_eff Rhat
sigma 3.76       0 0.1 3.57 3.75  3.95  1000    1

Nothing I’d worry about with this kind of model.


#25

As an added bonus, here’s how to code the model to run more than twice as fast:

functions {
  matrix to_matrix(vector[] a) {
    int M = size(a);
    int N = rows(a[1]);
    matrix[M, N] a_mat;
    for (m in 1:M)
      for (n in 1:N)
        a_mat[m, n] = a[m, n];
    return a_mat;
  }
}
data {
  int<lower=0> N;
  int<lower=0> P;
  int<lower=0> K;
  matrix<lower = 0>[N, P] y;
  real<lower=0> exp_lambda;
  real<lower=0> dir_alpha;
  real<lower=0> a;
  real<lower=0> b;
}
transformed data{
  vector<lower=0>[K] alpha = rep_vector(dir_alpha, K);
}
parameters{
  simplex[K] W[N];
  matrix<lower = 0>[K, P] H;
  real<lower=0> sigma;
}
model{
  sigma ~ inv_gamma(a,b);
  to_vector(H) ~ exponential(exp_lambda);
  to_vector(y) ~ normal(to_vector(to_matrix(W) * H), sigma);
}

The runtime is:

 Elapsed Time: 6.49639 seconds (Warm-up)
               3.8764 seconds (Sampling)
               10.3728 seconds (Total)

and the result’s the same; the previous code took 28.3886 seconds (Total).

The trick is converting the W * H computation to a single matrix multiplication, which is much faster than N * P separate dot products because there are really good matrix multiply algorithms (the savings are much bigger for larger matrices). I also dropped the Dirichlet prior because it’s uniform, which is the default. And I vectorized the other distributions. The copies involved in to_vector and the like are cheap compared to the savings.
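To see the effect outside of Stan, here is a rough R-level sketch of the same idea, comparing a single matrix multiply against the equivalent loop of dot products (dimensions match the simulated data above; this only illustrates the linear-algebra cost, not Stan’s gradient computation):

# Rough illustration of why one matrix multiply beats N * P dot products.
N <- 100; P <- 10; K <- 3
W <- matrix(runif(N * K), N, K)
H <- matrix(runif(K * P), K, P)

loop_version <- function() {
  mu <- matrix(0, N, P)
  for (n in 1:N)
    for (p in 1:P)
      mu[n, p] <- sum(W[n, ] * H[, p])  # one dot product per (n, p) entry
  mu
}

matrix_version <- function() W %*% H    # one optimized matrix multiply

all.equal(loop_version(), matrix_version())    # same result
system.time(for (i in 1:1000) loop_version())
system.time(for (i in 1:1000) matrix_version())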

It’s clunky to do these data-type conversions. We need the to_matrix function that I implemented here as a user-defined function to be built into Stan.

Sorry, I don’t know what’s wrong with your install of RStan. This isn’t a particularly intensive model at this data size.


#26

Dear all, thank you for your help and the good discussion. I tried the same calculation on another computer, and the PyStan and RStan results became close enough, so I thought there was something wrong with my computer. However, when I tried again on the previous computer, this time I deleted the .rds file left over from the previous compilation of the Stan file and recompiled it, and now the speed of RStan is back to normal.

So I think the previous .rds file was from unoptimized code that was loaded automatically when I started RStan, even though I had already changed my Stan file. PyStan doesn’t have an auto_write option, so it recompiles the Stan file after I change it. Or maybe my .rds file was just corrupted.
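In case anyone else hits the same thing, here is a minimal sketch of clearing the cached model and recompiling (it assumes auto_write saved the compiled model next to the Stan file as slow.rds, and my_data stands in for whatever data list you were fitting):

library(rstan)

# Delete the cached compiled model so it gets rebuilt with the current
# Stan source and compiler flags.
if (file.exists("slow.rds")) file.remove("slow.rds")

rstan_options(auto_write = TRUE)  # cache the freshly compiled model again
fit <- stan("slow.stan", data = my_data, chains = 1, iter = 2000)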


#27

Here is an example that exhibits the same behavior for me.

Files:
model.stan (710 Bytes)
sample_data.txt (43.0 KB)

sample_data.txt should be renamed to .json; the current Discourse settings do not allow uploading JSON files.

R code:

> library("rstan")
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
> library("rjson")
>
> data <- fromJSON(file="sample_data.json")
> fit <- stan("model.stan", chains=1, iter=1000, warmup=400, data = data)

SAMPLING FOR MODEL 'model' NOW (CHAIN 1).

Gradient evaluation took 0.006025 seconds
1000 transitions using 10 leapfrog steps per transition would take 60.25 seconds.
Adjust your expectations accordingly!
...

Python:

> import pystan
> import json
> 
> data = json.load(open("sample_data.json"))
> model = pystan.StanModel(file="model.stan")
> fit = model.sampling(data=data, iter=1000, warmup=400, chains=1)

Gradient evaluation took 0.000489 seconds
1000 transitions using 10 leapfrog steps per transition would take 4.89 seconds.
Adjust your expectations accordingly!
...

Given what was written above, I will try it on a different machine as soon as I have a chance.


#28

Thanks, Bob, for following up on this!


#29

As an update: on a different (Linux) box my example ran in the same time with both RStan and PyStan, so it looks like some system-specific problem on my Mac laptop.


#30

Mind sharing the configuration? It would help us, the developers, out if we understood why this behavior is happening.

That’s essentially whatever information you have about:

  • operating system version
  • C++ compiler, including version
  • if you have a second C++ compiler installed
  • R version
  • RStan version
  • R makefiles
  • Python version
  • PyStan version
  • any specific build instructions for PyStan

I think if we had that we could try to reproduce it.

The simplest reasons this could differ:

  • different compilers being used
  • different optimization levels used to compile the C++ (it should be at level 3 without the debug flag)

After that, it’s a guessing game.
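For reference, most of the compiler-related items on that list can be gathered from within R along these lines (a sketch; it assumes the usual ~/.R/Makevars location, and on the Python side pystan.__version__ gives the PyStan version):

# R, package, and session details
sessionInfo()
packageVersion("rstan")
packageVersion("StanHeaders")

# C++ compiler(s) on the PATH
system("clang++ --version")
system("g++ --version")

# User-level R makefile, where optimization flags are usually set
makevars <- path.expand("~/.R/Makevars")
if (file.exists(makevars)) cat(readLines(makevars), sep = "\n")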


#31

I ran this on my Mac and did not observe any difference in performance between the interfaces.

Operating system: macOS 10.13.3
gcc --version : Apple LLVM version 9.0.0 (clang-900.0.39.2)
R version: 3.4.3
RStan version: 2.17.3
Python version: 2.7.14
PyStan version: 2.17.1.0
~/.R/Makevars contents:

CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function  -Wno-macro-redefined
CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256

#32

I think the lesson learned here is to ask people for compiler settings AND to wipe out any old stuff.

Caching models is nice, but obviously an obstacle at times.


#33

I found the culprit: the Makevars file was

CXXFLAGS=-O3 -mtune=native -march=native
CXXFLAGS= -Wno-unused-variable -Wno-unused-function  -Wno-macro-redefined

CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256

Instead of

CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function  -Wno-macro-redefined

CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256

I guess the second CXXFLAGS line was overwriting the first one: a later plain CXXFLAGS= assignment replaces the earlier one rather than appending to it, so the optimization flags were dropped.

I don’t remember how that file was created; I probably copied snippets from different places. Following the RStan installation manual produces the correct file, i.e. the second version above.


#34

Another lesson is that we need some end-to-end integration/performance tests making sure that the different interfaces produce exactly the same results in roughly the same time.


#35

By the way, if fit is a stanfit object, you can use the following to show how it was compiled:

fit@stanmodel@dso
S4 class cxxdso: dso_saved = TRUE, dso_filename = file204230ed44b7, size = 23.8 Mb.
And dso_last_path = '/tmp/Rtmpvm6BLt/file204230ed44b7.so'.
Created on: 'x86_64, linux-gnu' with 'CXXFLAGS = -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g $(LTO)'.
Loaded now: YES.
The signatures is/are as follows:
$file204230ed44b7
character(0)

#36

I think having those end-to-end tests would be great. That wouldn’t have solved this problem, but I think we should have them nonetheless.

Is there some way we can test for user compiler flags going awry? I don’t think we can insist on -O3 optimization, but then I don’t know a lot about builds.
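One rough possibility, just a sketch rather than anything RStan does now, would be to look at the last plain CXXFLAGS assignment in the user Makevars and warn when it carries no optimization flag, which is exactly what went wrong above:

# Heuristic only: it ignores += assignments and site-wide makefiles.
makevars <- path.expand("~/.R/Makevars")
if (file.exists(makevars)) {
  lines <- readLines(makevars)
  cxx <- grep("^[[:space:]]*CXXFLAGS[[:space:]]*=", lines, value = TRUE)
  if (length(cxx) > 0 && !grepl("-O[0-3s]", cxx[length(cxx)])) {
    warning("The last CXXFLAGS assignment in ~/.R/Makevars sets no -O flag; ",
            "compiled Stan models may run much slower than expected.")
  }
}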

Bit-level reproducibility requires locking down everything, including compiler, platform, compiler flags, hardware, OS, CPU, etc.