RStan is slower than PyStan

Operating System: Manjaro Linux 17.1.6
Interface Version: rstan 2.17.3
Compiler/Toolkit: GCC 7.3.0

I usually use PyStan, but for some reason I need to use RStan. Using this model code:

data {
    int<lower=0> N; // samples
    int<lower=0> P; // features
    int<lower=0> K; // hidden variables
    real<lower=0> y[N,P]; // target matrix
    real<lower=0> exp_lambda; // parameter for exponential distribution of H
    real<lower=0> dir_alpha; // parameter for dirichlet distribution of W
    real<lower=0> a;
    real<lower=0> b;
    //real<lower=0> sigma;
} 

transformed data{
    vector<lower=0>[K] dir_alpha_vec;
    for (k in 1:K){
        dir_alpha_vec[k] = dir_alpha;
    }
}

parameters{
    simplex[K] W[N];               // W[N,K] basis matrix
    vector<lower=0>[K] H[P];       // H[K,p] coefficient matrix
    real<lower=0> sigma;
}

model{
    for (n in 1:N) {
        W[n] ~ dirichlet(dir_alpha_vec);
    }
    for (p in 1:P) {
        H[p] ~ exponential(exp_lambda);
    }
    sigma ~ inv_gamma(a, b);
    for (n in 1:N) {
        for (p in 1:P) {
            target += normal_lpdf(y[n,p] | dot_product(W[n], H[p]), sigma);
        }
    }
}

in RStan is much slower than in PyStan at decomposing a 100 x 10 matrix. Any idea why this happens?

I can't believe they're not identical.

The only difference in what gets run is I/O for the underlying data. Usually that’s not a bottleneck.

The C++ for actually running Stan's algorithm is exactly the same. Are you using the same compilers and compiler settings on both platforms?

I am sorry if this is a newbie question (I am not really a programmer), but how do I check the compilers and compiler settings on both platforms? For PyStan, I installed Anaconda and then installed PyStan using pip. For RStan, I followed the instructions from the website. When I check the GCC version from the terminal, it is GCC 7.3.0. I am using both interfaces on the same computer.

I also use options(mc.cores = parallel::detectCores()) in RStan.

RStan should be using the same GCC you see from the terminal. You can check the Makevars file in your R installation directory to make sure it has the -O3 optimization setting. I don't know how to check in PyStan.
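
For example, a minimal sketch of checking this from R (the ~/.R/Makevars path is an assumption; it is the usual user-level location, but your setup may differ):

# Minimal sketch: inspect the compiler flags RStan will pick up.
# The ~/.R/Makevars path is a typical user-level location (assumption).
makevars <- path.expand("~/.R/Makevars")
if (file.exists(makevars)) cat(readLines(makevars), sep = "\n")
# Flags R itself was configured with:
system("R CMD config CXX")
system("R CMD config CXXFLAGS")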

When you say RStan is much slower than PyStan, what exactly did you run, under what conditions, and what were the results?

My Makevars file:

CXXFLAGS=-O3 -mtune=native -march=native
CXXFLAGS= -Wno-unused-variable -Wno-unused-function  -Wno-macro-redefined

CXXFLAGS+=-flto -ffat-lto-objects  -Wno-unused-local-typedefs

CXXFLAGS += -Wno-ignored-attributes -Wno-deprecated-declarations

I perform matrix factorization. I want to factorize a matrix X into two matrices W and H (X = W x H). The matrix X has dimension 100 x 10.

The timing result for PyStan is:
Elapsed Time: 14.6851 seconds (Warm-up)
7.63082 seconds (Sampling)
22.3159 seconds (Total)
There is no result for RStan yet because it has been running for more than 40 minutes.

Here is my full code:

library(caret)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# Create W,H1,H2,H3 --------------
w <- rep(1, 30)
h1 <- rep(1, 3)
h2 <- rep(1, 5)
h3 <- rep(1, 5)


sample = 100
feature1 = 10
feature2 = 17
feature3 = 15   # needs at least K * length(h3) = 15 columns for the block assignment below
K=3

W = matrix(rep(0, sample * K), nrow = sample)
H1 = matrix(rep(0, K * feature1), nrow = K)
H2 = matrix(rep(0, K * feature2), nrow = K)
H3 = matrix(rep(0, K * feature3), nrow = K)


for (i in 1:K){
  
  W[((i-1)*length(w)+1):(i*length(w)),i]=w
  H1[i,((i-1)*length(h1)+1):(i*length(h1))]=h1
  H2[i,((i-1)*length(h2)+1):(i*length(h2))]=h2
  H3[i,((i-1)*length(h3)+1):(i*length(h3))]=h3
}



# Create original matrix ---------------------
X1 <- W %*% H1 # gene expression
X2 <- W %*% H2 # epigenetic location
X3 <- W %*% H3 # SNP

XX1 =  X1 *  rnorm(n= length(X1),4, 1)
XX2 =  X2 *  rnorm(n= length(X2),4, 1)
XX3 =  X3 *  sample(0:2, size= length(X3), 
                    replace =TRUE,
                    prob =c(0.1,0.3,0.6))

XXX1 = XX1
XXX2 = XX2
XXX3 = XX3

# replace every entry that was 0 in the original matrices with background noise
XXX1[which(X1==0)] = rnorm(n= length(XXX1[which(X1==0)]), 2, 1)
XXX2[which(X2==0)] = rnorm(n= length(XXX2[which(X2==0)]), 2, 1)
XXX3[which(X3==0)] = sample( 0:2, prob=c(0.7,0.2,0.1),
                             size= length(XXX3[which(X3==0)]),
                             replace =TRUE)


# set any negative values to zero
XXX1[which(XXX1<0)] = 0 
XXX2[which(XXX2<0)] = 0 
XXX3[which(XXX3<0)] = 0 

nmf_single= stan_model(file="NMF_single.stan")
data_single = list(N = sample,
                   P = feature1,
                   K= K+1,
                   y= XXX1,
                   exp_lambda=.1,
                   dir_alpha=1,a=1,b=1)
nmf_single_fit = sampling(nmf_single, data = data_single, 
                          iter = 2000, chains = 1)

That sounds like some kind of bug. Can you try running it with a single core and see what happens?
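
For example, a minimal sketch of a single-core run (assuming nmf_single and data_single as defined in the code above):

# Minimal sketch: force a single core and a single chain so the
# comparison with PyStan is like-for-like.
options(mc.cores = 1)
fit_single_core <- sampling(nmf_single, data = data_single,
                            iter = 2000, chains = 1, cores = 1)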

I am sorry, there was a mistake in my previous post. I used a larger dataset in RStan, so the sampling time obviously became longer. After revising the datasets in both PyStan and RStan to use smaller datasets, the results are:
The result for RStan with multiple cores:

Elapsed Time: 175.71 seconds (Warm-up)
95.3634 seconds (Sampling)
271.073 seconds (Total)

The result for RStan with a single core:

Elapsed Time: 253.347 seconds (Warm-up)
384.046 seconds (Sampling)
637.393 seconds (Total)

The result for PyStan:

Elapsed Time: 13.2425 seconds (Warm-up)
7.18552 seconds (Sampling)
20.428 seconds (Total)

It still takes RStan much longer to do the sampling.

Could this be more of an indication of the difference in execution times between Python and R rather than between RStan and PyStan?

Unlikely; for both interfaces the vast majority of computation is done in the compiled C++ executable, which should be the same if using the same model and input data.

The interfaces only give the executable the data at the beginning then grab the output at the end (well, also receive and print messages from the executable), so any differences in their memory management shouldn’t affect compute time. Maybe if you have huge data input and then somehow a ridiculously fast model, a difference might affect things, but I think that’s a stretch. I also might be wrong in my understanding though.


It is very unexpected to see such a large difference in compute time between the interfaces if the data and model are identical. Maybe double-check that you actually have the same data by doing all your data prep in R and then saving it to a format that Python can read (like feather or CSV). Run the sampling in R, then go to Python, read the data, and sample there.
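
A minimal sketch of the R side of that check (the file name is a placeholder, and XXX1 is the matrix from the code above):

# Minimal sketch: write the exact matrix used by RStan to disk so PyStan
# can be fed identical values. "XXX1_shared.csv" is a placeholder name.
write.csv(XXX1, file = "XXX1_shared.csv", row.names = FALSE)
# In Python, something like pandas.read_csv("XXX1_shared.csv").values
# should then reproduce the same y matrix for PyStan.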


Certainly nobody else has reported order-of-magnitude differences. You can use CmdStan as a benchmark; it is very lean and just streams data out (see the sketch below for exporting the data).

The only thing I could imagine would be if you run out of memory and start swapping in R and not in Python. Otherwise, there’s just not that much going on in the interfaces—all the work’s on the C++ side.
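
A minimal sketch of exporting the same data for a CmdStan run (assumes CmdStan is installed separately; the values mirror the data_single list above, and the file/executable names are placeholders):

# Minimal sketch: dump the data in the Rdump format CmdStan reads.
# Names must match the data block of NMF_single.stan; values mirror data_single.
N <- nrow(XXX1); P <- ncol(XXX1); K <- 4
y <- XXX1; exp_lambda <- 0.1; dir_alpha <- 1; a <- 1; b <- 1
rstan::stan_rdump(c("N", "P", "K", "y", "exp_lambda", "dir_alpha", "a", "b"),
                  file = "nmf_single.data.R")
# Then, from the shell (placeholder paths):
#   ./NMF_single sample data file=nmf_single.data.R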

I saved the dataset from the R code as a CSV file, as you suggested, and read it in Python. Then I used the same number of iterations and chains. The result is:

RStan:

Elapsed Time: 229.444 seconds (Warm-up)
217.377 seconds (Sampling)
446.821 seconds (Total)

PyStan:

Elapsed Time: 16.2758 seconds (Warm-up)
9.97611 seconds (Sampling)
26.2519 seconds (Total)

I am not sure of the reason for this difference.

Which versions of RStan and PyStan?

This is really odd!

Can you post a sessionInfo() from R please?

In Python, run:

import pystan
print(pystan.__version__)

And the Python version (from the terminal):

python --version

To my surprise, I also ran into this after moving a model from Python to R and seeing much slower sampling than expected. I went back and took the exact same dataset (saved to the same JSON file) and model code and ran it with PyStan and RStan. The gradient evaluation time differed by an order of magnitude.

>>> print(pystan.__version__)
2.17.1.0

$ python --version
Python 2.7.10

R:
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)

Makevars:

CXXFLAGS=-O3 -mtune=native -march=native
CXXFLAGS= -Wno-unused-variable -Wno-unused-function  -Wno-macro-redefined
CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256

It's a basic varying-slopes hierarchical model, and the dataset has a ~50k x 10 covariate matrix.


This is really scary and sounds like it needs investigation.

RStan is terribly slow whenever you have lots of generated quantities and/or lots of parameters/transformed parameters, as this creates a massive I/O burden; RStan handled this very inefficiently the last time I looked.

So what you could try is disabling the output of most variables and only outputting the means in RStan. Would that be possible to try (use the pars argument of the sampling/stan call)?
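
For example, a minimal sketch of restricting the saved output with pars (the parameter names here come from the NMF model earlier in the thread and are placeholders for whatever you actually need):

# Minimal sketch: save only a few quantities to cut RStan's I/O overhead.
# "sigma" and "W" are placeholders; list whichever parameters you need.
fit_small_output <- sampling(nmf_single, data = data_single,
                             iter = 2000, chains = 1,
                             pars = c("sigma", "W"), include = TRUE)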


Adding pars=c(...) with a couple of parameters to the stan(...) call doesn’t seem to meaningfully change the gradient evaluation time, but maybe I misunderstood what is needed?

The model does have its share of transformed parameters: it has ~300 categories x ~5-10 covariates, parameterized with the Cholesky factorization.

Thanks. This is what I meant. Very odd. I don’t quite get what is happening.

Any chance you could compare CmdStan to this?

I had to do stan_rdump from Python because the R version was silently failing to create a data file, but in the end the CmdStan version reports the same timings as PyStan.