Thanks a lot. Not good.
We will need to look into this. If you can share your example (with any data) that would be great; otherwise it takes a moment longer to dive in.
I'm trying to. The first roadblock I ran into is that the original script doesn't run. When I source it, the result is:
> source('slow.R')
Loading required package: lattice
Loading required package: ggplot2
Need help? Try the ggplot2 mailing list: http://groups.google.com/group/ggplot2.
Loading required package: StanHeaders
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
Error in `[<-`(`*tmp*`, i, ((i - 1) * length(h3) + 1):(i * length(h3)), :
subscript out of bounds
>
Summary: I ran it on Mac OS X with RStan 2.17.3 and Apple LLVM version 7.0.2 (clang-700.1.81), and it ran at roughly the same speed as reported here for PyStan.
I couldn't understand your data simulation script, so I wrote one directly in Stan, implementing the half normal with fabs.
data {
  int K;
  int N;
  int P;
  real a;
  real b;
  real alpha;
  real lambda;
}
generated quantities {
  real<lower=0> sigma;
  simplex[K] W[N];
  vector<lower=0>[K] H[P];
  real y[N, P];
  sigma = inv_gamma_rng(a, b);
  for (n in 1:N)
    W[n] = dirichlet_rng(rep_vector(alpha, K));
  for (p in 1:P)
    for (k in 1:K)
      H[p, k] = exponential_rng(lambda);
  for (n in 1:N)
    for (p in 1:P)
      y[n, p] = fabs(normal_rng(dot_product(W[n], H[p]), sigma));
}
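For readers more comfortable outside Stan, the same generative process can be sketched in NumPy (my own illustration, not code from the thread; the constants match the hyperparameters used in the fit below):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, P = 3, 100, 10
a, b, alpha, lam = 1.0, 1.0, 1.0, 0.1

# Mirror the generated-quantities block: draw from the priors, then
# draw half-normal data around the low-rank mean W * H'.
sigma = 1.0 / rng.gamma(a, 1.0 / b)           # inverse-gamma(a, b) draw
W = rng.dirichlet(np.full(K, alpha), size=N)  # N rows on the K-simplex
H = rng.exponential(1.0 / lam, size=(P, K))   # exponential with rate lam
y = np.abs(rng.normal(W @ H.T, sigma))        # fabs(normal_rng(...)) in Stan

print(y.shape)  # (100, 10)
```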
I then simulated and fit using the OP's model posted above with this:
print("SIMULATING DATA FROM PRIORS", quote = FALSE)
consts_hyperparams <- list(K = 3, N = 100, P = 10,
                           a = 1, b = 1, alpha = 1, lambda = 0.1)
sim <- stan("sim.stan", data = consts_hyperparams,
            algorithm = "Fixed_param",
            chains = 1, warmup = 0, iter = 1)
y_sim <- extract(sim)$y[1, , ]
print("FITTING SIMULATED DATA", quote = FALSE)
data_sim <- list(K = 3, N = 100, P = 10,
                 a = 1, b = 1, dir_alpha = 1, exp_lambda = 0.1,
                 y = abs(y_sim))
fit <- stan("slow.stan", data = data_sim, chains = 1, iter = 2000)
The result is in line with what you reported for PyStan:
Elapsed Time: 17.2228 seconds (Warm-up)
11.1658 seconds (Sampling)
28.3886 seconds (Total)
This is on my now-ancient MacBook Pro (Retina mid-2012, 2.3 GHz i7, 6 GB 1600 MHz DDR3). I'm running Xcode (probably old because I'm still on the last OS):
~/temp2/slow$ clang++ --version
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin14.5.0
Thread model: posix
And here's my session information:
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rstan_2.17.3 StanHeaders_2.17.2 ggplot2_2.2.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.15 codetools_0.2-15 grid_3.3.2 plyr_1.8.4 gtable_0.2.0 stats4_3.3.2
[7] scales_0.4.1 rlang_0.1.4 lazyeval_0.2.0 tools_3.3.2 munsell_0.4.3 inline_0.3.14
[13] colorspace_1.3-2 gridExtra_2.2.1 tibble_1.3.4
>
It also roughly recovered sigma, though not quite within the 95% posterior interval:
> extract(sim)$sigma
[1] 4.07
> print(fit, "sigma", probs=c(0.025, 0.5, 0.975))
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
sigma 3.76 0 0.1 3.57 3.75 3.95 1000 1
Nothing I'd worry about with this kind of model.
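To make the interval check explicit, here's a trivial sketch using the numbers printed above:

```python
# Prior draw of sigma vs. the fitted 2.5% and 97.5% quantiles reported above.
sigma_true = 4.07
lo, hi = 3.57, 3.95

covered = lo <= sigma_true <= hi
print(covered)  # False: the true value falls just above the interval,
                # which a calibrated 95% interval allows about 5% of the time
```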
As an added bonus, here's how to code the model to run more than twice as fast:
functions {
  matrix to_matrix(vector[] a) {
    int M = size(a);
    int N = rows(a[1]);
    matrix[M, N] a_mat;
    for (m in 1:M)
      for (n in 1:N)
        a_mat[m, n] = a[m, n];
    return a_mat;
  }
}
data {
  int<lower=0> N;
  int<lower=0> P;
  int<lower=0> K;
  matrix<lower=0>[N, P] y;
  real<lower=0> exp_lambda;
  real<lower=0> dir_alpha;
  real<lower=0> a;
  real<lower=0> b;
}
transformed data {
  vector<lower=0>[K] alpha = rep_vector(dir_alpha, K);
}
parameters {
  simplex[K] W[N];
  matrix<lower=0>[K, P] H;
  real<lower=0> sigma;
}
model {
  sigma ~ inv_gamma(a, b);
  to_vector(H) ~ exponential(exp_lambda);
  to_vector(y) ~ normal(to_vector(to_matrix(W) * H), sigma);
}
The runtime is:
Elapsed Time: 6.49639 seconds (Warm-up)
3.8764 seconds (Sampling)
10.3728 seconds (Total)
and the result's the same; the previous code took 28.3886 seconds (Total).
The trick is converting the W * H to matrix multiplication, which is much faster than doing the M * N dot products one at a time because there are really good matrix multiply algorithms; the savings are much bigger for larger matrices. I also dropped the Dirichlet prior because it's uniform and that's the default. And I vectorized the other distributions. The copies involved in to_vector and whatnot are cheap compared to the savings.
It's clunky to do these data-type conversions. We need the to_matrix function I implemented here as a user-defined function to be built into Stan.
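The matrix-multiply speedup is easy to demonstrate outside Stan as well; here is an illustrative NumPy sketch (my own, not from the thread) confirming that one matrix multiply computes exactly the same values as the loop of dot products:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, P = 100, 3, 10
W = rng.random((N, K))
H = rng.random((K, P))

# Looped version: one dot product per (n, p) cell, as in the original model.
mu_loop = np.empty((N, P))
for n in range(N):
    for p in range(P):
        mu_loop[n, p] = np.dot(W[n], H[:, p])

# Vectorized version: a single call into an optimized matrix-multiply routine.
mu_mat = W @ H

print(np.allclose(mu_loop, mu_mat))  # True
```

The payoff grows with matrix size, since optimized BLAS-style routines exploit cache locality and vectorization that a scalar loop cannot.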
Sorry, I don't know what's wrong with your install of RStan. This isn't a particularly intensive model at this data size.
Dear all, thank you for your help and the good discussion. I tried the same calculation on another computer, and the PyStan and RStan results became close enough, so I thought something was wrong with my original computer. However, when I tried again on the previous computer, this time I deleted the .rds file left over from the previous compilation of the Stan file and recompiled. Now the speed of RStan is normal again.
So I think the previous .rds file was from unoptimized code that was loaded automatically when I started RStan, even though I had already changed my Stan file. PyStan has no auto_write option, so it recompiles the Stan file after the file changes. Or maybe my .rds file was just corrupted.
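One way to make such stale caches impossible is to key the cache file by the model's content rather than its name. A hypothetical Python sketch of the idea (rstan's actual auto_write mechanism works differently; cache_path and the file names here are my own invention):

```python
import hashlib
import tempfile
from pathlib import Path

def cache_path(stan_file: Path) -> Path:
    # Key the cache by the model's *content*: any edit to the .stan file
    # changes the digest, so a stale compiled model can never be reused.
    digest = hashlib.sha256(stan_file.read_bytes()).hexdigest()[:12]
    return stan_file.with_suffix(f".{digest}.rds")

# Editing the model changes where the cache would live:
with tempfile.TemporaryDirectory() as d:
    model = Path(d) / "slow.stan"   # hypothetical stand-in for a real model
    model.write_text("model { }")
    before = cache_path(model)
    model.write_text("model { /* edited */ }")
    after = cache_path(model)
    print(before.name != after.name)  # True: the old cache is never picked up
```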
Here is an example that exhibits the same behavior for me.
Files:
model.stan (710 Bytes)
sample_data.txt (43.0 KB)
sample_data.txt should be renamed to .json; the current Discourse settings do not let you upload JSON files.
R code:
> library("rstan")
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
> library("rjson")
>
> data <- fromJSON(file="sample_data.json")
> fit <- stan("model.stan", chains=1, iter=1000, warmup=400, data = data)
SAMPLING FOR MODEL 'model' NOW (CHAIN 1).
Gradient evaluation took 0.006025 seconds
1000 transitions using 10 leapfrog steps per transition would take 60.25 seconds.
Adjust your expectations accordingly!
...
Python:
> import pystan
> import json
>
> data = json.load(open("sample_data.json"))
> model = pystan.StanModel(file="model.stan")
> fit = model.sampling(data=data, iter=1000, warmup=400, chains=1)
Gradient evaluation took 0.000489 seconds
1000 transitions using 10 leapfrog steps per transition would take 4.89 seconds.
Adjust your expectations accordingly!
...
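The projected runtimes in those messages are just the measured gradient time scaled up, so the gap can be quantified directly (an arithmetic check on the numbers printed above):

```python
# Stan's projection: gradient_time * 10 leapfrog steps * 1000 transitions.
rstan_grad = 0.006025   # seconds per gradient evaluation (RStan run above)
pystan_grad = 0.000489  # seconds per gradient evaluation (PyStan run above)

print(round(rstan_grad * 10 * 1000, 2))    # 60.25, matching the RStan message
print(round(pystan_grad * 10 * 1000, 2))   # 4.89, matching the PyStan message
print(round(rstan_grad / pystan_grad, 1))  # 12.3: RStan's gradient is ~12x slower
```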
Given what was written above, I will try on a different machine as soon as I have a chance.
Thanks, Bob, for following up on this!
As an update: on a different (Linux) box my example ran in the same time with both RStan and PyStan, so it looks to be some system-specific problem on my Mac laptop.
Mind sharing the configuration? It would help us, the developers, out if we understood how this behavior arises.
That's essentially whatever information you have about your OS, compiler, compiler flags, and interface versions.
I think if we had that we could try to reproduce it.
The simplest reasons this could differ are a different compiler or different compiler flags.
After that, it's a guessing game.
I ran this on my Mac and did not observe any difference in performance between the interfaces.
Operating system: macOS 10.13.3
gcc --version: Apple LLVM version 9.0.0 (clang-900.0.39.2)
R version: 3.4.3
RStan version: 2.17.3
Python version: 2.7.14
PyStan version: 2.17.1.0
~/.R/Makevars contents:
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined
CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256
I think the lesson learned here is to ask people for compiler settings AND to wipe out any old stuff.
Caching models is nice, but obviously an obstacle at times.
I found the culprit: the Makevars file was
CXXFLAGS=-O3 -mtune=native -march=native
CXXFLAGS= -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined
CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256
Instead of
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined
CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256
The second CXXFLAGS line, I guess, was overwriting the optimization flag.
I don't remember how that file was created; I probably copied code from different places. Following the RStan installation manual results in the correct file.
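The behavior follows from Make's plain = assignment: when a variable is assigned twice, the later assignment replaces the earlier one. Here is a small Python sketch of that resolution rule applied to the broken file above (my own illustration, not how R parses Makevars internally):

```python
def effective_vars(makevars: str) -> dict:
    # For plain NAME=value lines, a later assignment replaces an earlier one,
    # which is (simplified) how make resolves repeated variable definitions.
    out = {}
    for line in makevars.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            name, _, value = line.partition("=")
            out[name.strip()] = value.strip()
    return out

broken = """\
CXXFLAGS=-O3 -mtune=native -march=native
CXXFLAGS= -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined
CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256
"""

print(effective_vars(broken)["CXXFLAGS"])
# -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined
# Note: no -O3, so the model gets compiled without optimization.
```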
Another lesson is that we need some end-to-end integration/performance
tests making sure that the different interfaces produce exactly the same
results in roughly the same time.
By the way, if fit is a stanfit object, you can use the following to show how it was compiled:
fit@stanmodel@dso
S4 class cxxdso: dso_saved = TRUE, dso_filename = file204230ed44b7, size = 23.8 Mb.
And dso_last_path = '/tmp/Rtmpvm6BLt/file204230ed44b7.so'.
Created on: 'x86_64, linux-gnu' with 'CXXFLAGS = -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g $(LTO)'.
Loaded now: YES.
The signatures is/are as follows:
$file204230ed44b7
character(0)
I think having those end-to-end tests would be great. They wouldn't have solved this problem, but I think we should have them nonetheless.
Is there some way we can test for user compiler flags going awry? I don't think we can insist on -O3 optimization, but then I don't know a lot about builds.
Bit-level reproducibility requires locking down everything, including compiler, platform, compiler flags, hardware, OS, CPU, etc.
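One concrete reason: floating-point addition is not associative, so anything that reorders operations (a different compiler, optimization level, or vectorization width) can change results at the bit level. A minimal illustration:

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # 0.0 + 1.0 = 1.0
right = a + (b + c)  # b + c rounds back to -1e16 at this magnitude, giving 0.0

print(left == right)  # False: the grouping of the same three terms matters
```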