Hi everyone,
DISCLAIMER: long post here
I have been experimenting with moving some very time-consuming algorithms onto the computing cluster available to my company.
Just to understand how to use the queues etc., I started with my simplest model: an ordered logistic regression with N = 6497 samples, 11 predictors, and 7 ordinal scores. This model will eventually evolve into a hierarchical formulation, but for now it is a straightforward copy-paste from the Stan reference manual (v2.17.0, page 138). No priors are specified: it is exactly as in the book.
data {
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
  int<lower=1,upper=K> y[N];
  row_vector[D] x[N];
}
parameters {
  vector[D] beta;
  ordered[K-1] c;
}
model {
  for (n in 1:N)
    y[n] ~ ordered_logistic(x[n] * beta, c);
}
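As a side note for anyone less familiar with this likelihood: here is a minimal, pure-Python sketch (not part of the model, just for intuition) of the category probabilities that ordered_logistic(eta, c) assigns. The cutpoint values below are only illustrative numbers, not real estimates.

```python
import math

def inv_logit(u):
    # numerically stable logistic function
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def ordered_logistic_probs(eta, cutpoints):
    """Category probabilities for an ordered logistic model:
    P(y = k) = inv_logit(c[k] - eta) - inv_logit(c[k-1] - eta),
    with the implicit conventions c[0] = -inf and c[K] = +inf.
    """
    cdf = [0.0] + [inv_logit(c - eta) for c in cutpoints] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cutpoints) + 1)]

# Example: 7 categories require 6 ordered cutpoints, as in the model above.
probs = ordered_logistic_probs(10.0, [4.7, 6.9, 10.1, 12.7, 15.0, 18.9])
```

By the telescoping construction the K probabilities are non-negative and sum to one, which is why the cutpoints must be declared as an ordered vector.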
I first tried it on my own desktop, an 8-core Intel® Xeon® CPU E31245 @ 3.30GHz with 8 GB RAM.
Running it through rstan with a single chain, 2000 iterations, and seed 12345 gave these results:
Elapsed Time: 972.488 seconds (Warm-up)
1088.72 seconds (Sampling)
2061.21 seconds (Total)
Inference for Stan model: orderedLogistic.
1 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=1000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
beta[1] 0.04 0.00 0.02 -0.01 0.02 0.04 0.05 0.08
beta[2] -3.87 0.01 0.20 -4.26 -4.00 -3.87 -3.73 -3.49
beta[3] -0.27 0.01 0.20 -0.65 -0.40 -0.27 -0.14 0.12
beta[4] 0.06 0.00 0.01 0.05 0.05 0.06 0.06 0.07
beta[5] -1.27 0.02 0.63 -2.55 -1.72 -1.26 -0.85 -0.09
beta[6] 0.02 0.00 0.00 0.01 0.02 0.02 0.02 0.02
beta[7] -0.01 0.00 0.00 -0.01 -0.01 -0.01 -0.01 0.00
beta[8] -0.10 0.04 1.00 -2.23 -0.75 -0.11 0.60 1.79
beta[9] 0.53 0.01 0.17 0.20 0.41 0.52 0.65 0.87
beta[10] 1.65 0.01 0.18 1.31 1.53 1.65 1.76 2.01
beta[11] 0.90 0.00 0.03 0.85 0.88 0.90 0.92 0.95
c[1] 4.70 0.06 1.25 2.14 3.83 4.69 5.59 7.08
c[2] 6.91 0.06 1.24 4.40 6.07 6.90 7.78 9.25
c[3] 10.08 0.06 1.24 7.55 9.25 10.09 10.96 12.43
c[4] 12.67 0.06 1.25 10.13 11.83 12.68 13.54 15.06
c[5] 15.00 0.06 1.25 12.47 14.15 14.98 15.87 17.46
c[6] 18.86 0.06 1.33 16.29 17.92 18.87 19.75 21.46
lp__ -7106.86 0.15 2.83 -7113.57 -7108.57 -7106.56 -7104.87 -7102.19
n_eff Rhat
beta[1] 665 1
beta[2] 592 1
beta[3] 599 1
beta[4] 1000 1
beta[5] 1000 1
beta[6] 1000 1
beta[7] 1000 1
beta[8] 648 1
beta[9] 543 1
beta[10] 793 1
beta[11] 660 1
c[1] 431 1
c[2] 435 1
c[3] 434 1
c[4] 433 1
c[5] 433 1
c[6] 448 1
lp__ 381 1
Samples were drawn using NUTS(diag_e) at Tue Jun 5 13:33:19 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Rhat is 1 for all parameters, n_eff is quite high, no divergences.
Running the same model on the same machine, with the same data and the same seed, but with CmdStan, gave:
Inference for Stan model: orderedLogistic_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.
Warmup took (1165) seconds, 19 minutes total
Sampling took (929) seconds, 15 minutes total
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -7.1e+03 1.3e-01 2.9e+00 -7.1e+03 -7.1e+03 -7.1e+03 493 5.3e-01 1.0e+00
accept_stat__ 9.4e-01 2.4e-03 7.7e-02 7.9e-01 9.8e-01 1.0e+00 1000 1.1e+00 1.0e+00
stepsize__ 1.3e-02 5.6e-17 4.0e-17 1.3e-02 1.3e-02 1.3e-02 0.50 5.4e-04 1.0e+00
treedepth__ 8.0e+00 6.0e-03 1.9e-01 8.0e+00 8.0e+00 8.0e+00 1000 1.1e+00 1.0e+00
n_leapfrog__ 2.8e+02 2.5e+00 7.6e+01 2.6e+02 2.6e+02 5.1e+02 946 1.0e+00 1.0e+00
divergent__ 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00 1000 1.1e+00 -nan
energy__ 7.1e+03 2.2e-01 4.1e+00 7.1e+03 7.1e+03 7.1e+03 325 3.5e-01 1.0e+00
beta[1] 3.4e-02 8.6e-04 2.5e-02 -7.6e-03 3.5e-02 7.5e-02 828 8.9e-01 1.0e+00
beta[2] -3.9e+00 7.1e-03 1.9e-01 -4.2e+00 -3.9e+00 -3.5e+00 719 7.7e-01 1.0e+00
beta[3] -2.6e-01 7.5e-03 2.0e-01 -5.9e-01 -2.6e-01 8.9e-02 743 8.0e-01 1.0e+00
beta[4] 5.8e-02 1.9e-04 5.9e-03 4.9e-02 5.8e-02 6.8e-02 1000 1.1e+00 1.0e+00
beta[5] -1.4e+00 2.1e-02 6.4e-01 -2.4e+00 -1.4e+00 -2.9e-01 943 1.0e+00 1.0e+00
beta[6] 1.8e-02 6.4e-05 2.0e-03 1.5e-02 1.8e-02 2.1e-02 1000 1.1e+00 1.0e+00
beta[7] -6.3e-03 2.3e-05 7.3e-04 -7.5e-03 -6.3e-03 -5.1e-03 1000 1.1e+00 1.0e+00
beta[8] -1.4e-01 3.8e-02 1.0e+00 -1.8e+00 -1.2e-01 1.5e+00 758 8.2e-01 1.0e+00
beta[9] 5.1e-01 5.8e-03 1.7e-01 2.1e-01 5.2e-01 7.8e-01 866 9.3e-01 1.0e+00
beta[10] 1.7e+00 5.8e-03 1.8e-01 1.4e+00 1.7e+00 2.0e+00 1000 1.1e+00 1.0e+00
beta[11] 9.0e-01 9.3e-04 2.6e-02 8.6e-01 9.0e-01 9.4e-01 796 8.6e-01 1.0e+00
c[1] 4.6e+00 5.2e-02 1.2e+00 2.5e+00 4.6e+00 6.6e+00 544 5.9e-01 1.0e+00
c[2] 6.8e+00 5.1e-02 1.2e+00 4.8e+00 6.8e+00 8.8e+00 554 6.0e-01 1.0e+00
c[3] 1.0e+01 5.2e-02 1.2e+00 7.9e+00 1.0e+01 1.2e+01 546 5.9e-01 1.0e+00
c[4] 1.3e+01 5.2e-02 1.2e+00 1.1e+01 1.3e+01 1.5e+01 544 5.9e-01 1.0e+00
c[5] 1.5e+01 5.2e-02 1.2e+00 1.3e+01 1.5e+01 1.7e+01 549 5.9e-01 1.0e+00
c[6] 1.9e+01 5.5e-02 1.3e+00 1.7e+01 1.9e+01 2.1e+01 560 6.0e-01 1.0e+00
Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at
convergence, R_hat=1).
So, more or less the same run time, and the same results.
BUT when I ran the same model on the same data, with the same number of iterations, the same seed, and 8 GB of memory, using a CmdStan that I built myself on the cluster, this is what I got:
Inference for Stan model: orderedLogistic_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.
Warmup took (2034) seconds, 34 minutes total
Sampling took (3813) seconds, 1.1 hours total
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -7.1e+03 3.9e-01 3.0e+00 -7.1e+03 -7.1e+03 -7.1e+03 58 1.5e-02 1.0e+00
accept_stat__ 9.3e-01 3.1e-03 9.7e-02 7.3e-01 9.7e-01 1.0e+00 1000 2.6e-01 1.0e+00
stepsize__ 4.4e-04 3.1e-18 2.2e-18 4.4e-04 4.4e-04 4.4e-04 0.50 1.3e-04 1.0e+00
treedepth__ 1.0e+01 1.2e-15 3.7e-14 1.0e+01 1.0e+01 1.0e+01 1000 2.6e-01 1.0e+00
n_leapfrog__ 1.0e+03 7.9e-14 2.5e-12 1.0e+03 1.0e+03 1.0e+03 1000 2.6e-01 1.0e+00
divergent__ 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00 1000 2.6e-01 -nan
energy__ 7.1e+03 4.0e-01 4.2e+00 7.1e+03 7.1e+03 7.1e+03 108 2.8e-02 1.0e+00
beta[1] 2.1e-01 1.4e-02 4.1e-02 1.4e-01 2.1e-01 2.7e-01 8.8 2.3e-03 1.6e+00
beta[2] -3.6e+00 4.1e-02 2.0e-01 -3.9e+00 -3.6e+00 -3.2e+00 23 6.1e-03 1.0e+00
beta[3] -3.0e-01 5.3e-02 2.0e-01 -6.4e-01 -2.9e-01 1.3e-02 15 3.8e-03 1.0e+00
beta[4] 1.2e-01 3.9e-03 1.2e-02 1.0e-01 1.2e-01 1.4e-01 9.2 2.4e-03 1.6e+00
beta[5] -1.4e+00 2.1e-01 8.7e-01 -2.7e+00 -1.6e+00 1.4e-01 17 4.4e-03 1.1e+00
beta[6] 1.8e-02 2.9e-04 2.2e-03 1.5e-02 1.8e-02 2.2e-02 56 1.5e-02 1.0e+00
beta[7] -7.2e-03 9.7e-05 7.4e-04 -8.3e-03 -7.2e-03 -6.0e-03 58 1.5e-02 1.0e+00
beta[8] -1.6e+02 9.8e+00 2.6e+01 -2.1e+02 -1.6e+02 -1.2e+02 7.1 1.9e-03 1.7e+00
beta[9] 1.3e+00 8.0e-02 2.1e-01 9.8e-01 1.3e+00 1.7e+00 6.8 1.8e-03 1.4e+00
beta[10] 2.2e+00 6.0e-02 1.8e-01 1.9e+00 2.2e+00 2.4e+00 8.6 2.3e-03 1.2e+00
beta[11] 7.0e-01 1.3e-02 3.6e-02 6.5e-01 7.0e-01 7.6e-01 8.2 2.1e-03 1.3e+00
c[1] -1.6e+02 9.6e+00 2.6e+01 -2.0e+02 -1.5e+02 -1.1e+02 7.2 1.9e-03 1.7e+00
c[2] -1.5e+02 9.6e+00 2.6e+01 -2.0e+02 -1.5e+02 -1.1e+02 7.2 1.9e-03 1.7e+00
c[3] -1.5e+02 9.6e+00 2.6e+01 -1.9e+02 -1.5e+02 -1.1e+02 7.2 1.9e-03 1.7e+00
c[4] -1.5e+02 9.6e+00 2.6e+01 -1.9e+02 -1.5e+02 -1.0e+02 7.2 1.9e-03 1.7e+00
c[5] -1.5e+02 9.6e+00 2.6e+01 -1.9e+02 -1.4e+02 -1.0e+02 7.2 1.9e-03 1.7e+00
c[6] -1.4e+02 9.5e+00 2.6e+01 -1.8e+02 -1.4e+02 -9.9e+01 7.3 1.9e-03 1.7e+00
Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at
convergence, R_hat=1).
I am not so concerned that the run took 1 h 40 min on the cluster versus 35 min on my local machine; what really surprises me is that the cutpoints collapse to large negative values, n_eff is incredibly small, and the Rhat values show bad convergence.
I am really puzzled. What is going on between the two machines?