Stan on computing cluster: strange results

Hi everyone,
DISCLAIMER: long post here
I ran some experiments to move some very time-consuming algorithms onto the computing cluster available at my company.
Just to understand how to use queues etc., I started with my simplest model: an ordered logistic regression with N=6497 samples, 11 predictors, and 7 different scores. This model will evolve into a hierarchical formulation, but so far it is just a straightforward copy-paste from the Stan reference manual (v.2.17.0, page 138). No priors are specified: it is just as in the book.

data {
    int<lower=2> K;
    int<lower=0> N;
    int<lower=1> D;
    int<lower=1,upper=K> y[N];
    row_vector[D] x[N];
}

parameters {
    vector[D] beta;
    ordered[K-1] c;
}

model {
 
    for (n in 1:N)
        y[n] ~ ordered_logistic(x[n] * beta, c);
}
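
For readers unfamiliar with this likelihood, here is a minimal Python sketch of the category probabilities that ordered_logistic assigns, following the definition in the Stan manual. The function names inv_logit and ordered_logistic_probs and the example values are illustrative only; they are not part of Stan or of the model above.

```python
import math


def inv_logit(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def ordered_logistic_probs(eta, c):
    """Return [P(y=1), ..., P(y=K)] for linear predictor eta and
    strictly increasing cutpoints c of length K-1."""
    if any(a >= b for a, b in zip(c, c[1:])):
        raise ValueError("cutpoints must be strictly increasing")
    # inv_logit(eta - c[k]) is the probability of landing above cutpoint k
    above = [inv_logit(eta - ck) for ck in c]
    bounds = [1.0] + above + [0.0]
    # Each category probability is the difference of neighbouring bounds
    return [bounds[k] - bounds[k + 1] for k in range(len(c) + 1)]


# Hypothetical values, chosen only to exercise the function
probs = ordered_logistic_probs(eta=0.5, c=[-1.0, 0.0, 2.0])
shifted = ordered_logistic_probs(eta=5.5, c=[4.0, 5.0, 7.0])
print(probs)
print(shifted)
```

The probabilities depend on eta and c only through the differences eta - c[k], so adding the same constant to the linear predictor and to every cutpoint leaves them unchanged, which is why the two printed lists agree.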

I started by trying it on my own desktop, an 8-processor Intel® Xeon® CPU E31245 @ 3.30GHz with 8GB RAM.
rstan with just one chain, 2000 iterations, and a fixed seed (12345) gave these results:

Elapsed Time: 972.488 seconds (Warm-up)
               1088.72 seconds (Sampling)
               2061.21 seconds (Total)
Inference for Stan model: orderedLogistic.                                                                                                                                                                                                                                                                                                           
1 chains, each with iter=2000; warmup=1000; thin=1;                                                                                                                                                                                                                                                                                                  
post-warmup draws per chain=1000, total post-warmup draws=1000.                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                     
             mean se_mean   sd     2.5%      25%      50%      75%    97.5%                                                                                                                                                                                                                                                                          
beta[1]      0.04    0.00 0.02    -0.01     0.02     0.04     0.05     0.08                                                                                                                                                                                                                                                                          
beta[2]     -3.87    0.01 0.20    -4.26    -4.00    -3.87    -3.73    -3.49
beta[3]     -0.27    0.01 0.20    -0.65    -0.40    -0.27    -0.14     0.12
beta[4]      0.06    0.00 0.01     0.05     0.05     0.06     0.06     0.07
beta[5]     -1.27    0.02 0.63    -2.55    -1.72    -1.26    -0.85    -0.09
beta[6]      0.02    0.00 0.00     0.01     0.02     0.02     0.02     0.02
beta[7]     -0.01    0.00 0.00    -0.01    -0.01    -0.01    -0.01     0.00
beta[8]     -0.10    0.04 1.00    -2.23    -0.75    -0.11     0.60     1.79
beta[9]      0.53    0.01 0.17     0.20     0.41     0.52     0.65     0.87
beta[10]     1.65    0.01 0.18     1.31     1.53     1.65     1.76     2.01
beta[11]     0.90    0.00 0.03     0.85     0.88     0.90     0.92     0.95
c[1]         4.70    0.06 1.25     2.14     3.83     4.69     5.59     7.08
c[2]         6.91    0.06 1.24     4.40     6.07     6.90     7.78     9.25
c[3]        10.08    0.06 1.24     7.55     9.25    10.09    10.96    12.43
c[4]        12.67    0.06 1.25    10.13    11.83    12.68    13.54    15.06
c[5]        15.00    0.06 1.25    12.47    14.15    14.98    15.87    17.46
c[6]        18.86    0.06 1.33    16.29    17.92    18.87    19.75    21.46
lp__     -7106.86    0.15 2.83 -7113.57 -7108.57 -7106.56 -7104.87 -7102.19
         n_eff Rhat
beta[1]    665    1
beta[2]    592    1
beta[3]    599    1
beta[4]   1000    1
beta[5]   1000    1
beta[6]   1000    1
beta[7]   1000    1
beta[8]    648    1
beta[9]    543    1
beta[10]   793    1
beta[11]   660    1
c[1]       431    1
c[2]       435    1
c[3]       434    1
c[4]       433    1
c[5]       433    1
c[6]       448    1
lp__       381    1

Samples were drawn using NUTS(diag_e) at Tue Jun  5 13:33:19 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Rhat is 1 for all parameters, n_eff is quite high, no divergences.

I then ran the same model on my machine, on the same data and with the same seed, using CmdStan, which gave

Inference for Stan model: orderedLogistic_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (1165) seconds, 19 minutes total
Sampling took (929) seconds, 15 minutes total

                    Mean     MCSE   StdDev        5%       50%       95%  N_Eff  N_Eff/s    R_hat
lp__            -7.1e+03  1.3e-01  2.9e+00  -7.1e+03  -7.1e+03  -7.1e+03    493  5.3e-01  1.0e+00
accept_stat__    9.4e-01  2.4e-03  7.7e-02   7.9e-01   9.8e-01   1.0e+00   1000  1.1e+00  1.0e+00
stepsize__       1.3e-02  5.6e-17  4.0e-17   1.3e-02   1.3e-02   1.3e-02   0.50  5.4e-04  1.0e+00
treedepth__      8.0e+00  6.0e-03  1.9e-01   8.0e+00   8.0e+00   8.0e+00   1000  1.1e+00  1.0e+00
n_leapfrog__     2.8e+02  2.5e+00  7.6e+01   2.6e+02   2.6e+02   5.1e+02    946  1.0e+00  1.0e+00
divergent__      0.0e+00  0.0e+00  0.0e+00   0.0e+00   0.0e+00   0.0e+00   1000  1.1e+00     -nan
energy__         7.1e+03  2.2e-01  4.1e+00   7.1e+03   7.1e+03   7.1e+03    325  3.5e-01  1.0e+00
beta[1]          3.4e-02  8.6e-04  2.5e-02  -7.6e-03   3.5e-02   7.5e-02    828  8.9e-01  1.0e+00
beta[2]         -3.9e+00  7.1e-03  1.9e-01  -4.2e+00  -3.9e+00  -3.5e+00    719  7.7e-01  1.0e+00
beta[3]         -2.6e-01  7.5e-03  2.0e-01  -5.9e-01  -2.6e-01   8.9e-02    743  8.0e-01  1.0e+00
beta[4]          5.8e-02  1.9e-04  5.9e-03   4.9e-02   5.8e-02   6.8e-02   1000  1.1e+00  1.0e+00
beta[5]         -1.4e+00  2.1e-02  6.4e-01  -2.4e+00  -1.4e+00  -2.9e-01    943  1.0e+00  1.0e+00
beta[6]          1.8e-02  6.4e-05  2.0e-03   1.5e-02   1.8e-02   2.1e-02   1000  1.1e+00  1.0e+00
beta[7]         -6.3e-03  2.3e-05  7.3e-04  -7.5e-03  -6.3e-03  -5.1e-03   1000  1.1e+00  1.0e+00
beta[8]         -1.4e-01  3.8e-02  1.0e+00  -1.8e+00  -1.2e-01   1.5e+00    758  8.2e-01  1.0e+00
beta[9]          5.1e-01  5.8e-03  1.7e-01   2.1e-01   5.2e-01   7.8e-01    866  9.3e-01  1.0e+00
beta[10]         1.7e+00  5.8e-03  1.8e-01   1.4e+00   1.7e+00   2.0e+00   1000  1.1e+00  1.0e+00
beta[11]         9.0e-01  9.3e-04  2.6e-02   8.6e-01   9.0e-01   9.4e-01    796  8.6e-01  1.0e+00
c[1]             4.6e+00  5.2e-02  1.2e+00   2.5e+00   4.6e+00   6.6e+00    544  5.9e-01  1.0e+00
c[2]             6.8e+00  5.1e-02  1.2e+00   4.8e+00   6.8e+00   8.8e+00    554  6.0e-01  1.0e+00
c[3]             1.0e+01  5.2e-02  1.2e+00   7.9e+00   1.0e+01   1.2e+01    546  5.9e-01  1.0e+00
c[4]             1.3e+01  5.2e-02  1.2e+00   1.1e+01   1.3e+01   1.5e+01    544  5.9e-01  1.0e+00
c[5]             1.5e+01  5.2e-02  1.2e+00   1.3e+01   1.5e+01   1.7e+01    549  5.9e-01  1.0e+00
c[6]             1.9e+01  5.5e-02  1.3e+00   1.7e+01   1.9e+01   2.1e+01    560  6.0e-01  1.0e+00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).

So basically same time to run (more or less), same results.

BUT when I run the same model on the same data, with the same number of iterations, the same seed, and 8GB of memory, using a CmdStan that I built myself on the cluster, this is what I get

Inference for Stan model: orderedLogistic_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (2034) seconds, 34 minutes total
Sampling took (3813) seconds, 1.1 hours total
                                                                                                                                                                                                                                                                               
                    Mean     MCSE   StdDev        5%       50%       95%  N_Eff  N_Eff/s    R_hat                                                                                                                                                                              
lp__            -7.1e+03  3.9e-01  3.0e+00  -7.1e+03  -7.1e+03  -7.1e+03     58  1.5e-02  1.0e+00                                                                                                                                                                              
accept_stat__    9.3e-01  3.1e-03  9.7e-02   7.3e-01   9.7e-01   1.0e+00   1000  2.6e-01  1.0e+00                                                                                                                                                                              
stepsize__       4.4e-04  3.1e-18  2.2e-18   4.4e-04   4.4e-04   4.4e-04   0.50  1.3e-04  1.0e+00                                                                                                                                                                              
treedepth__      1.0e+01  1.2e-15  3.7e-14   1.0e+01   1.0e+01   1.0e+01   1000  2.6e-01  1.0e+00                                                                                                                                                                              
n_leapfrog__     1.0e+03  7.9e-14  2.5e-12   1.0e+03   1.0e+03   1.0e+03   1000  2.6e-01  1.0e+00                                                                                                                                                                              
divergent__      0.0e+00  0.0e+00  0.0e+00   0.0e+00   0.0e+00   0.0e+00   1000  2.6e-01     -nan                                                                                                                                                                              
energy__         7.1e+03  4.0e-01  4.2e+00   7.1e+03   7.1e+03   7.1e+03    108  2.8e-02  1.0e+00                                                                                                                                                                              
beta[1]          2.1e-01  1.4e-02  4.1e-02   1.4e-01   2.1e-01   2.7e-01    8.8  2.3e-03  1.6e+00                                                                                                                                                                              
beta[2]         -3.6e+00  4.1e-02  2.0e-01  -3.9e+00  -3.6e+00  -3.2e+00     23  6.1e-03  1.0e+00                                                                                                                                                                              
beta[3]         -3.0e-01  5.3e-02  2.0e-01  -6.4e-01  -2.9e-01   1.3e-02     15  3.8e-03  1.0e+00                                                                                                                                                                              
beta[4]          1.2e-01  3.9e-03  1.2e-02   1.0e-01   1.2e-01   1.4e-01    9.2  2.4e-03  1.6e+00                                                                                                                                                                              
beta[5]         -1.4e+00  2.1e-01  8.7e-01  -2.7e+00  -1.6e+00   1.4e-01     17  4.4e-03  1.1e+00                                                                                                                                                                              
beta[6]          1.8e-02  2.9e-04  2.2e-03   1.5e-02   1.8e-02   2.2e-02     56  1.5e-02  1.0e+00                                                                                                                                                                              
beta[7]         -7.2e-03  9.7e-05  7.4e-04  -8.3e-03  -7.2e-03  -6.0e-03     58  1.5e-02  1.0e+00                                                                                                                                                                              
beta[8]         -1.6e+02  9.8e+00  2.6e+01  -2.1e+02  -1.6e+02  -1.2e+02    7.1  1.9e-03  1.7e+00                                                                                                                                                                              
beta[9]          1.3e+00  8.0e-02  2.1e-01   9.8e-01   1.3e+00   1.7e+00    6.8  1.8e-03  1.4e+00                                                                                                                                                                              
beta[10]         2.2e+00  6.0e-02  1.8e-01   1.9e+00   2.2e+00   2.4e+00    8.6  2.3e-03  1.2e+00                                                                                                                                                                              
beta[11]         7.0e-01  1.3e-02  3.6e-02   6.5e-01   7.0e-01   7.6e-01    8.2  2.1e-03  1.3e+00                                                                                                                                                                              
c[1]            -1.6e+02  9.6e+00  2.6e+01  -2.0e+02  -1.5e+02  -1.1e+02    7.2  1.9e-03  1.7e+00                                                                                                                                                                              
c[2]            -1.5e+02  9.6e+00  2.6e+01  -2.0e+02  -1.5e+02  -1.1e+02    7.2  1.9e-03  1.7e+00                                                                                                                                                                              
c[3]            -1.5e+02  9.6e+00  2.6e+01  -1.9e+02  -1.5e+02  -1.1e+02    7.2  1.9e-03  1.7e+00                                                                                                                                                                              
c[4]            -1.5e+02  9.6e+00  2.6e+01  -1.9e+02  -1.5e+02  -1.0e+02    7.2  1.9e-03  1.7e+00
c[5]            -1.5e+02  9.6e+00  2.6e+01  -1.9e+02  -1.4e+02  -1.0e+02    7.2  1.9e-03  1.7e+00
c[6]            -1.4e+02  9.5e+00  2.6e+01  -1.8e+02  -1.4e+02  -9.9e+01    7.3  1.9e-03  1.7e+00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).

I am not concerned that the run took about 1 h 40 min on the cluster versus about 35 min on my local machine. What really surprises me is that my cutpoints collapse, n_eff is incredibly small, and the Rhat values indicate bad convergence.

I am really puzzled. What is going on on the two different machines?
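
A quick back-of-the-envelope check on the two CmdStan summaries above may help narrow things down. The sketch below uses only the rounded values printed in those summaries (sampling time, 1000 saved iterations, mean n_leapfrog__, stepsize__); the helper name seconds_per_leapfrog is made up for illustration.

```python
def seconds_per_leapfrog(sampling_seconds, iterations, mean_leapfrog):
    """Average wall-clock cost of one leapfrog step during sampling."""
    return sampling_seconds / (iterations * mean_leapfrog)


# Values copied from the two CmdStan summaries (rounded as printed)
local = seconds_per_leapfrog(929, 1000, 2.8e2)
cluster = seconds_per_leapfrog(3813, 1000, 1.0e3)
stepsize_ratio = 1.3e-2 / 4.4e-4

print(f"local:   {local * 1e3:.2f} ms per leapfrog step")
print(f"cluster: {cluster * 1e3:.2f} ms per leapfrog step")
print(f"adapted step size is about {stepsize_ratio:.0f}x smaller on the cluster")
```

This works out to roughly 3.3 ms per leapfrog step locally versus 3.8 ms on the cluster, so each gradient evaluation costs about the same on both machines. The extra run time comes from the sampler adapting to a much smaller step size and hitting treedepth 10 on every iteration, which suggests the difference lies in the numerics rather than in raw hardware speed.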

Guessing different compiler options.

@bgoodri: Thanks!
Oooook, this is a difficult and scary subject for me. Where can I see what the compiler options are on the two machines?
Maybe it is related to the rstan procedure that sets:

CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function  -Wno-macro-redefined

For rstan, yes. Or at least that is one of the Makefiles that gets sourced. The CmdStan equivalent is under make/local, which you may need to create if it does not exist. Also, your cluster may have -ffast-math or similar enabled by default, which is too sloppy to be useful for NUTS.
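
As a rough aid for comparing the two machines, one could copy the flag string from each build configuration (for example the CXXFLAGS line in make/local) and scan it for options that switch on fast-math behaviour in GCC or Clang. This is only a sketch: risky_flags is a hypothetical helper, the list of flags is not exhaustive, and it cannot see defaults that a cluster's compiler wrapper or environment adds on its own.

```python
import shlex

# GCC/Clang options that enable some or all fast-math behaviour.
# Not an exhaustive list; treat it as a starting point.
RISKY = {
    "-ffast-math",
    "-Ofast",  # implies -ffast-math
    "-funsafe-math-optimizations",
    "-ffinite-math-only",
    "-fassociative-math",
}


def risky_flags(flag_string):
    """Return the fast-math related options found in a flag string.

    This only reports presence; it does not resolve cases where a later
    option (such as -fno-fast-math) overrides an earlier one.
    """
    return [tok for tok in shlex.split(flag_string) if tok in RISKY]


# The rstan flags quoted in this thread
rstan_flags = ("-O3 -mtune=native -march=native -Wno-unused-variable "
               "-Wno-unused-function  -Wno-macro-redefined")
print(risky_flags(rstan_flags))
# A made-up flag string for comparison
print(risky_flags("-O2 -ffast-math"))
```

If the scan of your own flags comes back empty but results still differ, the setting may be coming from somewhere other than the flag string you inspected.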

There have also been some issues where the Rstan code was using a different equation to calculate Neff. You might want to pull from github:

I don’t think you should see this much difference from compiler options. Are you sure the model versions, data, and control arguments all match?

Thanks all.
@sakrejda: I double-checked everything. Model, data and program version are the same.
So I moved on to compiler flags. Following both of @bgoodri's suggestions, I set

CXXFLAGS = -O3 -mtune=native -march=native -Wno-unused-variable -Wno-macro-redefined -fno-fast-math

Now both the CmdStan run times and the final results are comparable on my machine and on the cluster.
Here are the results on the cluster:

Inference for Stan model: orderedLogistic_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (1216) seconds, 20 minutes total
Sampling took (995) seconds, 17 minutes total

                    Mean     MCSE   StdDev        5%       50%       95%  N_Eff  N_Eff/s    R_hat
lp__            -7.1e+03  1.4e-01  3.0e+00  -7.1e+03  -7.1e+03  -7.1e+03    447  4.5e-01  1.0e+00
accept_stat__    9.4e-01  2.7e-03  8.4e-02   7.4e-01   9.7e-01   1.0e+00   1000  1.0e+00  1.0e+00
stepsize__       1.4e-02  4.9e-17  3.5e-17   1.4e-02   1.4e-02   1.4e-02   0.50  5.0e-04  1.0e+00
treedepth__      8.0e+00  6.7e-03  2.1e-01   8.0e+00   8.0e+00   8.0e+00   1000  1.0e+00  1.0e+00
n_leapfrog__     2.6e+02  1.7e+00  5.4e+01   2.6e+02   2.6e+02   2.6e+02   1000  1.0e+00  1.0e+00
divergent__      0.0e+00  0.0e+00  0.0e+00   0.0e+00   0.0e+00   0.0e+00   1000  1.0e+00     -nan
energy__         7.1e+03  2.1e-01  4.2e+00   7.1e+03   7.1e+03   7.1e+03    389  3.9e-01  1.0e+00
beta[1]          3.4e-02  1.0e-03  2.6e-02  -7.0e-03   3.5e-02   7.7e-02    634  6.4e-01  1.0e+00
beta[2]         -3.9e+00  5.8e-03  1.8e-01  -4.2e+00  -3.9e+00  -3.6e+00   1000  1.0e+00  1.0e+00
beta[3]         -2.6e-01  6.2e-03  2.0e-01  -5.8e-01  -2.6e-01   7.5e-02   1000  1.0e+00  1.0e+00
beta[4]          5.8e-02  1.9e-04  6.0e-03   4.8e-02   5.8e-02   6.8e-02   1000  1.0e+00  1.0e+00
beta[5]         -1.3e+00  2.0e-02  6.4e-01  -2.3e+00  -1.3e+00  -3.2e-01   1000  1.0e+00  1.0e+00
beta[6]          1.8e-02  6.4e-05  2.0e-03   1.5e-02   1.8e-02   2.1e-02   1000  1.0e+00  1.0e+00
beta[7]         -6.3e-03  2.3e-05  7.2e-04  -7.5e-03  -6.3e-03  -5.1e-03   1000  1.0e+00  1.0e+00
beta[8]         -1.5e-01  3.7e-02  9.8e-01  -1.8e+00  -1.3e-01   1.4e+00    708  7.1e-01  1.0e+00
beta[9]          5.2e-01  7.1e-03  1.8e-01   2.2e-01   5.1e-01   8.0e-01    653  6.6e-01  1.0e+00
beta[10]         1.7e+00  5.6e-03  1.8e-01   1.4e+00   1.7e+00   2.0e+00   1000  1.0e+00  1.0e+00
beta[11]         9.0e-01  9.7e-04  2.7e-02   8.5e-01   9.0e-01   9.4e-01    753  7.6e-01  1.0e+00
c[1]             4.6e+00  5.3e-02  1.2e+00   2.5e+00   4.6e+00   6.5e+00    538  5.4e-01  1.0e+00
c[2]             6.8e+00  5.3e-02  1.2e+00   4.7e+00   6.8e+00   8.7e+00    533  5.4e-01  1.0e+00
c[3]             1.0e+01  5.3e-02  1.2e+00   7.8e+00   1.0e+01   1.2e+01    529  5.3e-01  1.0e+00
c[4]             1.3e+01  5.4e-02  1.2e+00   1.0e+01   1.3e+01   1.4e+01    526  5.3e-01  1.0e+00
c[5]             1.5e+01  5.4e-02  1.2e+00   1.3e+01   1.5e+01   1.7e+01    526  5.3e-01  1.0e+00
c[6]             1.9e+01  5.5e-02  1.3e+00   1.7e+01   1.9e+01   2.1e+01    547  5.5e-01  1.0e+00
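
To put "comparable" on a slightly firmer footing, one can check whether the posterior means from two runs differ by more than their Monte Carlo standard errors would explain. The sketch below does this for beta[8], using the Mean and MCSE columns of the three CmdStan summaries in this thread. The helper mean_gap_in_mcse is hypothetical, and the MCSE of the first cluster run is itself unreliable because its N_Eff is so small, so treat that number as indicative only.

```python
import math


def mean_gap_in_mcse(mean_a, mcse_a, mean_b, mcse_b):
    """Difference between two posterior means, in units of their
    combined Monte Carlo standard error."""
    return abs(mean_a - mean_b) / math.sqrt(mcse_a ** 2 + mcse_b ** 2)


# (Mean, MCSE) for beta[8], copied from the CmdStan summaries
local = (-1.4e-1, 3.8e-2)
cluster_before = (-1.6e+2, 9.8e+0)  # cluster, before changing the flags
cluster_after = (-1.5e-1, 3.7e-2)   # cluster, with -fno-fast-math

gap_before = mean_gap_in_mcse(*local, *cluster_before)
gap_after = mean_gap_in_mcse(*local, *cluster_after)
print(f"before: {gap_before:.1f} combined MCSEs apart")
print(f"after:  {gap_after:.1f} combined MCSEs apart")
```

A gap well below 2 is what you would expect from two healthy runs targeting the same posterior; the run before the flag change is off by far more than that.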

I could see the fast math flag causing that problem.
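
As a generic illustration of why value-changing floating-point optimizations matter (not a diagnosis of what happened inside Stan here), consider compensated summation. The correction term below is algebraically zero, so a compiler that is allowed to reassociate floating-point operations, as fast-math permits, may simplify it away and silently turn the compensated sum back into the naive one. Python performs no such rewriting, so this sketch only shows what the correction term buys you when it is kept.

```python
def naive_sum(values):
    """Plain left-to-right floating-point sum."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated (Kahan) summation."""
    total = 0.0
    comp = 0.0  # running estimate of the rounding error lost so far
    for v in values:
        y = v - comp
        t = total + y
        # Algebraically zero, numerically the rounding error of total + y
        comp = (t - total) - y
        total = t
    return total


# One large term followed by many terms too small to change it one at a time
values = [1.0] + [1e-16] * 100_000
naive = naive_sum(values)
kahan = kahan_sum(values)
print(repr(naive))
print(repr(kahan))
```

The naive sum never moves away from 1.0, while the compensated sum recovers the accumulated contribution of the small terms.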

Did it fix the RStan issue as well?

I cannot use R on the cluster, so I am finding my way with CmdStan for the first time. Now I get similar results on my machine (with rstan) and on the cluster (with CmdStan), with the same running time. I hope to figure out how to speed things up in the future, since I have a multilevel model in mind.

I had a similar problem with very slow Rstan code on the server when it ran fine on my laptop. I reinstalled Stan and recompiled my R package after deleting the Makevars file on the server and the performance went back to being almost equivalent. Now I wish I had spent more time identifying which flag was the problem. They can make a big difference, though.

So in this case the problem was not the presence of a special compiler flag but the absence of one?

I had to add some compiler flags before compiling CmdStan (and subsequently my model) on the cluster.

CXXFLAGS = -O3 -mtune=native -march=native -Wno-unused-variable -Wno-macro-redefined -fno-fast-math

The result is that rstan and CmdStan on my machine (compiled strictly following the guide) and CmdStan on the server (compiled with the flags above) now give similar results and run at the same speed.

I hope I was clear.