CmdStan slower than PyStan?

Operating System : Windows 10
Interface Version : CmdStan 2.19.0 / PyStan
Compiler/Toolkit: GCC 7.9

I tried to run a single chain of my model on both CmdStan and PyStan with the exact same parameters and data. The problem is that the total execution time is 10 times longer on CmdStan than on PyStan.

  • Here is the PyStan call of the sampling function:
    fit = sm.sampling(data=data_cpld, iter=2000, warmup=1000, chains=1, algorithm = 'NUTS')
    This command runs with no warnings, a total execution time of approximately 100 seconds, and good sampling results.

  • Now I’m trying to do the same thing on CmdStan:
    model.exe sample ^
        num_samples=1000 ^
        num_warmup=1000 ^
        data file=data_cpld.json
    and the total execution time is approximately 1000 seconds, even though the sampling results are the same.
    The gradient evaluation timing predicts a total execution time of 100 seconds, so it looks like I’m doing something wrong with CmdStan.

The difference is that on CmdStan, I’m getting multiple warnings during the early warmup phase (0%-5%):

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue: Exception: normal_lpdf: Location parameter is inf, but must be finite! (in 'STAN_USER_MODEL\coupled_simple\coupled_simple.stan' at line 55) If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine, but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

My model is well specified when running under PyStan, so it should be the same on CmdStan, right?

The program seems to be stuck in this early phase, but after 5% it seems to run at a constant speed, though still slower than PyStan.

Is there something I’m doing wrong with the CmdStan sample command?

Thank you

Can you post the model and data? The different interfaces should all perform about the same. 10x is big haha.

Here is my model and some data that I tried this morning; the elapsed times were:
PyStan: 20 s
CmdStan: 315 s
The results were good and identical, and I used the exact same inputs except the random seed, which doesn’t fit in PyStan’s sampling function:
ValueError: 'seed' is too large; max is 2147483647

I tried with other data and with more samples and there is always this factor of 10.
More surprisingly, I tried the variational algorithm on both interfaces, and for this data the results were:
PyStan: 2.64 s
CmdStan: 31.85 s

You are right, it should be the same. I might try another interface like RStan to compare and see what I’m doing wrong.

model.stan (1.3 KB)
data_short.txt (20.3 KB)


Have you changed optimization levels for cmdstan (-O0 vs -O3)?

E.g. in make/local
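For reference, a minimal make/local sketch setting the optimization level via the `O` make variable that CmdStan’s build exposes (the comment line is mine):

```make
# make/local — optimization level passed to the C++ compiler when building models
O=3
```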

Ok, I tried with O=0 and it is still the same

What about -O3

O=3 was default configuration

If you have the same performance with both of those, I think there might be some hidden configuration going on.

Do you have the data in a form that cmdstan would accept?
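For anyone following along, CmdStan reads data in the R dump format (a plain-text `.R`/`.data.R` file). A sketch with made-up values for this model’s inputs would look something like this (the actual data is attached later in the thread):

```r
# R-dump sketch, hypothetical values only
N_time <- 3
N_node <- 2
x1_sim <- structure(c(-2.0, -1.9, -2.1, -2.0, -1.8, -1.9), .Dim = c(2, 3))
K <- structure(c(0.0, 1.0, 1.0, 0.0), .Dim = c(2, 2))
dt <- 0.1
tau <- 1.0
```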

The matrices in the transformed parameters block are large. It might be that I/O is slowing CmdStan down. I’m not sure how PyStan does I/O. Try moving the transformed parameters to the model block and see how things go. Like this:

functions {
    real coupling(int t, int n, int N_node, matrix x1, matrix z, matrix K){ // sum( K * (x1_i - x1) )
        real s = 0.0;
        for(i in 1:N_node){
            s = s + K[n,i]*(x1[i,t] - x1[n,t]);
        }
        return s;
    }
}
data {
    int<lower=1> N_time;
    int<lower=1> N_node;
    matrix[N_node,N_time] x1_sim;
    matrix[N_node,N_node] K; // CM
    real dt;
    real tau;
}
transformed data {
    real I1 = 3.1;
}
parameters {
    vector<lower=-3.5, upper=-1.5>[N_node] x0; // m
    vector[N_node] x1_init;
    vector[N_node] z_init;
}
model {
    matrix[N_node,N_time] x1;
    matrix[N_node,N_time] z;

    x1[:,1] = x1_init;
    z[:,1] = z_init;

    for(t in 1:(N_time-1)){
        for(n in 1:N_node){
            x1[n,t+1] = x1[n,t] + dt * ( -(x1[n,t]^3) - 2*(x1[n,t]^2) + 1 - z[n,t] + I1);
            z[n,t+1] = z[n,t] + dt * tau * (4*(x1[n,t] - x0[n]) - z[n,t] - coupling(t,n,N_node, x1,z,K));
        }
    }

    x1_init ~ normal(-2.0, 0.1);
    z_init ~ normal(4.0, 0.1);
    for(n in 1:N_node){
        x0[n] ~ cauchy(-1.3, 1.0);
    }

    // Likelihood
    for(i in 1:N_time){
        for(n in 1:N_node){
            x1_sim[n,i] ~ normal(x1[n,i], 1.0);
        }
    }
}
I tried your suggestion and the results are the same: CmdStan is still slower than PyStan. I also tried another model (the Bernoulli example with 10000 samples) on the same configuration, and there the speed of CmdStan and PyStan is the same, so it looks like the problem is not my configuration but comes from my model.

@mms29 I’m getting the slowdown here too. Thanks for reporting.

@seantalts @Matthijs @mitzimorris I did some investigating and this model runs fast in Cmdstan 2.18.1 but runs slow in Cmdstan 2.19. I think it might have something to do with code generation because I took the header file generated by 2.18.1 and compiled it against the 2.19 libraries and it goes fast again. Could someone familiar with the code generation take a look at this and make the appropriate issue?

Here’s the model and data:
model2.stan (1.2 KB)
data_short.R (18.7 KB)

Here are the two different headers:
model2-2.18.1.hpp (24.2 KB)
model2-2.19.0.hpp (24.6 KB)

There are warnings that appear with 2.18.1 that do not appear with 2.19:

Warning: left-hand side variable (name=s) occurs on right-hand side of assignment, causing inefficient deep copy to avoid aliasing.
Warning: left-hand side variable (name=x1) occurs on right-hand side of assignment, causing inefficient deep copy to avoid aliasing.
Warning: left-hand side variable (name=z) occurs on right-hand side of assignment, causing inefficient deep copy to avoid aliasing.

Here’s the command I used for the cross-compile:

~/cmdstan-2.19.0$ g++ -std=c++1y -pthread -Wno-sign-compare     -O3 -I src -I stan/src -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.3 -I stan/lib/stan_math/lib/boost_1.69.0 -I stan/lib/stan_math/lib/sundials_4.1.0/include    -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION             -include ../cmdstan-2.18.1/model2.hpp src/cmdstan/main.cpp        stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_nvecserial.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_cvodes.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_idas.a  -o model2

This model pretty consistently takes about 2 minutes to run in 2.19, but only about 17 seconds with 2.18.1.


Wooph, that’s not good! Thank you both and thanks for verifying. @mitzimorris do you have time to dig into this?

starting to dig now.

yes, the generated code is problematic. looks like this is the first release using the refactor of the parser/generator code that went in back in September.

there are several interesting differences - investigating.

@seantalts - the 2.18.1 release went out in December - language refactor wasn’t included then?

That’s right, there was nothing in the 2.18.1 release except for a hot fix.

OK, I’ve found the problem and have fixed the code.
the problem (once again) was the interaction between indexed expressions and the refactor.

the fix is running as fast as the 2.18.1 release.
needs more testing.

the gory details:

model2.stan, line 37:

z[n,t+1] = z[n,t] + dt * tau* (4*(x1[n,t] - x0[n]) -z[n,t] -coupling(t,n,N_node, x1,z,K));

should be translated to:

current_statement_begin__ = 37;
                stan::model::assign(z,
                    stan::model::cons_list(stan::model::index_uni(n), stan::model::cons_list(stan::model::index_uni((t + 1)), stan::model::nil_index_list())), 
                    (get_base1(z, n, t, "z", 1) + ((dt * tau) * (((4 * (get_base1(x1, n, t, "x1", 1) - get_base1(x0, n, "x0", 1))) - get_base1(z, n, t, "z", 1)) - coupling(t, n, N_node, x1, z, K, pstream__)))), 
                    "assigning variable z");

but currently the generated code is doing a very bad bad thing:

current_statement_begin__ = 37;
                stan::model::assign(z,
                    stan::model::cons_list(stan::model::index_uni(n), stan::model::cons_list(stan::model::index_uni((t + 1)), stan::model::nil_index_list())), 
                    (get_base1(get_base1(z,n,"z",1),t,"z",2) + ((dt * tau) * (((4 * (get_base1(get_base1(x1,n,"x1",1),t,"x1",2) - get_base1(x0,n,"x0",1))) - get_base1(get_base1(z,n,"z",1),t,"z",2)) - coupling(t, n, N_node, x1, z, K, pstream__)))), 
                    "assigning variable z");

feel like cryin’


Wow! You found that so quickly! Thanks, @mitzimorris!

Where was the issue? I don’t think I saw any problems in the performance tests and that makes me sad (though maybe the emailing was broken again). But we should figure out how to make sure this part gets tested for performance from now on at least nightly if not on PRs.

the issue is matrix indexing.
the generated c++ code calls stan::math::get_base1.
for matrices, it can be called either with just the row index, to get the row, or with row and column indices, to get the entry; both are legal.

the refactor resulted in generated code where matrix indexing always pulls out the row, then pulls out the entry of the row vector. ouch!!! this should only happen when the intent of the indexing is to return a row_vector. in all other situations, the generated code should access the matrix by row, column - this wasn’t happening.

the generator logic is pretty twisty. I’ve got it figured out and will comment and document in a PR. running all unit tests locally now. fingers crossed.

