Stan 2.17 running slower on a model than Stan 2.15


#1

To test Cmdstan 2.17, I tried it on a model I had on hand and found that it seemed to take about three times longer than it did in Rstan 2.16.2. To see if it was an issue with RStan or with the different versions, I ran the model again with Cmdstan 2.15. (I couldn’t get Cmdstan 2.16.) Again, Cmdstan 2.17 took about three times longer than Cmdstan 2.15. The compiler used was the same for all runs: g++ 6.3.1. Compile flags were left at their defaults.

Here’s the model that I tried:

data {
  real Tmelt;
  real Troom;

  int<lower=1> numCurves;
  int<lower=0> plasticCurveSizes[numCurves];
  
  vector[numCurves] epsilonDot;
  vector[numCurves] T;

  vector[sum(plasticCurveSizes)] epsilonPlastic;
  vector[sum(plasticCurveSizes)] sigmaPlastic;
}

transformed data {
  vector[numCurves] Tstar = (T - Troom)/(Tmelt - Troom);
  vector[numCurves] logEdot = log(epsilonDot);
  
  real maxC = positive_infinity(); // Default value

  {
    real minLogEdot = min(logEdot);

    if (minLogEdot < 0.0) {
      maxC = -1/minLogEdot;
    }    
  }

  print("maxC = ", maxC);
  
}

parameters {
  real<lower=0.0> A;
  real<lower=0.0> beta_JC;
  real<lower=0.0, upper=1.0> n;
  real<lower=0.0, upper=maxC> C;
  real<lower=0.0> m;

  real<lower=0.0> SD_curve[numCurves];

  real<lower=0.0> alpha_JC[numCurves];
  real<lower=0.0> SD_alpha;
}

transformed parameters {
  real<lower=0.0> B = beta_JC*A;
}

model {

  int startInd = 1;

  // Priors
  // Using 1/3 of the mean as a guess for the standard deviation so that
  // where the abscissa is zero, the prior distribution is close to zero.
  //
  m ~ normal(1.0, 1.0/3)T[0.0,];

  for (curveInd in 1:numCurves) {

    int endInd = startInd + plasticCurveSizes[curveInd] - 1;

    alpha_JC[curveInd] ~ normal(A*(1.0 + C*logEdot[curveInd])*(1.0 - (Tstar[curveInd])^m), SD_alpha)T[0.0,];
       
    for (i in startInd:endInd) {
      sigmaPlastic[i] ~ normal(alpha_JC[curveInd]*(1.0 + beta_JC*(epsilonPlastic[i])^n), SD_curve[curveInd]);
    }

    startInd = endInd + 1;
  }
  
}

I do not vouch for the quality of this model. It was my first crack at a multiscale model with Johnson-Cook, and there are definitely a couple things about it that I would consider “evil” (like using the same beta_JC and n for all curves). It would also not surprise me if it has some computational evils as well.


Reflection on the v2.17.0 stan-dev/math performance bug
#2

Are you sure you’re compiling with -O3 in all cases? There is config outside Stan but inside R that can affect this (in a file called Makevars, I think).

How are you measuring time here? Did you do multiple runs from multiple seeds and find them all slower per n_eff?


#3

RStan, I think, was -O2, but both Cmdstans appeared to be -O3, except for some files in stanc compiled at -O0. Note that RStan and Cmdstan 2.15 ran at a similar speed. I ran 4 chains in parallel and started with the same seed (839520129) for all the runs. The script for launching Cmdstan 2.17 looks like this:

#!/bin/bash

for i in {1..4}
do
    ../jc_multilevelTest.cmdstan2.17 sample id=$i random seed=839520129 data file="fakeData.Rdump" output file=test$i.csv.cmdstan2.17 &
done

wait

~/Desktop/UQ/cmdstan-2.17.0/bin/stansummary test?.csv.cmdstan2.17

Time was measured simply by looking at elapsed time, but it was not even close.

For Cmdstan 2.15:
Warmup took (44, 38, 40, 42) seconds, 2.7 minutes total
Sampling took (32, 24, 30, 22) seconds, 1.8 minutes total

For Cmdstan 2.17:
Warmup took (211, 205, 217, 255) seconds, 15 minutes total
Sampling took (150, 160, 168, 150) seconds, 10 minutes total

I did notice than in spite of the same seed, Cmdstan 2.15 and Cmdstan 2.17 did not generate identical chains. Here are the summary outputs:

For Cmdstan 2.15:

                    Mean     MCSE   StdDev        5%       50%       95%  N_Eff  N_Eff/s    R_hat
lp__            -7.9e+02  1.1e-01  3.7e+00  -8.0e+02  -7.9e+02  -7.9e+02   1181       11  1.0e+00
accept_stat__    9.4e-01  1.4e-03  8.9e-02   7.5e-01   9.7e-01   1.0e+00   4000       37  1.0e+00
stepsize__       1.1e-02  8.1e-04  1.1e-03   1.0e-02   1.2e-02   1.3e-02    2.0    0.018  4.3e+13
treedepth__      8.1e+00  1.2e-01  4.4e-01   7.0e+00   8.0e+00   9.0e+00     14     0.13  1.1e+00
n_leapfrog__     3.2e+02  4.2e+01  1.2e+02   2.6e+02   2.6e+02   5.1e+02    8.2    0.076  1.1e+00
divergent__      0.0e+00  0.0e+00  0.0e+00   0.0e+00   0.0e+00   0.0e+00   4000       37     -nan
energy__         8.0e+02  1.5e-01  5.0e+00   8.0e+02   8.0e+02   8.1e+02   1137       10  1.0e+00
A                1.8e+03  2.1e-01  9.1e+00   1.8e+03   1.8e+03   1.8e+03   1919       18  1.0e+00
beta_JC          9.2e-01  2.3e-05  1.4e-03   9.2e-01   9.2e-01   9.2e-01   3438       32  1.0e+00
n                7.5e-01  1.9e-05  1.2e-03   7.5e-01   7.5e-01   7.6e-01   3566       33  1.0e+00
C                3.6e-03  1.7e-05  7.0e-04   2.5e-03   3.6e-03   4.7e-03   1612       15  1.0e+00
m                8.4e-01  4.1e-04  1.5e-02   8.1e-01   8.4e-01   8.6e-01   1360       13  1.0e+00
SD_curve[1]      1.7e+01  3.3e-02  1.8e+00   1.4e+01   1.7e+01   2.0e+01   2905       27  1.0e+00
SD_curve[2]      1.3e+01  2.7e-02  1.5e+00   1.1e+01   1.3e+01   1.6e+01   3013       28  1.0e+00
SD_curve[3]      1.0e+01  2.1e-02  1.2e+00   8.7e+00   1.0e+01   1.2e+01   3072       28  1.0e+00
SD_curve[4]      1.0e+01  2.1e-02  1.2e+00   8.3e+00   9.9e+00   1.2e+01   3218       30  1.0e+00
SD_curve[5]      1.3e+01  2.4e-02  1.4e+00   1.1e+01   1.3e+01   1.5e+01   3303       30  1.0e+00
SD_curve[6]      1.8e+01  3.4e-02  1.9e+00   1.5e+01   1.8e+01   2.1e+01   3242       30  1.0e+00
SD_curve[7]      1.2e+00  2.5e-03  1.4e-01   1.0e+00   1.2e+00   1.5e+00   2997       28  1.0e+00
SD_curve[8]      5.1e-01  1.1e-03  5.8e-02   4.2e-01   5.0e-01   6.1e-01   2838       26  1.0e+00
alpha_JC[1]      1.9e+03  3.3e-02  2.1e+00   1.9e+03   1.9e+03   1.9e+03   4000       37  1.0e+00
alpha_JC[2]      1.9e+03  2.8e-02  1.8e+00   1.9e+03   1.9e+03   1.9e+03   4000       37  1.0e+00
alpha_JC[3]      1.6e+03  2.2e-02  1.4e+00   1.6e+03   1.6e+03   1.6e+03   4000       37  1.0e+00
alpha_JC[4]      1.3e+03  2.2e-02  1.4e+00   1.3e+03   1.3e+03   1.3e+03   4000       37  1.0e+00
alpha_JC[5]      1.0e+03  2.5e-02  1.6e+00   1.0e+03   1.0e+03   1.0e+03   4000       37  1.0e+00
alpha_JC[6]      8.0e+02  3.7e-02  2.3e+00   7.9e+02   8.0e+02   8.0e+02   4000       37  1.0e+00
alpha_JC[7]      1.8e+03  5.9e-03  3.6e-01   1.8e+03   1.8e+03   1.8e+03   3790       35  1.0e+00
alpha_JC[8]      1.8e+03  5.3e-03  3.3e-01   1.8e+03   1.8e+03   1.8e+03   3833       35  1.0e+00
SD_alpha         1.6e+01  2.7e-01  7.7e+00   8.1e+00   1.4e+01   3.0e+01    810      7.5  1.0e+00
B                1.7e+03  2.0e-01  8.8e+00   1.7e+03   1.7e+03   1.7e+03   1976       18  1.0e+00

For Cmdstan 2.17:

                    Mean     MCSE   StdDev        5%       50%       95%  N_Eff  N_Eff/s    R_hat
lp__            -7.9e+02  1.1e-01  3.7e+00  -8.0e+02  -7.9e+02  -7.9e+02   1180  1.9e+00  1.0e+00
accept_stat__    9.5e-01  1.2e-03  7.4e-02   7.9e-01   9.8e-01   1.0e+00   4000  6.4e+00  1.0e+00
stepsize__       1.1e-02  1.4e-04  1.9e-04   1.0e-02   1.1e-02   1.1e-02    2.0  3.2e-03  9.0e+12
treedepth__      8.2e+00  1.3e-02  4.8e-01   8.0e+00   8.0e+00   9.0e+00   1286  2.1e+00  1.0e+00
n_leapfrog__     3.6e+02  3.3e+00  1.3e+02   2.6e+02   2.6e+02   5.1e+02   1624  2.6e+00  1.0e+00
divergent__      0.0e+00  0.0e+00  0.0e+00   0.0e+00   0.0e+00   0.0e+00   4000  6.4e+00     -nan
energy__         8.0e+02  1.5e-01  4.9e+00   8.0e+02   8.0e+02   8.1e+02   1137  1.8e+00  1.0e+00
A                1.8e+03  1.9e-01  8.9e+00   1.8e+03   1.8e+03   1.8e+03   2209  3.5e+00  1.0e+00
beta_JC          9.2e-01  2.4e-05  1.3e-03   9.2e-01   9.2e-01   9.2e-01   3063  4.9e+00  1.0e+00
n                7.5e-01  2.1e-05  1.1e-03   7.5e-01   7.5e-01   7.6e-01   3050  4.9e+00  1.0e+00
C                3.6e-03  1.9e-05  7.0e-04   2.6e-03   3.7e-03   4.7e-03   1336  2.1e+00  1.0e+00
m                8.4e-01  4.1e-04  1.5e-02   8.1e-01   8.4e-01   8.6e-01   1442  2.3e+00  1.0e+00
SD_curve[1]      1.7e+01  3.1e-02  1.7e+00   1.4e+01   1.7e+01   2.0e+01   3119  5.0e+00  1.0e+00
SD_curve[2]      1.3e+01  3.0e-02  1.5e+00   1.1e+01   1.3e+01   1.6e+01   2468  3.9e+00  1.0e+00
SD_curve[3]      1.0e+01  2.2e-02  1.2e+00   8.7e+00   1.0e+01   1.3e+01   2782  4.4e+00  1.0e+00
SD_curve[4]      1.0e+01  2.2e-02  1.2e+00   8.3e+00   9.9e+00   1.2e+01   2907  4.6e+00  1.0e+00
SD_curve[5]      1.3e+01  2.6e-02  1.4e+00   1.1e+01   1.3e+01   1.5e+01   2683  4.3e+00  1.0e+00
SD_curve[6]      1.8e+01  3.6e-02  2.0e+00   1.5e+01   1.8e+01   2.1e+01   3039  4.8e+00  1.0e+00
SD_curve[7]      1.2e+00  2.3e-03  1.3e-01   1.0e+00   1.2e+00   1.5e+00   3395  5.4e+00  1.0e+00
SD_curve[8]      5.0e-01  1.0e-03  6.0e-02   4.2e-01   5.0e-01   6.1e-01   3343  5.3e+00  1.0e+00
alpha_JC[1]      1.9e+03  3.3e-02  2.1e+00   1.9e+03   1.9e+03   1.9e+03   4000  6.4e+00  1.0e+00
alpha_JC[2]      1.9e+03  2.8e-02  1.8e+00   1.9e+03   1.9e+03   1.9e+03   4000  6.4e+00  1.0e+00
alpha_JC[3]      1.6e+03  2.2e-02  1.4e+00   1.6e+03   1.6e+03   1.6e+03   4000  6.4e+00  1.0e+00
alpha_JC[4]      1.3e+03  2.2e-02  1.4e+00   1.3e+03   1.3e+03   1.3e+03   4000  6.4e+00  1.0e+00
alpha_JC[5]      1.0e+03  2.6e-02  1.7e+00   1.0e+03   1.0e+03   1.0e+03   4000  6.4e+00  1.0e+00
alpha_JC[6]      8.0e+02  3.6e-02  2.3e+00   7.9e+02   8.0e+02   8.0e+02   4000  6.4e+00  1.0e+00
alpha_JC[7]      1.8e+03  6.0e-03  3.6e-01   1.8e+03   1.8e+03   1.8e+03   3556  5.7e+00  1.0e+00
alpha_JC[8]      1.8e+03  5.5e-03  3.3e-01   1.8e+03   1.8e+03   1.8e+03   3458  5.5e+00  1.0e+00
SD_alpha         1.6e+01  2.5e-01  8.0e+00   8.2e+00   1.4e+01   3.0e+01    989  1.6e+00  1.0e+00
B                1.7e+03  1.7e-01  8.6e+00   1.7e+03   1.7e+03   1.7e+03   2403  3.8e+00  1.0e+00

So far as I can see so far, the effective sample sizes are roughly similar, except for n_leapfrog__, where it’s much larger for Cmdstan 2.17.

ETA: Looks like the effective sample size for treedepth__ is also much larger for Cmdstan 2.17.


#4

If you can put together a fully reproducible example with CmdStan it would be very helpful. I’m sure we can go hunting too but it’s silly to try to reproduce something you already see.


#5

Thanks. That indeed looks worrying. As @sakrejda said, it’d help if you could share the data and Stan program.

The tree depths look roughly similar, so it looks like a problem with either the sampler or something in the math library. I don’t know what would’ve changed so drastically.

First step is to look at the .hpp getting generated for the Stan program. If that’s the same, then we’ll know the problem’s not in the language or code generator. That would leave something in the math library or something in the algorithms or even I/O.


#6

Ok. I’ve just e-mailed you the .hpp files and the data that I used with the Stan program.


#7

Has there been any progress on figuring out what’s been going on? Is there anything more I could do to help, perhaps profiling or something like that?


#8

Can the data be simulated somehow and you may share that R program… or is the data somewhere accessible?

I am not sure if I am able to nail down the problem, but I can offer to verify the problem (and do an educated guess). Stan should become faster with release not slower as a general rule of thumb.


#9

Bob_Carpenter should have the data already, which is simulated data to begin with.


#10

Everybody sends Bob things directly and Bob regularly says he doesn’t have time to go through everything he gets. This is why we have a forum so we don’t have to think about whether we are handling somebody else’s PI/IP when they ask for help and so that the person with the time and relevant expertise picks up the question. So long story short, post a reproducible example and a half dozen of us are set up to profile Stan and know how to do it.


#11

I can also verify. CmdStan 2.17 took about 1 hour while RStan 2.16 took 5 minutes.

I am unsure what was going on, and may need to go back and double check to make sure my data was exactly the same (ie, data coding error could lead to much poorer fit), etc. I did check that the tuning parameters were the same.

This was more than a month ago, and I had noticed that since then CmdStan from the downloads page reverted* to 2.16. Now it is 2.17 again.
Installing on a new computer, and a little worried about that runtime difference!

*I might’ve pulled it from Github and forgotten, but probably not.


#12

I checked out the 2.15 and 2.17 releases and compiled @jjramsey’s model using both. the code differs only in that:

doesn’t look like a problem with the language or the generator.

diff -Bbw cmdstan-2.1[57].0/jjramsey-model.hpp
1c1
< // Code generated by Stan version 2.15.0
---
> // Code generated by Stan version 2.17.0
21a22,28
> stan::io::program_reader prog_reader__() {
>     stan::io::program_reader reader;
>     reader.add_event(0, 0, "start", "jjramsey-model.stan");
>     reader.add_event(73, 73, "end", "jjramsey-model.stan");
>     return reader;
> }
> 
39,41c46
<         typedef boost::ecuyer1988 rng_t;
<         rng_t base_rng(0);  // 0 seed default
<         ctor_body(context__, base_rng, pstream__);
---
>         ctor_body(context__, 0, pstream__);
44d48
<     template <class RNG>
46c50
<         RNG& base_rng__,
---
>         unsigned int random_seed__,
49c53
<         ctor_body(context__, base_rng__, pstream__);
---
>         ctor_body(context__, random_seed__, pstream__);
52d55
<     template <class RNG>
54c57
<                    RNG& base_rng__,
---
>                    unsigned int random_seed__,
55a59,62
>         boost::ecuyer1988 base_rng__ =
>           stan::services::util::create_rng(random_seed__, 0);
>         (void) base_rng__;  // suppress unused var warning
> 
67a75,76
>         try {
>             current_statement_begin__ = 2;
72a82
>             current_statement_begin__ = 3;
77a88
>             current_statement_begin__ = 5;
82a94
>             current_statement_begin__ = 6;
92a105
>             current_statement_begin__ = 8;
102a116
>             current_statement_begin__ = 9;
112a127
>             current_statement_begin__ = 11;
122a138
>             current_statement_begin__ = 12;
134a151,153
>             current_statement_begin__ = 2;
>             current_statement_begin__ = 3;
>             current_statement_begin__ = 5;
135a155
>             current_statement_begin__ = 6;
138a159,162
>             current_statement_begin__ = 8;
>             current_statement_begin__ = 9;
>             current_statement_begin__ = 11;
>             current_statement_begin__ = 12;
139a164
>             current_statement_begin__ = 16;
143a169
>             current_statement_begin__ = 17;
147a174
>             current_statement_begin__ = 19;
152d178
<         try {
153a180
>             current_statement_begin__ = 22;
175,180d201
<             current_statement_begin__ = 29;
<         } catch (const std::exception& e) {
<             stan::lang::rethrow_located(e,current_statement_begin__);
<             // Next line prevents compiler griping about no return
<             throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
<         }
182a204,206
>             current_statement_begin__ = 16;
>             current_statement_begin__ = 17;
>             current_statement_begin__ = 19;
186a211
>             current_statement_begin__ = 34;
187a213
>             current_statement_begin__ = 35;
188a215
>             current_statement_begin__ = 36;
189a217
>             current_statement_begin__ = 37;
190a219
>             current_statement_begin__ = 38;
191a221
>             current_statement_begin__ = 40;
193a224
>             current_statement_begin__ = 42;
195a227
>             current_statement_begin__ = 43;
196a229,233
>         } catch (const std::exception& e) {
>             stan::lang::rethrow_located(e, current_statement_begin__, prog_reader__());
>             // Next line prevents compiler griping about no return
>             throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
>         }
217d253
<         // generate_declaration A
231d266
<         // generate_declaration beta_JC
245d279
<         // generate_declaration n
259d292
<         // generate_declaration C
273d305
<         // generate_declaration m
288d319
<         // generate_declaration SD_curve
305d335
<         // generate_declaration alpha_JC
321d350
<         // generate_declaration SD_alpha
356a386
>         try {
423a454
>             current_statement_begin__ = 47;
432,437d462
<         try {
<         } catch (const std::exception& e) {
<             stan::lang::rethrow_located(e,current_statement_begin__);
<             // Next line prevents compiler griping about no return
<             throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
<         }
447a473
>             current_statement_begin__ = 47;
451d476
<         try {
452a478
>             current_statement_begin__ = 52;
466a493
>                 current_statement_begin__ = 62;
490c518
<             stan::lang::rethrow_located(e,current_statement_begin__);
---
>             stan::lang::rethrow_located(e, current_statement_begin__, prog_reader__());
601a630,631
>         try {
>             current_statement_begin__ = 47;
610,615d639
<         try {
<         } catch (const std::exception& e) {
<             stan::lang::rethrow_located(e,current_statement_begin__);
<             // Next line prevents compiler griping about no return
<             throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
<         }
617a642
>             current_statement_begin__ = 47;
627,632d651
<         try {
<         } catch (const std::exception& e) {
<             stan::lang::rethrow_located(e,current_statement_begin__);
<             // Next line prevents compiler griping about no return
<             throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
<         }
636a656,660
>         } catch (const std::exception& e) {
>             stan::lang::rethrow_located(e, current_statement_begin__, prog_reader__());
>             // Next line prevents compiler griping about no return
>             throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
>         }

#13

major-ish things that changed in the math library between 2.16 and 2.17:

  1. Boost 1.62 -> 1.64
  2. makefiles refactored
  3. char* -> std::string& (mostly in error checking functions?)

(I used git diff -w --stat v2.16.0 v2.17.0)


#14

Also, can someone post the simulated data if they have it? Would like an excuse to try out git bisect. Thanks!


#15

FWIW - looking at the model - I believe we tell folks to use function inv(x) instead of 1/x - which would apply:

maxC = -1/minLogEdot;

m ~ normal(1.0, 1.0/3)T[0.0,];

but I don’t think that applies to slowdown.


#16

I indeed dropped the ball on this. When I get things personally, they typically go into my mail queue to die if they don’t require immediate action.

May I share what you sent me here?


#17
  1. Boost could be a problem in some specific function calls or in RNGs if they changed dramatically. Given all the incompatible changes, I don’t know how easy it’d be to run with each.

  2. Make seems very unlikely to be the source of the problem.

  3. If we’re passing char* literals to functions that require const std::string& arguments, that could easily be the source of the problem as it requires a constructor call, with memory allocation and copy for each invocation. I think what we want to do instead of

void bar(const std::string& x) { ... }

const char* foo = "abc";
bar(foo);

is use

const std::string foo("abc");
bar(foo);

How extensive is this?


#18

Also, what happened to performance regressions between releases? Is that what @seantalts said was broken? If so, fixing that is a top priority so that we don’t have this kind of embarassing regression again.


#19

I think the string stuff is confined to this PR: https://github.com/stan-dev/math/pull/584 We both thought it looked fine at the time.

The performance test is just one model and I don’t think it showed a slowdown, but we delete builds older than 20 days old. I upped that to 200 just now.


#20

I ran the performance tests locally and it seemed to take 213.555s on 2.17.0 vs 176.096s total on 2.16.0. It runs a logistic regression model 100 times, so this means it increased from about 1.76s average to 2.13s per run. I think I remember looking at a change of roughly this magnitude with @syclik and I think we ended up saying it was from the commit where all of the random numbers changed, and so we wrote it off as noise. It also doesn’t show anything close to an order of magnitude slow-down.