October 2.25 release?

@wds15 is the data for that model you posted a subslice of the actual size of the data you're working with? The 8% slowdown I'm seeing amounts to about 15 seconds, but if that difference scales up to larger N data or more parameters, then yes, that's definitely bad.

Also, the flip side to turning those on by default is that Stan models will take longer to compile. I think making it very clear that we have a set of hand-checked compiler optimizations users can flip on with a flag for more speed is pretty reasonable.

The data is not a large example, but since this type of data is used in adaptive trials, it grows over time.

Here is the make/local:

CXX=clang++
CC=clang
STAN_THREADS=true
CXXFLAGS+=-march=native -mtune=native -ftemplate-depth-256
CXXFLAGS+=-DBOOST_MATH_PROMOTE_DOUBLE_POLICY=false
CXXFLAGS+=-Wno-unused-variable -Wno-unused-function -Wno-unused-local-typedefs
STAN_COMPILER_OPTIMS=true

What surprises me is that we need the optimisations for this model at all. I thought the only thing slowing things down was set_adjoint_zero, and that function should not be called a massive number of times for a mixture logistic regression model, unless I overlooked some detail, which can easily happen. Since there are so many moving parts, it's good to see that the compiler magic solves it.

We don’t seem to be on the same page here. When I check out the CmdStan repo at tag 2.24.0 and type make build, it downloads stanc nightly, which is the stanc associated with 2.25.

This strikes me as a problem. It seems like the only record of what stanc is associated with a given cmdstan release is in the tarball, buried in an executable. What if someone loses those release tarballs (but still has the git repository)?

The current practice of not specifying the stanc version number explicitly is at odds with how other dependencies are handled. That is, the version numbers for Eigen, Math, Boost, etc. are all specified explicitly somewhere in the cmdstan repository; stanc seems like the odd one out.
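In the meantime, one workaround is to pin stanc by hand, downloading the release binary that matches the cmdstan tag into bin/. The tag and asset names below are assumptions, so check the stanc3 releases page:

# hypothetical sketch: fetch the stanc release matching cmdstan 2.24.0
STANC_VERSION=v2.24.0
curl -L -o bin/stanc \
  https://github.com/stan-dev/stanc3/releases/download/${STANC_VERSION}/linux-stanc
chmod +x bin/stanc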

But you know the date of the release… that should suffice to get the right nightly build, no?

I’d like to stop with the private thread here and make a new 2.25 post. There’s nothing in here that needs to be private, and the performance problems as they pertain to 2.25 definitely need to be public.

A question came up on this PR: https://github.com/stan-dev/cmdstan/pull/938#issuecomment-709373564

This PR pushes the consequences of allowing the user to increase output CSV file precision via the CmdStan output argument sig_figs through to the output CSV file produced by the stansummary command. It's a change after code freeze.

Question for the judges: is this a new feature, or part of a feature / the logical consequence of a feature that's already in the release?
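For context, the user-facing pieces look roughly like this (model and data file names are made up; the stansummary flag name may differ by version, so check bin/stansummary --help):

# write the draws at higher precision via the CmdStan output argument sig_figs
./my_model sample data file=my_data.json \
  output file=output.csv sig_figs=12

# the PR carries that precision through to the CSV written by stansummary
bin/stansummary --csv_filename=summary.csv output.csv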

@wds15 I ran your model a few times and looked at the per gradient timings.

This is what I got:

# Math 3.3 (seconds per gradient)
0.0002214252
0.0002090428
0.0001795337
0.0001969192
0.0001905239
0.0001779333

# Math 3.4 rc1 (seconds per gradient)
0.0002040952
0.0002013071
0.0001896
0.0001847826
0.0001835872
0.0001711862

I was running the sampling in cmdstan and then reading in the output and computing time per gradient using this code:

library(rstan)
fit = read_stan_csv("output.csv")
get_elapsed_time(fit)[1, "sample"] / sum(get_num_leapfrog_per_iteration(fit))

So it looks very similar now to before – no obvious 10% drop-off in performance for me. Can you do these benchmarks on your computer?

I also noticed whatever I was running was not taking 180 seconds. More like 30 seconds.

Edit: For clarity, what I did was check out a develop version of cmdstan and then go into the Math folder and check out the appropriate Math library.
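Roughly, that looks like this (the tag names here are assumptions; use whatever tags correspond to Math 3.3 and the 3.4 release candidate):

git clone --recursive https://github.com/stan-dev/cmdstan.git
cd cmdstan
git checkout develop
git submodule update --init --recursive
# the Math library is a nested submodule under the Stan submodule
cd stan/lib/stan_math
git checkout v3.3.0        # or the Math 3.4 release-candidate tag
cd ../../..
make clean-all
make build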

@wds15 did you get a chance to check your model performance? I think the comparison to run is 2.25 with Math 3.4 vs. 2.25 with Math 3.3.

Ok, so here is the timing for a mixed setup:

> ## 2.24.1
> ref_time
   user  system elapsed 
 28.885   0.178  28.975 
> ## 2.25.0rc1
> new_time
   user  system elapsed 
 31.234   0.137  31.721 
> ## 2.25.0rc1 with old math
> new_time2
   user  system elapsed 
 28.875   0.137  29.103 
> 

The problem looks to be in the new Stan Math changes. For the above I used this make/local:

CXX=clang++
CC=clang
STAN_THREADS=true
CXXFLAGS+=-march=native -mtune=native -ftemplate-depth-256
CXXFLAGS+=-DBOOST_MATH_PROMOTE_DOUBLE_POLICY=false
CXXFLAGS+=-Wno-unused-variable -Wno-unused-function -Wno-unused-local-typedefs
#CXXFLAGS+=-flto
#STAN_COMPILER_OPTIMS=true

How many times did you run it? I was getting a few seconds of variation up or down on my computer.

I am running just a single time, but for 10k iterations. That gives me stable timings. Here is another run:

> ## 2.24.1
> ref_time
   user  system elapsed 
 28.781   0.143  28.815 
> ## 2.25.0rc1
> new_time
   user  system elapsed 
 32.291   0.217  32.524 
> ## 2.25.0rc1 with old math
> new_time2
   user  system elapsed 
 28.841   0.127  28.842 
> 

Is 2.24 vs 2.25rc1 changing cmdstan + stan + math or just math?

My 3.3 math just gave me 29.8s and my 3.4rc1 math gave me 26.9s so we’re seeing different things. I can try fully downgrading to 2.24 and also turning on optimizations (I’m running defaults now).

Are these re-runs with seeds held constant? My 3.3 math results are different than my 3.4 math results even with the seed held constant.

Can you generate the per-gradient numbers?

library(rstan)
fit = read_stan_csv("output.csv")
get_elapsed_time(fit)[1, "sample"] / sum(get_num_leapfrog_per_iteration(fit))

These should be less sensitive to differences in seeds/whatnot.

These are reruns with the same seed and the same init=0 statements.

I mixed cmdstan 2.25.0rc1, stan 2.25.0rc1 and math from 2.24.1.

Let me add the gradient numbers.

EDIT: Here they are:

> get_grad_times(fit_ref)
[1] 0.0001917556
> get_grad_times(fit_new)
[1] 0.0002119939
> get_grad_times(fit_new2)
[1] 0.000192436
> 
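(get_grad_times is not defined anywhere in the thread; presumably it is a small wrapper around the per-gradient snippet posted earlier, roughly:)

library(rstan)

# presumed helper: seconds per leapfrog step for the first chain
get_grad_times <- function(fit) {
  get_elapsed_time(fit)[1, "sample"] / sum(get_num_leapfrog_per_iteration(fit))
}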

Alright so I reran things again with 2.25 cmdstan/stan and Math 3.4 and Math 3.3.

I fit a quick version of the model 100 times in both cases and computed quantiles on the per-gradient timings.

The 20% and 80% quantiles for version 3.3 were 175us and 182us.

The same quantiles for version 3.4 were 178us and 184us.

So it looks a bit slower to me (on the order of 1-2%), but not 10% slower like you're seeing.

This was my benchmarking script:

for i in {1..100}
do
    ./blrm33 sample num_warmup=500 num_samples=500 data file=blrm2.data.R output file=33.$i.csv
    ./blrm34 sample num_warmup=500 num_samples=500 data file=blrm2.data.R output file=34.$i.csv
done

This is the post-processing script:

library(rstan)

for(version in c("33", "34")) {
  timings = c()
  for(i in 1:100) {
    fit = read_stan_csv(paste0(version, ".", i, ".csv"))
    timings = c(timings, get_elapsed_time(fit)[1, "sample"] / sum(get_num_leapfrog_per_iteration(fit)))
  }
  
  print(version)
  print(quantile(timings, c(0.20, 0.80)))
}
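As an extra check (not part of the original script), the relative slowdown can be summarised directly as a ratio of median per-gradient times over the same output files:

library(rstan)

per_gradient <- function(files) {
  sapply(files, function(f) {
    fit <- read_stan_csv(f)
    get_elapsed_time(fit)[1, "sample"] / sum(get_num_leapfrog_per_iteration(fit))
  })
}

t33 <- per_gradient(paste0("33.", 1:100, ".csv"))
t34 <- per_gradient(paste0("34.", 1:100, ".csv"))

# a value of 1.10 here would correspond to the 10% regression under discussion
median(t34) / median(t33)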

Then maybe this is a platform thing?

I ran your code and got:

+   print(quantile(timings, c(0.2 .... [TRUNCATED] 
[1] "33"
         20%          80% 
0.0001914718 0.0002004961 
[1] "34"
         20%          80% 
0.0002043476 0.0002161686 
>

Then I bumped to 5k warmup / 5k sampling iterations and 100 repetitions and got the same again:

> +   print(quantile(timings, c(0.2 .... [TRUNCATED] 
[1] "33"
         20%          80% 
0.0001914718 0.0002004961 
[1] "34"
         20%          80% 
0.0002043476 0.0002161686 

And is the only fix for this the LTO flags, or can we do something else? Is this due to the var_value changes?

Yes. Adding STAN_COMPILER_OPTIMS=true fixes the problem.

Ok, do we just set those on by default then? At least for non-Windows?
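Purely as an illustrative sketch (not how the build files are organised today), a non-Windows default could be expressed with a standard GNU make conditional:

ifeq ($(OS),Windows_NT)
  # keep the hand-checked optimizations opt-in on Windows
else
  # ?= only assigns if the variable is not already set, so an explicit
  # setting (e.g. STAN_COMPILER_OPTIMS=false) can still override this default
  STAN_COMPILER_OPTIMS ?= true
endif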