October 2.25 release?

@wds15 is the data for that model you posted a subslice of the actual size of the data you're working with? The 8% slowdown I'm seeing amounts to about 15 seconds, but if that difference scales up to larger N data or more parameters, then yes, that's definitely bad.

Also, the flip side to turning those on by default is that Stan models will take longer to compile. I think making it very clear that we have a set of hand-checked compiler optimizations users can flip on with a flag for more speed is pretty reasonable.

The data is not a large example, but since this type of data is used in adaptive trials, it grows over time.

Here is the make/local:

CXX=clang++
CC=clang
STAN_THREADS=true
CXXFLAGS+=-march=native -mtune=native -ftemplate-depth-256
CXXFLAGS+=-DBOOST_MATH_PROMOTE_DOUBLE_POLICY=false
CXXFLAGS+=-Wno-unused-variable -Wno-unused-function -Wno-unused-local-typedefs
STAN_COMPILER_OPTIMS=true

What surprises me is that we need the optimisations for this model at all. I thought the only thing slowing things down was set_adjoint_zero, and that function should not be called a massive number of times for a mixture logistic regression model, unless I overlooked some detail, which can easily happen. Since there are so many moving parts, it's good to see that the compiler magic solves it.

We don’t seem to be on the same page here. When I check out the CmdStan repo at tag 2.24.0 and type make build, it downloads stanc nightly, which is the stanc associated with 2.25.

This strikes me as a problem. It seems like the only record of what stanc is associated with a given cmdstan release is in the tarball, buried in an executable. What if someone loses those release tarballs (but still has the git repository)?

The current practice of not specifying the stanc version number explicitly is at odds with how other dependencies are handled. That is, the version numbers for Eigen, Math, Boost, etc. are all specified explicitly somewhere in the cmdstan repository; stanc seems like the odd one out.
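In the meantime, one workaround is to pin stanc by hand, downloading the release binary that matches the cmdstan tag into bin/. The tag and asset names below are assumptions, so check the stanc3 releases page:

# hypothetical sketch: fetch the stanc release matching cmdstan 2.24.0
STANC_VERSION=v2.24.0
curl -L -o bin/stanc \
  https://github.com/stan-dev/stanc3/releases/download/${STANC_VERSION}/linux-stanc
chmod +x bin/stanc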

But you know the date of the release… that should suffice to get the right nightly build, no?

I’d like to stop with the private thread here and make a new 2.25 post. There’s nothing in here that needs to be private, and the performance problems as they pertain to 2.25 definitely need to be public.

A question came up on this PR: https://github.com/stan-dev/cmdstan/pull/938#issuecomment-709373564

This PR pushes the consequences of allowing the user to increase output CSV file precision via the CmdStan output argument sig_figs through to the output CSV file produced by the stansummary command. It's a change after code freeze.

Question for the judges: is this a new feature, or part of a feature / the logical consequence of a feature that's already in the release?
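For context, the user-facing pieces look roughly like this (model and data file names are made up; the stansummary flag name may differ by version, so check bin/stansummary --help):

# write the draws at higher precision via the CmdStan output argument sig_figs
./my_model sample data file=my_data.json \
  output file=output.csv sig_figs=12

# the PR carries that precision through to the CSV written by stansummary
bin/stansummary --csv_filename=summary.csv output.csv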

@wds15 I ran your model a few times and looked at the per gradient timings.

This is what I got:

# Math 3.3 (seconds per gradient)
0.0002214252
0.0002090428
0.0001795337
0.0001969192
0.0001905239
0.0001779333

# Math 3.4 rc1 (seconds per gradient)
0.0002040952
0.0002013071
0.0001896
0.0001847826
0.0001835872
0.0001711862

I was running the sampling in cmdstan and then reading in the output and computing time per gradient using this code:

library(rstan)
fit = read_stan_csv("output.csv")
get_elapsed_time(fit)[1, "sample"] / sum(get_num_leapfrog_per_iteration(fit))

So it looks very similar now to before – no obvious 10% drop-off in performance for me. Can you do these benchmarks on your computer?

I also noticed whatever I was running was not taking 180 seconds. More like 30 seconds.

Edit: For clarity, what I did was check out a develop version of cmdstan and then go into the Math folder and check out the appropriate Math library.
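Roughly, that looks like this (the tag names here are assumptions; use whatever tags correspond to Math 3.3 and the 3.4 release candidate):

git clone --recursive https://github.com/stan-dev/cmdstan.git
cd cmdstan
git checkout develop
git submodule update --init --recursive
# the Math library is a nested submodule under the Stan submodule
cd stan/lib/stan_math
git checkout v3.3.0        # or the Math 3.4 release-candidate tag
cd ../../..
make clean-all
make build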

@wds15 did you get a chance to check your model performance? I think the comparison to run is 2.25 with Math 3.4 vs. 2.25 with Math 3.3.

Ok, so here is the timing for a mixed setup:

> ## 2.24.1
> ref_time
   user  system elapsed 
 28.885   0.178  28.975 
> ## 2.25.0rc1
> new_time
   user  system elapsed 
 31.234   0.137  31.721 
> ## 2.25.0rc1 with old math
> new_time2
   user  system elapsed 
 28.875   0.137  29.103 
> 

The problem looks to be in the new Stan Math changes. For the above I used this make/local:

CXX=clang++
CC=clang
STAN_THREADS=true
CXXFLAGS+=-march=native -mtune=native -ftemplate-depth-256
CXXFLAGS+=-DBOOST_MATH_PROMOTE_DOUBLE_POLICY=false
CXXFLAGS+=-Wno-unused-variable -Wno-unused-function -Wno-unused-local-typedefs
#CXXFLAGS+=-flto
#STAN_COMPILER_OPTIMS=true

How many times did you run it? I was getting a few seconds of variation up or down on my computer.

I am running just a single time, but for 10k iterations. That gives me stable timings. Here is another run:

> ## 2.24.1
> ref_time
   user  system elapsed 
 28.781   0.143  28.815 
> ## 2.25.0rc1
> new_time
   user  system elapsed 
 32.291   0.217  32.524 
> ## 2.25.0rc1 with old math
> new_time2
   user  system elapsed 
 28.841   0.127  28.842 
> 

Is 2.24 vs 2.25rc1 changing cmdstan + stan + math or just math?

My 3.3 math just gave me 29.8s and my 3.4rc1 math gave me 26.9s so we’re seeing different things. I can try fully downgrading to 2.24 and also turning on optimizations (I’m running defaults now).

Are these re-runs with seeds held constant? My 3.3 math results are different than my 3.4 math results even with the seed held constant.

Can you generate the per-gradient numbers?

library(rstan)
fit = read_stan_csv("output.csv")
get_elapsed_time(fit)[1, "sample"] / sum(get_num_leapfrog_per_iteration(fit))

These should be less sensitive to differences in seeds/whatnot.

These are reruns with the same seed and the same init=0 statements.

I mixed cmdstan 2.25.0rc1, stan 2.25.0rc1 and math from 2.24.1.

Let me add the gradient numbers.

EDIT: Here they are:

> get_grad_times(fit_ref)
[1] 0.0001917556
> get_grad_times(fit_new)
[1] 0.0002119939
> get_grad_times(fit_new2)
[1] 0.000192436
> 
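(get_grad_times is not defined anywhere in the thread; presumably it is a small wrapper around the per-gradient snippet posted earlier, roughly:)

library(rstan)

# presumed helper: seconds per leapfrog step for the first chain
get_grad_times <- function(fit) {
  get_elapsed_time(fit)[1, "sample"] / sum(get_num_leapfrog_per_iteration(fit))
}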

Alright so I reran things again with 2.25 cmdstan/stan and Math 3.4 and Math 3.3.

I fit a quick version of the model 100 times in both cases and computed quantiles on the per-gradient timings.

The 20% and 80% quantiles for version 3.3 were 175us and 182us.

The same quantiles for version 3.4 were 178us and 184us.

So it looks a bit slower to me (on the order of 1-2%), but not 10% slower like you're seeing.

This was my benchmarking script:

for i in {1..100}
do
    ./blrm33 sample num_warmup=500 num_samples=500 data file=blrm2.data.R output file=33.$i.csv
    ./blrm34 sample num_warmup=500 num_samples=500 data file=blrm2.data.R output file=34.$i.csv
done

This is the post-processing script:

library(rstan)

for(version in c("33", "34")) {
  timings = c()
  for(i in 1:100) {
    fit = read_stan_csv(paste0(version, ".", i, ".csv"))
    timings = c(timings, get_elapsed_time(fit)[1, "sample"] / sum(get_num_leapfrog_per_iteration(fit)))
  }
  
  print(version)
  print(quantile(timings, c(0.20, 0.80)))
}
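As an extra check (not part of the original script), the relative slowdown can be summarised directly as a ratio of median per-gradient times over the same output files:

library(rstan)

per_gradient <- function(files) {
  sapply(files, function(f) {
    fit <- read_stan_csv(f)
    get_elapsed_time(fit)[1, "sample"] / sum(get_num_leapfrog_per_iteration(fit))
  })
}

t33 <- per_gradient(paste0("33.", 1:100, ".csv"))
t34 <- per_gradient(paste0("34.", 1:100, ".csv"))

# a value of 1.10 here would correspond to the 10% regression under discussion
median(t34) / median(t33)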

Then maybe this is a platform thing?

I ran your code and got:

+   print(quantile(timings, c(0.2 .... [TRUNCATED] 
[1] "33"
         20%          80% 
0.0001914718 0.0002004961 
[1] "34"
         20%          80% 
0.0002043476 0.0002161686 
>

Then I bumped to 5k warmup / 5k sampling iterations and 100 repetitions and got the same again:

> +   print(quantile(timings, c(0.2 .... [TRUNCATED] 
[1] "33"
         20%          80% 
0.0001914718 0.0002004961 
[1] "34"
         20%          80% 
0.0002043476 0.0002161686 

And is the only fix for this the LTO flags, or can we do something else? Is this due to the var_value changes?

Yes. Adding STAN_COMPILER_OPTIMS=true fixes the problem.

Ok, do we just set those on by default then? At least for non-Windows?
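Purely as an illustrative sketch (not how the build files are organised today), a non-Windows default could be expressed with a standard GNU make conditional:

ifeq ($(OS),Windows_NT)
  # keep the hand-checked optimizations opt-in on Windows
else
  # ?= only assigns if the variable is not already set, so an explicit
  # setting (e.g. STAN_COMPILER_OPTIMS=false) can still override this default
  STAN_COMPILER_OPTIMS ?= true
endif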