Compilation time evolution in CmdStan

My workstation was not busy today, so I had it run some performance tests for CmdStan releases from 2.18 on.

The first results I have are for compilation times and we seem to be regressing a bit since CRTP was added in 2.20 (see chart below). Is this just added complexity or something worth exploring?

Here are mean times for the 5 models (sorry for the bad chart - I am bad at plotting).

model\version   2.18.1    2.19.1    2.20.0    2.22.1    2.23.0
arK             25.33497  27.97416  11.0281   15.34745  14.7619
sir             27.0837   29.63843  14.06118  16.58849  17.73287
gp_pois_regr    27.47137  30.45327  15.13727  17.63468  18.74881
irt_2pl         25.71007  28.88344  11.66927  14.67787  16.07967
arma            25.51895  28.08129  11.08476  14.0597   15.342
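
To put a number on the change between releases, here is a small sketch that computes the relative change in mean compile time between two versions. The values are copied from the table above; the helper name `percent_change` is made up for this illustration.

```python
# Mean compile times copied from the table above, one list entry per version.
versions = ["2.18.1", "2.19.1", "2.20.0", "2.22.1", "2.23.0"]
mean_times = {
    "arK":          [25.33497, 27.97416, 11.0281, 15.34745, 14.7619],
    "sir":          [27.0837, 29.63843, 14.06118, 16.58849, 17.73287],
    "gp_pois_regr": [27.47137, 30.45327, 15.13727, 17.63468, 18.74881],
    "irt_2pl":      [25.71007, 28.88344, 11.66927, 14.67787, 16.07967],
    "arma":         [25.51895, 28.08129, 11.08476, 14.0597, 15.342],
}


def percent_change(times, base_version, new_version):
    """Relative change (in percent) of the mean time from base_version to new_version."""
    base = times[versions.index(base_version)]
    new = times[versions.index(new_version)]
    return 100.0 * (new - base) / base


# Positive values mean compilation got slower relative to the base version.
for model, times in mean_times.items():
    change = percent_change(times, "2.20.0", "2.23.0")
    print(f"{model:>12}: {change:+.1f}% from 2.20.0 to 2.23.0")
```

Swapping the base version to "2.19.1" shows the drop in 2.20.0 in the same way.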

Well if we’d gone in the other direction we would have counted it as a big breakthrough.

So I think this is worth investigating.

I got to the bottom of this (or at least a local minimum). If anyone is interested: https://github.com/stan-dev/cmdstan/issues/863

Bottom line: 2.24 should have 20-25% faster compile times for clang and 10-15% faster for g++.

One more potential improvement (a 50% reduction in compile time) for g++: https://github.com/stan-dev/cmdstan/issues/872

Would there be any interest in running on the order of 1e2 or 1e3 compile tests to get some statistics for each release?

Hi @mtwest,

Yes, they should be run after the feature freeze and compared against the previous release to identify any issues, and if compile times have improved we should mention that in the release notes. We should already do this for this release, as there should be a noticeable speedup (20% shorter compile times for clang and 50% shorter compile times for g++ the last time I checked).

Ideally, I envision running a larger suite of performance tests (compiling and running posteriordb models, for example) once a week (probably over the weekend, when the resources are mostly free) and posting visual reports to a predefined issue.

There are some more issues/PRs in the pipeline before the above will be realized, but we are getting there.

I ask because I was getting a feel for how cmdstanpy works and was curious how results for the eight-schools example vary when one varies the sampling parameters. It’s not hard to store the compile time for each attempt.
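
A rough sketch of what storing the compile time for each attempt could look like. The timing helpers are plain Python; `compile_eight_schools` is a hypothetical callable that assumes cmdstanpy is installed, that its `CmdStanModel` constructor accepts `force_compile`, and that a file named `eight_schools.stan` (a made-up path) exists.

```python
import statistics
import time


def time_repeated(compile_fn, repeats):
    """Call compile_fn() `repeats` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        compile_fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Basic statistics over a list of durations."""
    return {
        "n": len(durations),
        "mean": statistics.mean(durations),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min": min(durations),
        "max": max(durations),
    }


def compile_eight_schools():
    # Assumption: cmdstanpy is installed and "eight_schools.stan" is a
    # hypothetical path to the model. force_compile=True is meant to rebuild
    # the executable even if an up-to-date one already exists.
    from cmdstanpy import CmdStanModel

    CmdStanModel(stan_file="eight_schools.stan", force_compile=True)
```

With that in place, something like `summarize(time_repeated(compile_eight_schools, 100))` would give the statistics for one release on one machine. Note that this times the whole call, so it includes the Stan-to-C++ translation as well as the C++ compilation.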

I’ve got ~50 pooled workstations at my disposal, so…

Using cmdstanpy or cmdstanr to test this will definitely be the easiest, especially compared to plain bash.

Good to know :) Will definitely make a note to tag you. Thanks!

For each compiled model, one can record:

  • Total compile time
  • CPU boost/turbo frequencies
  • CPU chip-set name
  • Memory usage
  • OS version
  • C++ compiler version

presuming one is using just a single core for compilation. Presumably the compile time and memory usage will have very narrow variance, but it would be good to check, particularly on different platforms.
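
As a starting point for the static items on that list (OS version, CPU name, C++ compiler version), here is a minimal sketch using only the Python standard library. The function names are made up for illustration; boost frequencies and memory usage are not covered here and would need platform-specific tools.

```python
import platform
import shutil
import subprocess


def compiler_version(compiler="g++"):
    """First line of `<compiler> --version`, or None if the compiler is not on PATH."""
    path = shutil.which(compiler)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


def environment_record(compiler="g++"):
    """Collect the environment details to store alongside each compile time."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        # May be an empty string on some systems (commonly on Linux).
        "processor": platform.processor(),
        "compiler": compiler,
        "compiler_version": compiler_version(compiler),
    }


if __name__ == "__main__":
    for key, value in environment_record("g++").items():
        print(f"{key}: {value}")
```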

There are two steps to compilation. First, the transpilation from Stan to C++, which involves a Stan version, and then the actual compilation of the C++.

Nope. Windows C++ compilers consume way more memory than Linux or Mac OS. We’ve had to take extreme measures on the compilation to deal with that in the past.

Actual memory usage is notoriously difficult to measure in an application, but you’ll know when you run out.

The C++ compiler flags and library versions are just as important as the compiler version.

Timing’s usually broken into system vs. wall time.
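
For what it's worth, here is one way to get wall time, user and system CPU time, and a peak-memory figure for a compile run as a child process. This is only a sketch: it relies on Python's `resource` module, which is Unix-only (so it will not work on Windows), and `measure_command` is a name made up for this example.

```python
import resource
import subprocess
import sys
import time


def measure_command(cmd):
    """Run cmd as a child process and report wall, user and system time."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": completed.returncode,
        "wall_s": wall,
        # CPU time spent in user code and in the kernel by waited-for children.
        "user_s": after.ru_utime - before.ru_utime,
        "system_s": after.ru_stime - before.ru_stime,
        # High-water mark over all children waited for so far, not just this
        # one. Units differ by platform: KiB on Linux, bytes on macOS.
        "max_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Stand-in command; replace with the actual compile command.
    print(measure_command([sys.executable, "-c", "pass"]))
```

Because `max_rss` is a running maximum across children, measuring one compile per fresh Python process gives the cleanest memory number.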

Apologies for not being clear. My thought was that there would be small variance for compile time on the same system (OS, cpu, etc), not that it wouldn’t differ across platforms.

And thank you for reminding me of the complexities of testing memory and timing.

For the same OS, CPU, compiler, and compiler flags I would expect them to be around the same. But swapping compiler flags in and out can also change compile times and performance by quite a lot.

Would it be of value to explore that parameter space? How many dimensions would it be?

Usually the dimensions are a constant and don’t affect compilation time. So whether you have 1 or 1000 dimensions, the variable’s likely to be declared as vector[N] alpha and translate to a simple Eigen declaration.

I meant exploring the parameter space of which compilers and compiler flags affect the compilation and performance for a given model.
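
To make "dimensions" concrete, here is a toy sketch of the kind of grid I have in mind. The particular compilers and flags are illustrative choices on my part, not a recommendation and not a list of what CmdStan uses.

```python
from itertools import product

# Illustrative axes only: each key is one dimension of the search space.
grid = {
    "compiler": ["g++", "clang++"],
    "opt_level": ["-O0", "-O1", "-O2", "-O3"],
    "arch": ["", "-march=native"],
}


def configurations(grid):
    """Yield one dict per point in the Cartesian product of the grid axes."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(configurations(grid))
print(f"{len(grid)} dimensions, {len(configs)} configurations")
```

Even this small grid gives 16 configurations per model, so the number of compile runs grows quickly as axes are added.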

There’s no universal answer here. It’s going to depend on how the Stan program’s coded, the kind of operations it has, and the kind of optimizations the compiler can do. In particular, it depends on how much template metaprogramming is involved in the translation of the program to C++, the amount of code that needs to be optimized, and whether that can exploit low-level CPU features through underlying implementations. And it’s going to change with MPI (different compilers) and GPU, both of which have different performance profiles.

P.S. We like “Stan program” to separate the model as a mathematical object from a particular implementation in Stan.

Thanks Bob for the reply. I am thinking about these things as I try to sketch out a proposal for what configurations should go into a Stan Docker container. I understand there is no optimal solution given the breadth of research the Stan community does, but there ought to be at least some recommended (robust?) settings that most users would be happy with. While individual developers have their preferences, I guess my naive thought was that either through crowd sourcing or experimentation, we could find a broadly acceptable config set.

Naturally for those folks where the standard is #unacceptable, they would be free to make changes.

I’d just go with the default config of the C++ compiler used by the makefiles in CmdStan.