Separate compilation of model and services code complete

For instructions on how to run it and verify for yourselves, see the comment on this issue:

https://github.com/stan-dev/cmdstan/issues/712

I can’t measure a loss in sampling speed on a model that takes 1s to run 2000 iterations.

It’s 100% backward compatible with existing interfaces, so it won’t break anyone’s code in any repos. RStan and PyStan will need to be updated the same way as CmdStan to take advantage of it.

The underlying pattern is a bit complicated, as it involves a layering of dynamic and static inheritance adapters to maintain backward compatibility and minimize code. But I didn’t have to touch any of the services or algorithms code, and it’s a minimal one-line change to the CmdStan C++ code. (Figuring out how to do it, on the other hand, was much harder.)

The C++ magic’s all documented method-by-method in the stan-dev/stan branch. Here’s the synopsis:

Translation Unit 1

  • stan::model::model_base (abstract base class)
    • untemplated virtual functions for log_prob and write_array (required to call from reference to model_base)
    • templated log_prob() and write_array() signatures matching our originally generated functions (required to allow old code to simply call model_base)
  • stan::model::model_crtp (abstract base class extending model_base)
    • untemplated virtual functions implemented in terms of the templated functions using CRTP (this means no extra work for code gen and no runtime branching on jacobian/propto)
  • stan::model::my_model: user-generated model extends model_crtp<my_model>
    • this is generated model code (same templated log_prob and write_array as before)
    • different superclass
    • added function to construct and return model_base (this gets used by CmdStan rather than the typedef)
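
For anyone who wants to see the shape of the layering, here is a minimal sketch of it. It is not the code on the branch. The namespace crtp_sketch, the virtual names log_prob_jacobian / log_prob_no_jacobian, the helper call_through_base, and the toy density are all made up for illustration, and the real signatures carry more arguments (and a propto flag as well as jacobian).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical, simplified stand-ins for model_base / model_crtp / my_model.
namespace crtp_sketch {

class model_base {
 public:
  virtual ~model_base() = default;

  // Untemplated virtuals: callable through a reference to model_base.
  virtual double log_prob_jacobian(const std::vector<double>& theta) const = 0;
  virtual double log_prob_no_jacobian(const std::vector<double>& theta) const = 0;

  // Templated signature in the style of the old generated code, so existing
  // callers can keep writing model.log_prob<true>(theta) on a model_base.
  // The flag is a compile-time constant, so the choice is resolved statically.
  template <bool jacobian>
  double log_prob(const std::vector<double>& theta) const {
    return jacobian ? log_prob_jacobian(theta) : log_prob_no_jacobian(theta);
  }
};

// CRTP adapter: implements the virtuals once, in terms of M's templates.
template <typename M>
class model_crtp : public model_base {
 public:
  double log_prob_jacobian(const std::vector<double>& theta) const override {
    return static_cast<const M*>(this)->template log_prob<true>(theta);
  }
  double log_prob_no_jacobian(const std::vector<double>& theta) const override {
    return static_cast<const M*>(this)->template log_prob<false>(theta);
  }
};

// Stand-in for generated model code: same templated log_prob as before,
// only the superclass differs.
class my_model : public model_crtp<my_model> {
 public:
  template <bool jacobian>
  double log_prob(const std::vector<double>& theta) const {
    assert(!theta.empty());
    // Toy density: sigma = exp(theta[0]) with a half-normal(0, 1) prior.
    const double sigma = std::exp(theta[0]);
    double lp = -0.5 * sigma * sigma;
    if (jacobian) lp += theta[0];  // log |d sigma / d theta[0]|
    return lp;
  }
};

// Services-style caller: sees only model_base, uses the old call syntax.
inline double call_through_base(const model_base& model,
                                const std::vector<double>& theta) {
  return model.log_prob<true>(theta);
}

}  // namespace crtp_sketch
```

The point of the middle layer is that the generated class never has to write the virtual overrides itself, and the one virtual call per log_prob evaluation lands directly in the fully specialized template.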

Translation Unit 2

  • cmdstan/command.hpp
    • remove template from command function
    • add signature declaration for function to allocate and construct model_base (this gets defined in model translation unit; see above)
    • construct model with this function instead of template type
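
The split between the two translation units can be sketched roughly like this, squeezed into one file so it compiles on its own. Everything here is an assumption for illustration: the namespace factory_sketch, the new_model name, its single scale argument, and the unique_ptr return type are not taken from the branch, where the function takes the data and other arguments.

```cpp
#include <cassert>
#include <memory>
#include <vector>

namespace factory_sketch {

// ---- shared header: the only model type the command code sees ----
class model_base {
 public:
  virtual ~model_base() = default;
  virtual double log_prob(const std::vector<double>& theta) const = 0;
};

// ---- "translation unit 2": command code ----
// Declaration only; the definition is supplied by the model's object file
// at link time, so this code never needs recompiling for a new model.
std::unique_ptr<model_base> new_model(double scale);

// Previously a function template over the model type; now untemplated.
inline double command(double scale, const std::vector<double>& theta) {
  std::unique_ptr<model_base> model = new_model(scale);
  assert(model != nullptr);
  return model->log_prob(theta);
}

// ---- "translation unit 1": generated model code ----
class my_model : public model_base {
 public:
  explicit my_model(double scale) : scale_(scale) {}

  // Toy log density: independent normal(0, scale) terms, up to a constant.
  double log_prob(const std::vector<double>& theta) const override {
    double lp = 0.0;
    for (double t : theta) lp -= 0.5 * (t / scale_) * (t / scale_);
    return lp;
  }

 private:
  double scale_;
};

// The added function: constructs the concrete model, returns it as the base.
std::unique_ptr<model_base> new_model(double scale) {
  return std::unique_ptr<model_base>(new my_model(scale));
}

}  // namespace factory_sketch
```

In a real build the two halves would live in separate .cpp files, with only the model half recompiled when the Stan program changes.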

The extra good news is that adding higher-order log_prob increases compilation time only about 5%.

I still need to add more doc and tests before creating PRs. Any suggestions for testing would be appreciated.

Mitzi’s going to tweak the makefiles.

I’ve completely documented the Stan bits of this and updated the C++ code generator to generate the right code.

Wow, this is epic! How small are models compiled with -g0 (no debug info)?

Tweaked the CmdStan makefiles, specifically cmdstan/makefile and cmdstan/make/program.

@ariddell do you think this will work out-of-the-box for PyStan?

Not out of the box. It’ll take some work. PyStan currently uses only the Stan C++ headers, as this simplifies just about everything.

Unfortunately, Python’s distutils / setuptools don’t support building and installing C libraries. There’s numpy’s install_clib, but it’s poorly documented. I believe we used that at one point for something.

In short, it’s possible. It will take some distutils/setuptools magic, though.

I think this is the same problem as we have with cvodes.

Yes, same issue—building and linking multiple translation units.

The parser/code generator does this, but it’s all encapsulated in a single executable that doesn’t need to change on the fly.

I benchmarked a more expensive model with the sampling iterations bumped up to get a more stable estimate of the overall performance.

On develop,

> time make build

real	3m33.639s
user	3m17.187s
sys	0m13.486s


> time make CC=clang++ -j4 O=3 linear_regression_finnish_horseshoe

real	0m26.621s
user	0m26.468s
sys	0m0.824s

> time ./linear_regression_finnish_horseshoe sample num_samples=25000 data file=linear_regression.data.R random seed=48383884

real	15m32.367s
user	15m23.334s
sys	0m7.174s

on the feature/0712-model-base-class branch,

> time make build

real	3m49.410s
user	3m34.290s
sys	0m12.688s

> time make CC=clang++ -j4 O=3 linear_regression_finnish_horseshoe

real	0m15.038s
user	0m14.992s
sys	0m0.770s

> time ./linear_regression_finnish_horseshoe sample num_samples=25000 data file=linear_regression.data.R random seed=48383884

real	15m11.413s
user	15m3.867s
sys	0m6.328s

make build is maybe marginally slower, but the model execution time exhibits no significant difference. That said, the speedup in model compilation is less than 50% (26.6s down to 15.0s).

They’re not huge. bernoulli.o, the very simplest model, is under 500K compiled at O=3. But that isn’t with -g0. Size will vary by model based on how much of the math library needs to be linked in.

@mitzimorris will know more as she’s dealing with the makefiles for cmdstan.

I get 1.7 MB at -O3 under both develop and the feature branch.

I hope the problem is with my incomplete understanding of the makefile rules, because I’m only seeing a 50% speedup in compile time as well.

Good news: it was my incomplete understanding of our makefiles. For clang compilers, using the precompiled model headers (stan/model/model_headers.hpp.gch) brings the compile time for the bernoulli model down from 23 seconds to 7 seconds.
A slightly larger model (bym2.stan) goes from 28 seconds to 9 seconds.

With that change, after a git pull -ff on both cmdstan and stan (it would be great to have the submodules synced), the compile time goes down a little more, to

> time make CC=clang++ -j4 O=3 linear_regression_finnish_horseshoe

real	0m11.237s
user	0m10.531s
sys	0m0.502s

which is about 42% of the develop compile time, relative to

> time make CC=clang++ -j4 O=3 linear_regression_finnish_horseshoe

real	0m26.621s
user	0m26.468s
sys	0m0.824s

on develop.

I didn’t test run time.

On my machine, the current timing is:

1a) 7s: build model using makefile starting from .stan file assuming everything else is already built
1b) 6s: same as (1a) but calling commands directly rather than involving make dependencies—will need more help on make to remove this 10% overhead if possible.

2a) 26s: current system build using makefile
2b) 24s: same as (2a) but calling commands directly rather than invoking makefile

I don’t know where I got the 37s number from—maybe I wasn’t invoking the precompiled header properly.

Anyway, this one’s ready to review now.

I think I’ll be able to figure this out. We already have to link against the StanHeaders shared object to use SUNDIALS and we already have a base class in RStan, so it is probably just a matter of redoing some of the C++ patterns.

That’s exactly what it looked like in CmdStan. @mitzimorris is the one to ask for make help.

OK, we’ll talk about it tomorrow.