Separating model and services translation units for faster compilation

The goal here is to be able to compile a Stan program in its own translation unit. My preliminary experiments show that takes about 7s on my machine, whereas CmdStan compiling the same model takes about 35s (this is after the precompiled headers are made). In order to achieve this separation, we’re going to need some gymnastics on the inheritance side. There’s an abstract base class with a pure virtual method, then an extension using the curiously recurring template pattern (CRTP) that implements that method by dispatching statically to the generated code. The transpiler then generates a model class that extends the CRTP helper. CmdStan gets compiled in its own translation unit, only knowing there’s a factory somewhere to give it a reference to a base model; the implementation gets linked after CmdStan is compiled.

Linking is fast, so the overall compile time for simple models is going to be something like 7s.

Running example

Rather than showing how to do this with the actual code, I’m going to provide a complete runnable example in three files that’ll show what will go where. The method speak() is going to stand in as a proxy for the more complex log_prob() method in the actual model class. The includes would be part of Stan, the first translation unit would be code generated by the transpiler, and the second translation unit would be part of CmdStan.

The whole thing reads like a C++ type puzzle. For background reading, I’d recommend:

tu_include.hpp

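#include <string>

// Abstract interface: the only type CmdStan needs to see.
struct model_base {
  virtual ~model_base() { }
  virtual std::string speak() const = 0;
};

// CRTP helper: implements the virtual method by dispatching
// statically to the templated method on the derived class T.
template <class T>
struct model_crtp_base : public model_base {
  std::string speak() const {
    return static_cast<T*>(this)->template say_something<false>();
  }
};

// Factory declared here, defined in the generated translation unit.
const model_base& new_model(int n, double y);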

tu1.cpp

#include "tu_include.hpp"

struct foo : public model_crtp_base<const foo> {
  int n_;
  double y_;
  foo(int n, double y) : n_(n), y_(y) { }

  template <bool B>
  std::string say_something() const {
    return B ? "hello" : "goodbye";
  }
};

// Factory definition. The object is never freed in this toy example.
const model_base& new_model(int n, double y) {
  const foo* f = new foo(n, y);
  return *f;
}

tu2.cpp

#include "tu_include.hpp"
#include <iostream>

int main() {
  int n = 3;
  double y = 7;
  const model_base& model = new_model(n, y);
  std::cout << "model says " << model.speak() << std::endl;
}

Et voilà

~/temp2$ clang++ -c tu1.cpp
~/temp2$ clang++ -c tu2.cpp
~/temp2$ clang++ -o speaker tu1.o tu2.o
~/temp2$ ./speaker
model says goodbye

Awesome! Just to be paranoid, can you add to the example some private fields on the generated model class?

Sure. I should’ve done that to begin with, as it’s what killed things last time. I’ll just update the example above to one that works with member fields.

Are there experiments demonstrating what, if any, consequences there are to run time? Does the CRTP avoid any virtual function overhead or have compilers gotten to the point where the overhead is negligible?

I think Bob intends to end up measuring it, but my suspicion is that for any kind of non-trivial model it will be totally negligible: it adds that overhead on the order of once per leapfrog step, and anything complicated has a log_prob that executes pretty slowly in comparison. And it would have to add a ton of time to small models to make up for the 27 seconds in compilation time this will save someone.

Before Stan I had my own HMC C++ code, and moving from virtual functions to a template solution sped things up by over an order of magnitude, even for expensive gradients (and they were all hand written, too, which would rule out any autodiff complications), so I’m always wary of how sensitive things can be to virtuals deep in the code. I’d love to see experiments as early in the process as possible.

The CRTP resolves inheritance without virtual calls. But the pattern I’m suggesting still has a virtual layer underneath it, so there’ll still be a virtual function call for the log density.

Of course. We can’t introduce a noticeable slowdown with this.