It seems like compiling the math library ahead of time is one of the best ways to approach the compile time issue (though perhaps we could have a separate thread brainstorming other ideas for that). I’m also interested in its ability to let us deploy a Stan interpreter, which has loads of interesting benefits I’d love to talk about at some point. So I wanted to ask for help and ideas in how to do this.
My initial thought is that we should use function_signatures.h to generate a text file that we can easily parse and use to generate all of the applicable signatures we are looking for from the Math library. Then we just write those out into a .cpp file and include stan/math.hpp at the top. Does that seem reasonable?
I did some basic experiments and it seems like even with just the cholesky_decompose function including what its .hpp claims to need, the object still takes 24s to compile on my machine, which is like 80% of total compile time for a Stan model. And I think we’d have to do significant untangling to get it to be modular enough such that we could separately compile pieces of the math library.
We might start without the distributions, though I’d also be interested in a version that doesn’t contain both row and column vectors (and generally tries a little to limit the combinatorial explosion especially when adding the specializations doesn’t increase performance).
It kinda has to be everything precompiled, or nothing, right? Like with cholesky, even compiling a lone function ends up taking an obscene amount of time.
Won’t this have to end up looking like all of stan-dev/stan gets precompiled? Cause that includes bits of stan-dev/math. It also probably has its own Eigen/Boost includes and just including that stuff probably counts for a lot of our slow compiles.
Can this be done in a failsafe way? So if a model doesn’t compile against the header that corresponds to the precompiled object, then we can just compile it against source and be okay?
Now to write a script translating those to C++ templated definitions…
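For what it's worth, here is a rough sketch of what such a translation step could look like. Everything about the input format is an assumption: it pretends each line of the dumped text file looks like `matrix cholesky_decompose(matrix)`, handles only the double-valued case, and assumes every argument is taken by const reference. The helper names (`cpp_type`, `trim`, `instantiation_for`) are made up for illustration, and the emitted lines would only compile if the real Math functions have exactly matching signatures, which is likely where the exceptions come in.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical mapping from Stan type names to C++ types (double-only case).
inline std::string cpp_type(const std::string& stan_type) {
  static const std::map<std::string, std::string> table = {
      {"real", "double"},
      {"vector", "Eigen::Matrix<double, -1, 1>"},
      {"row_vector", "Eigen::Matrix<double, 1, -1>"},
      {"matrix", "Eigen::Matrix<double, -1, -1>"}};
  const auto it = table.find(stan_type);
  if (it == table.end())
    throw std::invalid_argument("unknown Stan type: " + stan_type);
  return it->second;
}

// Strip leading and trailing spaces.
inline std::string trim(const std::string& s) {
  const auto b = s.find_first_not_of(' ');
  if (b == std::string::npos) return "";
  const auto e = s.find_last_not_of(' ');
  return s.substr(b, e - b + 1);
}

// Turn one assumed-format line, e.g. "matrix cholesky_decompose(matrix)",
// into a C++ explicit instantiation definition.
inline std::string instantiation_for(const std::string& line) {
  const auto open = line.find('(');
  const auto close = line.rfind(')');
  if (open == std::string::npos || close == std::string::npos || close < open)
    throw std::invalid_argument("malformed signature: " + line);

  const std::string head = trim(line.substr(0, open));
  const auto split = head.rfind(' ');
  if (split == std::string::npos)
    throw std::invalid_argument("missing return type: " + line);
  const std::string ret = trim(head.substr(0, split));
  const std::string name = head.substr(split + 1);
  assert(!name.empty());  // head is trimmed, so something follows the space

  // Comma-separated argument types, each emitted as a const reference.
  std::string args;
  std::stringstream ss(line.substr(open + 1, close - open - 1));
  std::string tok;
  while (std::getline(ss, tok, ',')) {
    if (!args.empty()) args += ", ";
    args += "const " + cpp_type(trim(tok)) + "&";
  }
  return "template " + cpp_type(ret) + " stan::math::" + name + "(" + args + ");";
}
```

Feeding each line of the signature dump through `instantiation_for` and writing the results under an `#include <stan/math.hpp>` would give the .cpp file described above. RNGs and anything with a non-standard pattern would need their own handling.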
That’s a good point. I wonder if we could split it into two parts, distributions and everything else, and have the distributions become (potentially) much more careful about what they are including… People might still be happy with a 30% reduction in compile time or w/e, ha.
stanc is already pre-compiled into an executable in e.g. RStan, so that’s done already :)
That’s an interesting idea - have sort of two modes, one precompiled library with all of the most common instantiations, and then a fallback to normal header-only mode. Certainly worst-case we could write instructions such that on linker failure it tries again the old way.
Also exciting - we only have 27k signatures right now, and removing all the ones taking row vectors reduces that substantially to under 10k. That seems doable pretty easily, and we can convert without performance penalty between row and column vector seamlessly (I believe).
The idea is right, but the implementation wouldn’t be quite right. What you’d want is to include just the template meta programs, and probably have it instantiated. Once upon a time, we used to have it, but now you’d have to grab the first lines out of stan/math/rev/mat.hpp. Then include whatever else that’s included. That will help with the number of files included and should matter most for those with spinny disks. I don’t know if it’ll affect compile times, but an empirical test will tell. We may already have those gains from the precompiled header.
As for compiling something, before getting too far into this, mind checking runtime performance with the library linked in? If we’re going to take a hit with this sort of strategy, it’d be good to know early so we can evaluate whether it’s worth it.
Can you describe the test you’re asking for in more detail? So far the only thing I got compiling was cholesky decompose as an object.
Spent the day working on automatically generating the full math library worth of explicit template instantiations but there are something like 530 exceptions where things don’t work the way everything else does (probably many duplicates tho). Need to make a specific pattern for the RNGs as well. May just do the exceptions by hand and assume all future stuff will follow our typical pattern.
Sure. We just want to evaluate whether building an executable this way is going to impact runtime. Hopefully it doesn’t, but we should check.
I think we want to have two evaluations:
A model with cholesky_decompose built the current way vs built in the proposed way.
A model without cholesky_decompose built in the current way vs built in the proposed way.
I think we run it with the same seed, verify outputs are identical, then that should have exactly the same number of calls for those runs and the comparison is the right one. Run each thing a few times and average because we’ve seen almost a 2x variation for the same process running on a machine.
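A minimal sketch of the "run it a few times and average" part, assuming the thing being timed can be wrapped in a callable. `time_repeated` and `TimingSummary` are hypothetical names, not anything in Stan. For the real comparison the callable would launch each model build with the same seed.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Wall-clock results for repeated runs of the same workload.
struct TimingSummary {
  std::vector<double> seconds;  // one entry per run
  double mean = 0.0;
  double min = 0.0;
  double max = 0.0;
};

// Call run_once() `repeats` times and summarize the elapsed times.
template <typename F>
TimingSummary time_repeated(F&& run_once, std::size_t repeats) {
  assert(repeats > 0);
  TimingSummary out;
  out.seconds.reserve(repeats);
  for (std::size_t i = 0; i < repeats; ++i) {
    const auto start = std::chrono::steady_clock::now();
    run_once();
    const auto stop = std::chrono::steady_clock::now();
    out.seconds.push_back(std::chrono::duration<double>(stop - start).count());
  }
  out.mean = std::accumulate(out.seconds.begin(), out.seconds.end(), 0.0) /
             static_cast<double>(repeats);
  out.min = *std::min_element(out.seconds.begin(), out.seconds.end());
  out.max = *std::max_element(out.seconds.begin(), out.seconds.end());
  return out;
}
```

Reporting min and max next to the mean seems worthwhile given the 2x variation mentioned above. If the spread within one build is larger than the gap between the two builds, the comparison isn't telling us much.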
If there are multiple ways to build / link it, we’d want to see what effect that has. Hopefully this is representative of what’ll happen as the built library gets larger.
Does that make sense? Maybe there’s a better way to evaluate it?
I think that’s a good idea before we switch over the default behavior for Stan, though prebuilt math libraries are still useful for an easy-to-install interpreted stanc even if they lose some slight performance due to lack of inlining into a Stan model (which already is not getting inlined very much from what I have heard).
I figured out how to do this test (using C++11’s extern template feature), but need a model using cholesky_decompose where there is a decomposition on parameters. The one I naively came up with prints that the posterior is improper. Do you know of one? Or maybe @betanalpha or @bgoodri?
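For anyone following along, the `extern template` mechanism looks roughly like this. `sum_of_squares` is a made-up stand-in for a Math library function, not the actual test code. In a real setup the template and the `extern template` declaration would live in the header, and the explicit instantiation definition would sit in a single separately compiled .cpp. They are shown in one file here only so the sketch compiles on its own.

```cpp
#include <cassert>
#include <vector>

// "Header" part: a templated function, standing in for something
// like cholesky_decompose.
template <typename T>
T sum_of_squares(const std::vector<T>& x) {
  T total = T(0);
  for (const T& v : x) total += v * v;
  return total;
}

// Explicit instantiation declaration (C++11): promises that the double
// version is instantiated elsewhere, so translation units seeing this
// do not need to instantiate it themselves.
extern template double sum_of_squares<double>(const std::vector<double>&);

// "Library .cpp" part: explicit instantiation definition. This is the one
// place the double version actually gets compiled. It must come after the
// declaration when both appear in the same translation unit.
template double sum_of_squares<double>(const std::vector<double>&);
```

A call such as `sum_of_squares(std::vector<double>{1.0, 2.0, 3.0})` then links against the precompiled instantiation. Because the template body is still visible in the header, a compiler is still permitted to inline it if the function is declared inline, so this alone does not fully answer the inlining question.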
Seems like this doesn’t work with default initialization values, at least in the simple case where I just have one parameter that is a cholesky_factor_corr… Tried googling and didn’t find much for how to properly initialize a cholesky_factor_corr parameter. Any tips?
L ~ lkj_corr_cholesky(1);
Yeah, I can confirm that it’s about the same speed:
Traditional measured slightly slower, but clearly there’s a lot of variance. Probably doesn’t help that I don’t have a dedicated benchmarking system, but I think the results show that any performance degradation will blend into noise and are compelling enough for me to continue.
That’s going to lead to a rather large object file.
What I was thinking is that we could implement bare pointer versions of the distributions that deal with both scalar and vector inputs, then call those from the templated versions by just unwrapping the raw memory and providing sizes.
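Roughly what I have in mind, as a sketch for the double-only case (no autodiff types). All names here (`normal_lpdf_raw`, `data_of`, `size_of`) are hypothetical, and `std::vector<double>` stands in for Eigen vectors to keep it self-contained. The raw function is the only part that would need to live in the precompiled object.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Non-templated core: each argument is a pointer plus a length.
// A length of 1 is broadcast against the longest argument.
inline double normal_lpdf_raw(const double* y, std::size_t n_y,
                              const double* mu, std::size_t n_mu,
                              const double* sigma, std::size_t n_sigma) {
  if (n_y == 0 || n_mu == 0 || n_sigma == 0) return 0.0;  // empty input
  const std::size_t n = std::max({n_y, n_mu, n_sigma});
  assert((n_y == 1 || n_y == n) && (n_mu == 1 || n_mu == n) &&
         (n_sigma == 1 || n_sigma == n));
  const double half_log_two_pi = 0.5 * std::log(2.0 * 3.141592653589793);
  double lp = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    const double yi = y[n_y == 1 ? 0 : i];
    const double mi = mu[n_mu == 1 ? 0 : i];
    const double si = sigma[n_sigma == 1 ? 0 : i];
    const double z = (yi - mi) / si;
    lp += -half_log_two_pi - std::log(si) - 0.5 * z * z;
  }
  return lp;
}

// Unwrapping helpers: expose scalars and vectors as raw memory plus a size.
inline const double* data_of(const double& x) { return &x; }
inline std::size_t size_of(const double&) { return 1; }
inline const double* data_of(const std::vector<double>& x) { return x.data(); }
inline std::size_t size_of(const std::vector<double>& x) { return x.size(); }

// Thin templated front end: stays in the header and just forwards.
template <typename Y, typename Mu, typename Sigma>
double normal_lpdf(const Y& y, const Mu& mu, const Sigma& sigma) {
  return normal_lpdf_raw(data_of(y), size_of(y), data_of(mu), size_of(mu),
                         data_of(sigma), size_of(sigma));
}
```

With this shape there is one compiled body per distribution regardless of how many scalar/vector combinations the templated front end accepts, which is the point of the exercise. The cost is a per-element branch for broadcasting and no specialization for the all-scalar case.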
It will in the sense that it will block inlining. Whether that’ll make a difference that we care about in big models is another matter.