The Future of the Math Library and the Bottlenecks in Development

The Math Library

Currently, we have a stable Math library used to write functions, evaluate them at different inputs, and compute the gradients of functions with respect to those inputs. The Math library provides the back-end for the Stan language; for every function in Stan, there’s a corresponding Math library function. The Math library is broader in scope than any other C++ library for automatic differentiation (as far as I know; see the arXiv paper). On the technical side, it was designed as a header-only C++ library with no threading or GPU dependencies. (It still is… if you don’t use integrate_ode_bdf().) For that design, it’s as fast as or faster than every other C++ automatic differentiation library out there.
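To make "get the gradients of functions" concrete, here is a toy hand-written reverse-mode sketch — NOT Stan Math's actual API or tape, just an illustration of the idea: a forward pass records intermediates, then a reverse sweep accumulates adjoints via the chain rule.

```cpp
#include <cassert>
#include <cmath>

// Toy reverse-mode autodiff sketch for f(x, y) = x*y + sin(x).
// Stan Math does this generically with an expression tape; here the
// "tape" is unrolled by hand for one fixed expression.
struct Gradient {
  double df_dx;
  double df_dy;
};

inline Gradient grad_f(double x, double y) {
  // forward pass: compute and remember intermediates
  double a = x * y;        // a = x*y
  double b = std::sin(x);  // b = sin(x)
  (void)(a + b);           // f = a + b (value itself not needed below)
  // reverse sweep: seed df/df = 1, propagate adjoints backwards
  double f_bar = 1.0;
  double a_bar = f_bar;    // ∂f/∂a = 1
  double b_bar = f_bar;    // ∂f/∂b = 1
  double x_bar = a_bar * y + b_bar * std::cos(x);  // chain rule into x
  double y_bar = a_bar * x;                        // chain rule into y
  return {x_bar, y_bar};
}
```

For example, `grad_f(2.0, 3.0)` gives df/dx = 3 + cos(2) and df/dy = 2.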

But, we’re really not leveraging modern computing architecture well. Every computing device has multiple cores. A lot of laptop and desktop computers have GPUs. HPCs definitely have these things. This is how other libraries, under certain conditions, can be faster than Stan Math for the same operations.

In broad strokes, there are really two sensible directions for the Math library to progress:

  1. breadth, meaning adding more functionality

  2. speed, meaning reducing wall time.

I think the next innovation in the Math library is going to be in the speed department. We should be able to leverage multiple threads and GPUs to get the most out of our computing power. Imagine Stan models being able to go 10x, 100x, 1000x faster, all without having to change the Stan model itself.

How do we get speed?

  1. more threads; split computation across multiple cores (and not just at the map_rect() level)

  2. more GPU; offload computation to a device designed to do a simple job, but fast

  3. analytic gradients; skip automatic differentiation for quantities we already know how to compute in closed form

  4. forward mode automatic differentiation; instead of computing a gradient, compute a directional derivative. This doesn’t scale well in general, but it’s embarrassingly parallel.
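To illustrate item 4, here is a minimal dual-number sketch — a toy analogue of `stan::math::fvar`, not the real type: one evaluation yields one directional derivative, so a full gradient needs one pass per input, but those passes are independent, which is what makes it embarrassingly parallel.

```cpp
#include <cassert>
#include <cmath>

// Minimal forward-mode sketch: a dual number carries a value and a
// tangent; arithmetic propagates both at once.
struct Dual {
  double val;  // value
  double tan;  // tangent (directional derivative)
};

inline Dual operator*(Dual a, Dual b) {
  // product rule: (ab)' = a'b + ab'
  return {a.val * b.val, a.tan * b.val + a.val * b.tan};
}
inline Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.tan + b.tan}; }
inline Dual sin(Dual a) { return {std::sin(a.val), std::cos(a.val) * a.tan}; }

// f(x, y) = x*y + sin(x); seeding x.tan = 1, y.tan = 0 yields df/dx.
inline double df_dx(double x, double y) {
  Dual xd{x, 1.0}, yd{y, 0.0};
  return (xd * yd + sin(xd)).tan;
}
```

Seeding `y.tan = 1` instead would yield df/dy; each seed is an independent evaluation, so the passes can run on separate threads.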

Bottlenecks to developing speed in Math

  1. Technical debt

    We’ve amassed enough technical debt in Math that it’s hard to move quickly. We need to pay this down in order to enable developers to move faster. Some specific things that we could do:

    • a testing framework to make it easier to validate new and existing code; without one, refactoring is a true pain. (We’ll need this to tackle threading and everything else.)

    • documentation and testing of the internal functions, to make it easier to on-board new developers and to keep existing developers from reinventing the wheel

    • benchmarking suite

    • simplify the code base

    • remove dead code

    This stuff is hard to get volunteer effort on. I’m super-grateful for the people tackling this right now. We’re really indebted to these developers (and I’m probably missing a few more): @increasechief, @rok_cesnovar, @stevebronder, @seantalts, @bob_carpenter, @bbbales2 (for the start on the testing framework!!!). If you’re looking for a way to help, this is really super-important and much appreciated, even though it has less visible impact than adding new features.

  2. Expertise

    To really be an effective Math developer, you need a combination of skills: C++ templating, math (calculus), numerical computation, software design. And often that’s not enough… there’s also threading, MPI, building and linking executables, GPU, ODE integration, numerical approximation, just to name a few. These things don’t usually all live in a single human being. We often need to collaborate on features to get them in. When we don’t, we get contributions that lead to more technical debt that we have to pay down later.

    An example I’ve been thinking about lately is threading. I do believe this will play a big part in making the Math library faster. It’s taking us a while to get @wds15’s prototype into the Math library (which is really a lot of work and has the potential to be awesome). I feel bad that it’s taking this long, but we want this to be right. The reason we want to take our time is that once we have a reasonable design for threading, we can have other developers that don’t have threading expertise help with features because the patterns have been laid out clearly. If we don’t take the time and we introduce something that’s not laid out well, we limit other developers from helping, including those with threading expertise.

    This is true for other features too.

  3. Computation costs for testing

    We’re just resource-limited here. More resources would let us test at will, and quickly too.

If you want to help

We would really appreciate the help. If you’re looking to help, just reach out. I’ll try to keep a running list of active projects on Math where we could use the help.

Of course, you can go and build new features. That’s necessary and appreciated too.

Ideally, we can start paying down the technical debt so the Math library starts progressing more quickly.


I’d like to help out. Is there something specific you recommend I work on?


@syclik — thanks for writing this up.

I plan to start working on finishing the autodiff testing framework.

Higher-order autodiff?

What is the priority for forward-mode autodiff for higher order derivatives? I see two reasonable options:

  1. Remove it and keep concentrating on reverse-mode gradient calculations

  2. Finish it, which involves

    • testing what we have,
    • figuring out how to build with it.

We can optimize later—there’s a lot of room for analytic derivatives.

What are the applications?

  • Riemannian HMC

  • Laplace approximations

  • Newton Optimizers

Any others?

I believe RHMC has some open adaptation issues. And all of these raise build (make) issues if we want to figure out how to put them into interfaces like rstan, and we also need to figure out how to handle the functions that lack higher-order support.

@bgoodri also brought up the point that Boost is releasing its own forward-mode autodiff.

Say more about this? I’m skeptical we want to do any kind of “autoparallelization” if that’s on your mind… have had lots of bad experiences with that in my past life in industry.

Under @syclik’s “simplify the code base” line item, I think we have long talked about coalescing scal, arr, and mat in the codebase. See this issue. This should really help with developer onboarding and maintenance by reducing coupling and just the sheer number of interacting components.

We could also use someone to take stock of the metaprograms and try to generalize them, upgrade them to C++14 (or get rid of ones that C++14 provides), and generally apply a coherent sense of design to them. These are all located in the 12 meta directories; most will have an entry in prim/scal at least.
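As an illustration of the kind of cleanup meant here (the "legacy" trait below is a made-up stand-in, not an actual Stan Math metaprogram): pre-C++11 codebases often carry hand-rolled traits that C++14's `<type_traits>` now provides directly, so those metaprograms can simply be deleted.

```cpp
#include <cassert>
#include <type_traits>

// What a legacy hand-rolled metaprogram might look like (hypothetical):
template <bool B, typename T = void>
struct enable_if_legacy {};
template <typename T>
struct enable_if_legacy<true, T> { typedef T type; };

// C++14 already provides the same thing as std::enable_if_t:
static_assert(
    std::is_same<enable_if_legacy<true, int>::type,
                 std::enable_if_t<true, int>>::value,
    "the legacy trait is redundant with std::enable_if_t");

// Using the standard alias directly; no custom metaprogram needed.
template <typename T>
std::enable_if_t<std::is_arithmetic<T>::value, T> twice(T x) {
  return x + x;
}
```

The payoff is fewer interacting components to document, test, and maintain: every deleted metaprogram is one less thing a new developer has to learn.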

[edit] oops, just saw that @rok_cesnovar is planning to tackle the metaprograms (he made an issue here).

I started by writing the missing tests first; anyone who wants to help is more than welcome.

Making a list of possible candidates that can be upgraded or removed with C++14 would be a great start. It could be either as a separate issue or a comment to the existing issue. If anyone wants to tackle any of the commenting/tests stuff just comment on the issue.

Have you made the issue? If so, drop me a link. Otherwise, what do you need tests written for?


Great, thank you. I’ll look into this.

Woo hoo! I can leave some notes behind on the issue. I’ve probably put in some 40+ hours, but haven’t gotten really far. Template metaprogramming is tough, but not impossible.

Great question. Right now, I’m prioritizing getting the Math library cleaned up over higher-order autodiff. Then forward mode (stan::math::fvar<double>). Then hopefully we can get it working for higher-order. We really need the testing framework to enable this. My thought right now is that if we get forward mode working, then we can add tests pretty easily for mixed mode and we’ll be able to determine where the bugs are (if we have the test framework, we can essentially drop in the forward tests with the mix framework and make sure that passes).

You’ve listed the ones I know about. I’m sure other users have different uses.

Yes. I’ve been following the thread, but I haven’t determined whether it’ll do the job. My guess: if we want to go to higher order autodiff, it won’t be templated well enough. Also… I don’t think it has nearly enough functions, but I could be wrong.

Not what Wikipedia says about automatic parallelization.

I’m thinking that there will be opportunities for us to do within-math-function parallelization for some of the gnarly functions. For example, there’s no reason why we can’t have a threaded version of cholesky_decompose() analogous to the GPU code. Just thinking longer term, I’m imagining there will be ways to put things on different autodiff stacks to compute partial derivatives that we stitch back together like map_rect().
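A toy sketch of the within-function parallelism described above — a simple threaded reduction, not an actual threaded `cholesky_decompose()`: split the work across `std::thread` workers, each writing to its own slot, then stitch the partial results back together, in the same spirit as `map_rect()`.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Split a sum over contiguous chunks, one chunk per thread.
// Each worker writes only to its own partial[t], so no synchronization
// is needed beyond the final join.
inline double parallel_sum(const std::vector<double>& x,
                           std::size_t n_threads) {
  std::vector<double> partial(n_threads, 0.0);
  std::vector<std::thread> workers;
  std::size_t chunk = x.size() / n_threads + 1;
  for (std::size_t t = 0; t < n_threads; ++t) {
    workers.emplace_back([&, t] {
      std::size_t begin = t * chunk;
      std::size_t end = std::min(x.size(), begin + chunk);
      for (std::size_t i = begin; i < end; ++i) partial[t] += x[i];
    });
  }
  for (auto& w : workers) w.join();
  // stitch the partial results back together
  return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

The same shape — independent workers, then a deterministic combine step — is what would let partial derivatives be computed on separate autodiff stacks and merged afterwards.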

(What I don’t want to evaluate anytime soon in the vein of “autoparallelization” is altering the reverse sweep traversal by inspecting the stack; I think that’s going to be done much better from the language representation, doing graph analysis on the AST. Just so we’re clear, I’m not wanting to go in this direction.)

This is definitely something I want to do sooner rather than later.

For anyone that’s volunteering to help, I want to give fair warning: this is going to be tricky and meticulous. Very meticulous and very tricky. It would really be useful to have at least some of these skills:

  • shell programming (we want to script this)
  • regex skills (sed / awk would do since we’re going to be replacing things in files)
  • being able to trace through preprocessor dependency generation (see the -M flag in g++)
  • template metaprogramming and testing of them; we don’t have enough tests in place to do a refactor of this scale easily, so we’ll be writing a lot as we go
  • refactoring… the concept of it, understanding how to do it, just being careful
  • strong git skills; we’ll probably need to do some git bisecting as we do this

If you have any of these skills, it would be really cool if you could help. This would be much better as a team effort. It’s going to touch the whole codebase, and having more eyes on it will be helpful. I imagine this is going to be fairly painful: the makefile will probably break in the process. But once we get it done, the code base will be a lot cleaner.

I’ll start putting more notes on the issue that @seantalts linked to. stan-dev/math#937

This will definitely be a lot easier to tackle first. @rok_cesnovar has started the process (thanks!!!). As @seantalts mentioned, we can simplify a lot of this code with C++14 in many cases and we should. This also makes flattening things a lot easier.

The issue is here: stan-dev/math#1122.
