Bringing math testing time & resources down

In Jenkins Updates, Issues & Requests @syclik mentioned that maybe a separate thread on bringing the testing time down should be started. This is the thread (motivated by waiting a lot for tests, both locally and on Jenkins). Also tagging @serban-nicusor as he seems to be doing stuff with Jenkins right now.

I think there is some low-hanging fruit, but maybe I’m wrong. My starting point (which is my experience, but might not be everybody else’s) is this:

Most builds fail at least one test.

So shortening the time to first failure might be a good optimization target. The current approach (build all tests first, then run all tests) is suboptimal: building takes ages, running is fast. And in the end I don’t even see the output of all tests. E.g. when I tried to resolve failures that were unique to Linux (I develop on Windows), I had to fix one failure, wait for all tests to build, and only then did I see a second failure in a different test.

So there could be a benefit from running tests as they are built, or at least from splitting the test build into more pieces (e.g. by subdirectory). The latter would also mean the build could run on multiple executors if resources are available.
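The "run tests as they are built" idea can be sketched roughly like this. This is only an illustration, not the actual pipeline code: build_cmd and run_cmd are placeholders for whatever make target and test binary the real setup would use.

```python
import subprocess

def build_and_run(tests, build_cmd, run_cmd):
    """Build each test and run it immediately; return the first failure.

    build_cmd / run_cmd map a test name to a command line (placeholders
    for the real make target and test executable).
    """
    for test in tests:
        if subprocess.run(build_cmd(test)).returncode != 0:
            return ("build", test)  # compile error surfaces right away
        if subprocess.run(run_cmd(test)).returncode != 0:
            return ("run", test)  # failure is seen before later tests even build
    return None  # everything built and passed
```

The point is simply that the first failure is reported after building one test, not after building all of them.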

To be fancier, we could run the tests related to the changes in the PR first. No code analysis is needed: a simple way would be to reorder tests by some fuzzy string match against the list of changed files, so that the tests most likely to fail execute first. This is also low risk: all tests are eventually run, we are just changing the order.
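A minimal sketch of such fuzzy reordering, using the standard library's difflib (the scoring scheme here is my guess at "some fuzzy string match", nothing more):

```python
import difflib
from pathlib import PurePath

def reorder_tests(tests, changed_files):
    """Sort tests so that those whose names best match a changed file run first."""
    changed_stems = [PurePath(f).stem for f in changed_files]

    def score(test):
        stem = PurePath(test).stem
        # Highest similarity ratio against any changed file name.
        return max(
            (difflib.SequenceMatcher(None, stem, c).ratio() for c in changed_stems),
            default=0.0,
        )

    # All tests are still run; only the order changes.
    return sorted(tests, key=score, reverse=True)
```

For example, a PR touching stan/math/prim/fun/log_sum_exp.hpp would float log_sum_exp_test ahead of unrelated tests.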

Does any of that make sense?


What about storing the last few Jenkins directories, and then unpacking the diff into them? Make would pick up the changed timestamps and rebuild only the files that changed.

I’m all for seeing the testing pipeline fail sooner when there are errors (I think that if you develop on Windows you are slightly luckier, as there are very early Linux stages that may catch compilation issues you might not have seen).

The idea of changing the order of tests is interesting, but I think it applies mainly to the distribution tests (which is fine, as that’s one of the bottlenecks of the whole pipeline), right? I’m not sure how to convince make of this, but it’s probably doable. As for starting to run the tests for a given distribution before all tests have compiled, I think that would already be an improvement over what we have now.

Much of the pipeline is already using multiple cores (distribution tests are run with -j25 for example), so I’m not sure there’s much to do on this front.

I can offer a couple of hints that have saved me a lot of time by running selected tests locally:

As for other ideas for speeding up testing in general, ccache has come up a few times, but as far as I know nobody has actually tried whether it brings any benefits in our setup (I don’t have any experience with that). Other discussions: Speeding up testing.

Somewhat tangentially, there have been recent mentions of avoiding running tests when the only changes are in the doxygen directory, which should be easily achievable. Potentially that could also be applied to the .github and licenses directories, but overall this would affect a minority of PRs. Reintroducing something like ci-skip could help with this.
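The skip decision itself is trivial to express; a sketch, assuming the skippable set is exactly the doxygen/.github/licenses directories mentioned above (the real set would need discussion):

```python
from pathlib import PurePosixPath

# Directories whose changes shouldn't trigger the test suite (illustrative set).
SKIPPABLE_DIRS = {"doxygen", ".github", "licenses"}

def can_skip_tests(changed_files):
    """True only if every changed file lives under a skippable directory."""
    return bool(changed_files) and all(
        PurePosixPath(f).parts[0] in SKIPPABLE_DIRS for f in changed_files
    )
```

The conservative detail is the empty-changeset case: with no file list available we run the tests rather than skip them.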

First I should note I have almost zero experience with Jenkins and make, so I have no idea what is simple, and I am just guessing.

Most of my recent test failures actually come from unit tests. Even a split by prim, rev, mix, … could IMHO help.

But the builds could also be split among executors (I am not sure to what extent we generally have free executors, so this might not work very well).

That’s great, thanks! I didn’t know of those.

One more (possibly) low-hanging fruit is to install OpenCL on other executors (both AMD and Intel have OpenCL implementations targeting their CPUs), which could avoid the “Headers check with OpenCL” bottleneck I’ve been seeing quite a bit lately. Tests that actually run OpenCL code should probably still run on actual GPUs, as the CPU OpenCL platform sometimes behaves differently than GPUs, but for the headers check it should IMHO be OK.

But really, just my two cents, I am glad I don’t have to manage the CI and grateful for the work you guys do on it.

There are typically no free executors, so I am skeptical this would help. If we had many executors it would be a good idea, I agree.

When we added the headers check with OpenCL I could not decide whether to put it in a separate stage or just do the check in Full Unit with GPU. For a Jenkins test run for a PR that passes there is not much difference. However, if you are first and foremost interested in seeing whether the Linux unit tests pass, this might actually be a bottleneck. I haven’t really thought of that.

My proposal would be to change the pipeline to the following (after the “Linting & Doc checks” stage):

  • Headers check
  • Linux Unit with MPI
  • Parallel stage with: Full unit with GPU (with OpenCL header check), Distribution tests, Threading tests, Windows Headers & Unit, Windows Threading


  • Most of the time Linux Unit with MPI finishes in 20 minutes (unless it gets assigned to one specific worker, then it’s an hour) and, if I am not mistaken, most of the executors are Linux, so this should not be such a bottleneck
  • a lot of jobs wait a long time for the GPU device. This way we reduce the number of GPU job executions (jobs that fail on Linux will not run on the GPU)
  • the wait for GPU executors is done in parallel with the distribution tests
  • if the Linux unit tests pass, most of the time the run will go green (provided there aren’t any false positives due to CI issues)

Additionally, we really need to remove the gelman Linux machine from the distribution test workers, as that job runs for 5–6 hours there.

Does that sound good?


Thanks for bringing this up! Yes, this is exactly what I was thinking about. If we always started with the state of develop (specifically, the .o files and the executables), make is designed to traverse dependencies and determine the minimal set of things that need to be rebuilt. As long as timestamps are respected, we could make that work. The dependencies are already generated as .d files when you build any of the test targets from make.

Unfortunately, in an effort to make it easier on developers, we’ve relaxed the goal of minimal includes. So this would work for some of our older unit tests, but the newer ones wouldn’t see a benefit because they tend to include everything at once. With some effort, we could get back to that standard, but it won’t be free and it’s hard for tools to figure it out given how we instantiate our different modes of autodiff.

That definitely could work!

Right now, any change to any header file, even a trivial one, will rebuild ALL the distribution tests, because we now #include <stan/math/mix.hpp> and that ties the whole codebase together.

If we wanted to reorder the compilation and running of the distribution tests, it would be best to do that outside of make. (Make isn’t meant for this sort of ordering. It does what it does really well and everything else, not so well.)

We could split this into as many jobs as we need! Especially if we know which ones run fast and fail fast.

