Stan Performance Test


End-to-end errors can accumulate, but I’m not sure what context you’re thinking of.


Interpretation of the IEEE floating point standard is what makes it not end-to-end reproducible. Except for Intel – their compilers don’t follow IEEE rules in fast-math mode.

Interestingly enough, the pseudo random number generator that we use actually generates identical output across Windows, Linux, and Mac with recent compiler versions.

In the past, I’ve actually been able to run end-to-end tests with seeds that work across Linux, Mac, and Windows. When I was building the service layer, I found that it really comes down to IEEE floating point rules. Once the Hamiltonian trajectories are off even slightly, say by 1e-10, then over the next 100 iterations they end up in different spots.
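The accumulation is easy to see even at the level of basic arithmetic: IEEE doubles aren’t associative, so any freedom the compiler has to reorder a sum changes the last bits, and the trajectory then amplifies that. A minimal Python illustration (not Stan code, just the arithmetic point):

```python
# IEEE-754 doubles are not associative: reordering a sum changes the
# low-order bits, which HMC trajectories then amplify over iterations.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False
```

Different compilers and optimization levels are free to pick different orderings, which is exactly the 1e-10-level divergence described above.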


Context, sorry: Running a full model from CmdStan with a given seed and data and checking parameter means and stdevs - these can differ by many thousands from machine to machine (both on clang, both on OS X).

Daniel, that’s interesting that the standard mandates non-reproducibility - you’d figure each library or chip producer would pick some particular interpretation of the standard and that would be deterministic.

If we limit ourselves to the goal of reproducibility with these constraints:

  1. modern Intel CPU
  2. specific clang version
  3. OS X (maybe the exact same version?)

What would the procedure be there? Is that fixed enough? Are there specific flags that should be passed?


Okay, I borrowed @betanalpha’s testing method from stat_comp_benchmark and came up with something that I can twiddle to pass with the current setup. I made a new Jenkins job here:

It’s working on all of the vol1 BUGS models right now; I need to expand it to all of the example models, but that means I need to finish figuring out which ones work and which ones don’t.

When this is done, we’ll be compiling, running, timing, and loosely validating output for almost all of the example models repo. We can use this to test big refactors (it outputs diffs in exact numerical outputs, in addition to the loose pass/fail bounds check), like Mitzi’s type system rewrite, and it gives us both a much clearer picture of performance regressions and something to benchmark against when developing. So excited! This only took me like a day and a half.


The stat_comp_benchmark repo was never designed for end-to-end tests. There is a very old branch, aimed at end-to-end testing, that runs a model multiple times to probabilistically verify a central limit theorem on various expectations for simple models.

I repeat my strong objection to requiring exact numerical results. MCMC algorithms are probabilistic and guarantee nothing but probabilistic results, even in ideal circumstances. Even seemingly trivial changes to the internal code, such as adding a new RNG call, removing an RNG call, or even swapping the order of RNG calls, can cause the resulting Markov chain to drift away (decouple) from the old Markov chain history very quickly. Changes that do not affect the validity of the algorithms will then break any tests that enforce bitwise reproducibility.

If such “golden tests” are going to be implemented not as validation of numerical results but rather simply as tests to identify changes, regardless of whether the changes are good or bad, then they should not be enforced until there are well-documented scripts that automatically regenerate the tests based on a new branch. If the golden tests are run on Jenkins, this would require some mechanism to automatically launch that script on Jenkins and return the resulting test file to be checked in (or perhaps just a hook that updates the test instead of running it).


With you 100%. Just for clarity, we’re talking about two separate things (that are tested together), one is performance and the other is this gold / something approaching exact output test. For performance I think it’s fine if we pin a compiler/machine combo and just use it to spot regressions.

For golds, I agree that a lot of things could spuriously break the gold tests. I think the current gold test in the logistic regression performance test is particularly annoying in that it only runs on one of the Jenkins computers, so people have to throw up a PR and have it tested before they even know that the output changed. My #1 priority is being able to run and regenerate the golds with a single command on one’s own computer. Worst case, I might just make a Vagrant or Docker image that is the authority for the purposes of the gold tests and that everyone can easily run locally.

Right now I have it working so that the answers are okay within a tolerance similar to the one you used for stat_comp_benchmark, and it’s runnable locally, which I find pretty exciting even at that weak level. I also have a diff mode that shows the exact differences in output but doesn’t fail tests when they differ.


With higher-precision arithmetic, like 128-bit or 256-bit. That’s what Boost does for their math functions. We get about 1e-16 precision with 64 bits, but we only test to around 1e-8. We seem to need to keep arithmetic accurate to at least the 1e-6 level; around there, Hamiltonian simulation starts to degrade in situations like the ODE solver, where I can control the tolerance.
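For context on why test tolerances sit well above machine precision: float64 gives about 1e-16 relative precision at best, and a naive formula can lose nearly all of it to cancellation, which is why reference values generated at higher precision (Boost’s approach) are so useful. A small Python illustration of the loss, not Stan code:

```python
import math

x = 1e-8
# naive formula: catastrophic cancellation -- cos(x) rounds to exactly 1.0
# in float64, so the numerator vanishes entirely
naive = (1 - math.cos(x)) / x**2          # 0.0
# algebraically equivalent, numerically stable rewrite via the half-angle
# identity 1 - cos(x) = 2 sin^2(x/2); true value is ~1/2 - x^2/24
stable = 2 * math.sin(x / 2)**2 / x**2    # ~0.5
print(naive, stable)
```

The naive version returns 0 digits of accuracy; the rewrite returns nearly full precision. Higher-precision reference arithmetic catches exactly this kind of failure.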

There’s a chapter in the manual. You basically have to pin down everything, including day of the week (just kidding on that last one).

I agree—we can’t be testing exact numerical values. Those aren’t even well defined. I think we should instead be testing to within MCMC std error tolerances or against posteriors we can calculate analytically.


The test I had originally envisioned can be found at

Even if you know the mean and standard deviation analytically, testing the MCMC CLT for a single chain is not particularly useful – if you use a small z-score threshold then you’ll flag false errors all the time, but if you use a large one then you’ll miss small deviations. So what I did instead was run many chains and compute whether or not each MCMC estimator exceeds a certain quantile, assuming a CLT. This becomes a binomial process with success probability equal to the quantile probability, which can then be tested. I think I may have even gone a step further and repeated this multiple times so that the distribution of p-values from the binomial tests could itself be tested, allowing for very careful control of the false negative rate without compromising the true positive rate much.

The problem is that the code called CmdStan, but now we’d have to run entirely in C++, which is a mess given the current state of the code, especially the var_contexts. This was on my to-do list after cleaning up all the var_context code, but it sits after a million other more important things on that list.


Why not still use CmdStan? We have other upstream tests in place if you think that’ll be easier.

I’d like to understand this. I take it you’re saying this has better sensitivity and specificity properties for detecting errors than just a single MCMC run. Is that because the multiple runs reduce variance somehow? I need to get better at thinking through all this math.

Nothing changed in the var_contexts—we can still build one out of a file, right? Or did that move to CmdStan?

It would certainly be helpful to have an easy way to instantiate a var_context. We’re going to need to do that in order to restart—we need to take variables out of the output and use them to create an initialization. I think @mitzimorris may have written something that helps with this, but I’m not sure.


So I have a version of the performance test working on a branch. Its current incarnation is a new Python script in CmdStan, but I have a bunch of questions about how I’m doing it and would like feedback. Here is the current configuration:

  1. Top-level script, new tests/golds directory for gold files (currently not used, need to figure out how/if we can make those reasonable. Maybe need a new thread for this).
    a. Name ok?
    b. Should there be some kind of new directory for just performance tests?
    c. Should it instead be in some kind of other unrelated repo? I’m kind of against this because 1) repo sprawl really hurts onboarding and organization and 2) it’s already pretty difficult to be sure of what version of everything you’re testing, if all of the different submodules are clean, etc.
  2. Testing all of the bugs_examples models in the example models repo, which I have as a new submodule in the examples directory in the CmdStan repo.
    a. Is that a good place to put them?
    b. Are those fairly indicative models? I’m working on adding the rest of the ones in this repo. I think we should probably choose a canonical repo (this seems like a good one) and aim to put all public models we know about here.
  3. It’s running a single chain as a single process, and I could have it either run multiple models at once or run the same model with 4 processes, to more accurately simulate the default workflow. Thoughts on this choice?

I’m pretty excited; this has already enabled me and a new contributor to start actually looking at performance in a more holistic way and find a few areas of low-hanging fruit where it can be improved. A key new insight here for me is that having an easy, visible metric really helps new contributors get started and know where to contribute. Of course there are always issues defining a specific metric, but in Stan performance really is important and luckily easy enough to measure.


This sounds great from a high-level perspective.

I don’t think all those models fit stably, so it’s probably going to make more sense to build up from a few tests than down from all these unchecked models.

They’re also not up to date with current Stan coding practices (maybe that’s good for performance testing real user programs).

Given that you need to evaluate n_eff/time, it makes sense to me to have at least four chains running.

This is really exciting!


I was just going to use this to catch performance regressions and do benchmarking of one commit against another. Does n_eff/time add value over just time here? I get that we can’t compare those times to BUGS or whatever directly but that’d be another type of test.


Only for evaluating algorithmic changes, not for evaluating the log density and gradients.


Wait, there’s more. If the goal is just to evaluate the log density and gradients, you’re better off doing that by just evaluating the log density and gradients and not even trying to fit the model. If you fit the model, there’s no guarantee you’ll get the same number of log density evaluations.
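A sketch of what isolating that measurement could look like – here `grad_fn` is a hypothetical stand-in for a model’s log density and gradient evaluation, not the actual Stan interface:

```python
import time

def bench_grad(grad_fn, theta, reps=10_000):
    # Time repeated log-density-and-gradient evaluations directly, so the
    # evaluation count is fixed rather than whatever a fit happens to do.
    t0 = time.perf_counter()
    for _ in range(reps):
        grad_fn(theta)
    return (time.perf_counter() - t0) / reps

# usage with a toy stand-in for a model's gradient function
avg = bench_grad(lambda th: [2 * t for t in th], [0.1, 0.2, 0.3])
```

Because the number of evaluations is pinned, two commits can be compared head-to-head without the sampler’s adaptive behavior adding noise.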


I think I’m seeing 3 different types of test for maybe 4 or so use-cases.

Use cases:

  1. Make sure we don’t introduce any performance regressions / generally understand how performance changes over time.
  2. Compare Stan algorithms against other versions of themselves, against each other, and against other algorithms (n_eff/s metric is most appropriate here)
  3. Profile “Stan performance” interactively and holistically to find and evaluate performance improvements
  4. Make sure we understand when we’re changing numerical output, and keep a record of it (i.e. gold tests, we can avoid talking about this use-case for now).

I think the shotgun approach my current script takes is a decent stab at #1 and #3 (and maybe the same script can help with #4 at some point). There could be reasons to profile or measure just a single log_prob evaluation (vs setting a seed for a fit, more on that below), but given that we almost always want to use the results in Stan’s HMC, I think it’s probably best to benchmark in as realistic an environment as possible. This might argue for switching to 4 chains, in case that impacts cache performance cross-core. Right now I have it running 4 models at once, but those will blow up each other’s caches in a way 4 chains might not.
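For the 4-chains-per-model variant, a sketch of launching one CmdStan process per chain – the `./bernoulli` executable and data file here are placeholders, and the argument syntax is my recollection of the CmdStan CLI, so it should be checked against the version in use:

```python
import subprocess

def chain_cmds(model_exe, data_file, seed, n_chains=4):
    # one CmdStan invocation per chain; chains share a seed but get distinct ids
    return [
        [model_exe, "sample",
         "data", f"file={data_file}",
         "random", f"seed={seed}",
         f"id={i}",
         "output", f"file=output_{i}.csv"]
        for i in range(1, n_chains + 1)
    ]

def run_chains(cmds):
    # launch all chains concurrently, then wait -- this is where the
    # cross-core cache contention of the default workflow shows up
    procs = [subprocess.Popen(c) for c in cmds]
    return [p.wait() for p in procs]

cmds = chain_cmds("./bernoulli", "bernoulli.data.json", seed=1234)
```

Running the four chains of one model together should contend for cache more like a real user run than four unrelated models do.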

Another good thing to have would be an option to this script (or one like it) that outputs the n_eff/s metric instead of just time. I’m not sure how to get Jenkins to display arbitrary metrics while there’s a decent amount of builtin stuff for response times / benchmarks like this. But I’m sure it can be done.

Won’t we get the same number of log_prob evaluations (and in fact, everything should be deterministic on the same machine, compiler, and compiler flags, right)?


Because upstream tests are a pain to maintain with so many failure modes? Seems natural to check the performance within the stan repo itself and avoid those complications.

Basically, yeah. We want really low false positive rates so that we don’t get into a “boy who cried wolf” circumstance where we start ignoring test failures. But for testing the MCMC CLT that would require something extreme like

\left| \frac{\hat{f} - \mathbb{E}[f]}{\sigma} \right| < 5

which is so loose that it would only be able to detect pretty serious problems and even then the sensitivity would be highest in the tails.
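To put a number on how loose that is: for a single normal z-statistic, the two-sided tail probability at a threshold of 5 is tiny, which is exactly what buys the low false positive rate at the cost of sensitivity.

```python
import math

def two_sided_p(z):
    # P(|Z| > z) for a standard normal, via the complementary error function
    return math.erfc(z / math.sqrt(2))

p5 = two_sided_p(5.0)  # about 5.7e-7: essentially no false positives,
                       # but also essentially no power against subtle bias
```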

But if we run an ensemble of chains then we can test all of them together. For example, we can test the performance in the bulk of the ensemble by checking whether each chain’s estimator passes the true median or not, and then turning this into a Binomial test,

k_{\text{chains above}} \sim \text{Binomial}(N_{\text{chains}}, 0.5)

And then we can test deviations from this Binomial distribution.
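A stdlib-only Python sketch of that ensemble check – i.i.d. normal draws stand in for MCMC output here, and a real version would use actual chains and an exact or simulated null distribution:

```python
import math
import random

def binom_two_sided_p(k, n, p=0.5):
    # exact two-sided p-value: sum the probabilities of every outcome
    # that is no more likely than the observed count k
    pmf = lambda i: math.comb(n, i) * p**i * (1 - p)**(n - i)
    pk = pmf(k)
    total = sum(pmf(i) for i in range(n + 1) if pmf(i) <= pk * (1 + 1e-12))
    return min(total, 1.0)

random.seed(7)
n_chains, n_draws = 100, 200
# count chains whose estimator lands above the true median (0 here);
# for an unbiased sampler this is Binomial(n_chains, 0.5)
k = sum(
    1 for _ in range(n_chains)
    if sum(random.gauss(0, 1) for _ in range(n_draws)) / n_draws > 0.0
)
p_value = binom_two_sided_p(k, n_chains)
```

A consistently tiny p-value across reruns would flag a biased sampler, while any single chain’s z-score test would need an absurdly loose threshold to keep the same false positive rate.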

Nothing changed but it’s a supreme pain in the ass to use within the C++ directly and can’t be done entirely in memory.


This sounds … great? Considering we have literally no end-to-end tests whatsoever, a test that is flexible and detects pretty serious problems seems like a good place to start.


The concern is that the algorithmic issues we have encountered in the past were all very subtle ones that wouldn’t be caught by this broad test. Yes, it would give us some coverage against disastrous mistakes (which would also be caught immediately by other means) and is a fine place to start, but it wouldn’t give us the guarantees we’d need to ensure that we don’t repeat the more subtle mistakes that lingered for a while.


Sounds good. Worst case, we can also do what @syclik wrote about on the wiki – only ever test two git commits against each other on the same machine, compiler, etc. This should always work, though it doesn’t have one of my favorite benefits of gold testing: having the git log of how and when numerical output changed alongside the code that changed it.


I think it’d be better to regression test autodiff and the language code generation using just log density and gradient evals. This testing is relatively easy.

Then, we test algorithms using only models we know to give the right results. This testing is much harder. It’s also harder to disentangle the sources of results, but perhaps not if we run them all the time.

Definitely much better to test locally, but sometimes it’s just easier to do it another way. I don’t have a strong opinion here.

Sounds like a great idea. And 0.5 is the optimal point for power of that test, I think (still need to work out the math on that—it’s the highest variance point for the binomial and also the one that’s balanced).
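The math does check out: the Binomial(n, p) variance is n·p(1−p), whose derivative in p is n(1−2p), which vanishes at p = 1/2. A quick numerical confirmation:

```python
# scan p on a grid; the per-trial variance p*(1-p) peaks at p = 0.5
ps = [i / 1000 for i in range(1001)]
best = max(ps, key=lambda p: p * (1 - p))
print(best)  # 0.5
```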

That may be easier to start with, but I don’t think it’s a good long-term solution. What we’re trying to test is statistical behavior, not exact bit-level reproducibility on a machine.