Discussing tests in Stan math

The thread I started at Request for Mathematica code review: generating test cases devolved into discussion about general principles for tests in Stan math. As the title (and start of the discussion) is about something else, I am moving the discussion here to make it better visible and findable and so that more people can join if they like. This post recapitulates the discussion since it started to change topic and has been slightly edited for brevity.

Background: I’ve been working on addressing some issues in neg_binomial_lpmf and related functions. Part of the tests I’ve written involve computing high-precision values in Mathematica and test against those.

@bgoodri mentioned:

I would just reiterate a couple of things that I have said before that few people agree with:

  1. I think we should be testing that known identities hold. For example, since \ln \Gamma\left(x\right) = \ln \Gamma\left(x + 1\right) - \ln x for all positive x, plug in an x to that equation and verify that the equation holds and that all derivatives of interest are the same on both sides.
  2. Even better are identities that hold for integrals involving the function in question over a range of values of x, such as http://functions.wolfram.com/GammaBetaErf/LogGamma/21/02/01/0001/MainEq1.gif .
  3. If we want arbitrary precision tests to compare to, then the gold standard is Arb rather than Mathematica, because the ball arithmetic has bounds that are correct. And Arb is already in C.

My response:

I am trying to approach this from a pragmatic viewpoint. I think developer efficiency in writing tests is really important. My interest in this has been sparked by the fact that many (most?) functions/codepaths in Stan math currently have no or only very weak tests for numerical precision.

IMHO you are making perfect the enemy of the good here. I totally agree that tests against precomputed values shouldn’t be the only tests we do. In my recent struggle with neg_binomial_2 and surrounding functions, both approaches let me to identify some of the issues.

For me the biggest advantage of precomputed values is that those tests are quite easy to create. Let’s say I fear that lbeta fails with one very large and one very small argument. So I precompute values for this case in high precision and check. Not bulletproof, but quick. I am sure a clever formula-based test would show this as well, but I can’t make up one quickly.

Another issue with formulas is that I always have to worry about the numerics of the formula itself.
And there are functions for which good formulas are hard to come by, for example the lgamma_stirling_diff function I wrote (defined as the difference between lgamma and it’s Stirling approximation). Here any tests actually involving lgamma are weak because the interesting use cases for lgamma_stirling_diff are when lgamma is large and we would need more digits to capture the desired precision. I believe you may come up with some clever way around this, but using precomputed values let me to have a reasonably strong test quickly.

Using Arb might well be a better way forward in the long run, but I personally lack the skill and patience to work with C/C++ more than I have to. The cloud Mathematica is a good enough tool for me in that it lets me iterate quickly (and I already use it to check my symbolic manipulations). If someone integrated Arb with the test code in Stan math, than it would be a big advantage for Arb as the tests wouldn’t need to rely on external code. But I don’t think that has been done yet.

You seem to imply that Mathematica can provide wrong results in cases where Arb would work. That would be a good reason to prefer Arb over Mathematica. I tried to quickly Google for some evidence in this regard and didn’t find any.


Where I think I would agree with you is that the unit tests of Stan Math emphasize whether code compiles and whether the code is correct, as opposed to whether the code produces the correct numbers for almost all inputs that are likely to be called when doing dynamic HMC. And I wrote a bunch of unit tests starting at


that did what I thought was an admissible test for each of the univariate _lpdf functions, namely testing the identity that their antilog integrates to 1 over their entire support and the derivatives with respect to all of the parameters are zero. Fortunately, most of our _lpdf functions satisfied this after some tweaking. I couldn’t do exactly that for the _lpmf functions, and I am not surprised that you encountered some issues with the negative binomial that should be addressed.

But I am bitter that the success of this approach has not changed anyone’s mind about how to unit test functions in Stan Math more generally. I have made what I think are strong arguments about this for years, and they have mostly been ignored. I appreciate that you did take the time to make some good-faith arguments for your approach, although I am not that persuaded by them. I’m just going to let it drop for now. If you make a PR with tests like this, I won’t object and I haven’t objected to anyone else’s PRs on these grounds.

@Bob_Carpenter joined:

I think you may have misinterpreted my earlier comments. I think this is a good idea, just not always necessary if we can do thorough tests versus well-defined base cases like the lgamma from the standard library.

To really get traction on this, it should be automated to allow us to plug in two functions and a range of values and make sure all derivatives have the same value. I can help write that. It’d look something like this:

template <class F, class G>
expect_same(const ad_tolerances& tol, const F& f, const G& g,
         Eigen::VectorXd x);

where the vector packs all the arguments to f or g. We cold get fancier and put in a parameter pack for x and try to use the serialization framework to reduce it to a vector to supply to our functionals.

As @martinmodrak points out,

That’s why I included tolerances.

Wouldn’t we need to write all our functions to deal with their data types? Or recode them?

I don’t get this comment. From my perspective, both @martinmodrak and @bgoodri are suggesting new, complementary tests that go beyond what we have now. If people want to put in effort making tests better, more power to them!

I don’t have any opinion about Arb vs. Mathematica in terms of coverage or accuracy. I’d never heard of Arb before @bgoodri brought it up. The problem I have is that I can’t read that Mathematica code. So the discussion we should have is whether everyone’s OK introducing one more language into the project after OCaml and C++.

On a related note, the thread on tests for distributions discusses possibilities for strengthening current tests for distributions, including precomputed values but also using complex-step derivative.

The thread on tolerances at `expect_near_rel` behaves weirdly for values around 1e-8 is also relevant.

1 Like

This would IMHO be a highly useful tool.

We already have a bunch of precomputed values lying around individual tests and nobody is worried that there is no code to recompute them, so having some code is IMHO a strict improvement. But I understand the concern and especially if we ended up building substantial support infrastructure in Mathematica, this might eventually become an issue.

I think having some precomputed high precision result is very valuable. In one of Martin’s PRs there was a computation that looked to me to be easily simplifiable. Well, it turned out that there was a specific numerical reason why some redundant-looking ooeration were done that way, as when I simplified the code some tests didn’t pass. I had effectively undone Martin’s hard work!

I’m perhaps stating yhe obvious, but I think that whatever testing approach we take is able to express the requirement we need. If no single approach is enough, we should use complementary strategies to ensure we won’t regress in the future.


I took a stab at generating interesting points for you and also at using Interval arithmetic. It doesn’t attempt to ‘do everything’. It just shows approaches of which people may not be aware.


Note that Interval arithmetic (and other methods that don’t track all the inputs and outputs) don’t take into account possible correlations between numbers in intermediate steps, so the output interval may be wider than what one would sample in practice:

1 Like

Interval arithmetic is fun. And it can be implemented with overloading just like autodiff. It’d be a lot of work for our whole library, though!

What I’d like to know is if there is a C++ library that can track the flow of statistical distributions through my calculations (using templates and operator overloading like the AD ones do), so I could make statements about the reliability of my answers. I’m sure something like the mean and coefficient of variation could be tracked using a forward mode that does calculations like the ISO Guide to Uncertainty in Measurement, but I wonder if there is a better way (other than Monte Carlo analysis, I guess).

1 Like