Request for Mathematica code review: generating test cases

I’ve been trying to implement/improve some functions in the math library and I’ve found it useful to use the free cloud version of Mathematica to generate test cases. I taught myself the basics of Mathematica/Wolfram Language in the process, so it is quite possible I am missing something basic. Are there Stan users with Mathematica expertise who could double-check that I did not mess up?

Tagging users that have talked about Mathematica here recently:
@vincent-picaud, @benwhite59, @sakrejda

The most polished case I have now is for lbeta (the logarithm of the beta function). The code is below, but can also be run from the notebook at: https://www.wolframcloud.com/obj/martin.modrak/Published/lbeta.nb
Here, I provide some additional comments to clarify what I intended. Overall, I am generating C++ code that gets pasted into the test file. I want a gold standard for both function values and derivatives w.r.t. all arguments, which I hope Mathematica’s arbitrary-precision arithmetic can give me.

lbeta[x_, y_] := LogGamma[x] + LogGamma[y] - LogGamma[x + y]
(* Set (=) rather than SetDelayed (:=) so the symbolic derivatives
   are computed once, at definition time *)
dlbetadx[x_, y_] = D[lbeta[x, y], x];
dlbetady[x_, y_] = D[lbeta[x, y], y];

singleTest[x_, y_] := Module[{val, cdx, cdy},
  val = N[lbeta[x, y], 24];
  (* Evaluating derivatives for large arguments exceeds memory, avoid it *)
  cdx = If[x > 10^6 || y > 10^6, "NaN", CForm[N[dlbetadx[x, y], 24]]];
  cdy = If[x > 10^6 || y > 10^6, "NaN", CForm[N[dlbetady[x, y], 24]]];
  WriteString[out, "  {", CForm[x], ",", CForm[y], ",",
    CForm[val], ",", cdx, ",", cdy, "},"]
];

(* Wolfram Cloud limits the number of elements you can output, so writing to file.  *)
out = OpenWrite["lbeta_test.txt"];

xs = SetPrecision[{0.00000001, 0.004, 1, 1.00000001, 5.0, 23.0, 19845.0}, Infinity];
ys = SetPrecision[{0.00000000001, 0.00002, 1.000000000001, 0.5, 2.0, 1624.0}, Infinity];
WriteString[out, "std::vector<TestValue> testValues = {"];

For[i = 1, i <= Length[xs], i++,
  For[j = 1, j <= Length[ys], j++,
    singleTest[xs[[i]], ys[[j]]]
  ]
]
WriteString[out,"};"];
Close[out];
(* I couldn't find a way to access the file on the cloud, so I print it
   to the output, which avoids most of the fragmentation *)
FilePrint[%]  (* Close returns the file name, so % is "lbeta_test.txt" *)

The trickiest part is that I want to compute in arbitrary precision, but I need to start from a representable floating-point number. This avoids false positives in tests where the C++ implementation actually computed a correct high-precision output, but the difference between the exact input and the nearest representable double is large enough to be non-negligible. An example to illustrate the issue:

y = 1;
N[lbeta[1 + 10^-12,y], 22]
  -9.999999999995000000000*^-13

N[lbeta[1.000000000001,y], 22]
  -1.00009*^-12

N[lbeta[SetPrecision[1.000000000001,Infinity],y], 22]
  -1.000088900581840922709*^-12

I currently test for differences up to 10^-16, so the first variant triggered a false positive in the test.
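To make this reusable, a small helper can do the rounding explicitly (just a sketch; the name toRepresentable is mine, not anything standard):

(* Sketch: round to the nearest machine double first, then treat that
   double as an exact rational for the high-precision computation *)
toRepresentable[x_] := SetPrecision[N[x], Infinity]

N[lbeta[toRepresentable[1 + 10^-12], y], 22]
  (* gives the same result as the third variant above *)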

Thanks for any input on the code.

I know a little bit of Mathematica but no more than you do. I would just reiterate a couple of things that I have said before that few people agree with:

  1. I think we should be testing that known identities hold. For example, since \ln \Gamma(x) = \ln \Gamma(x + 1) - \ln x for all positive x, plug an x into that equation and verify that the equation holds and that all derivatives of interest are the same on both sides (see the sketch after this list).
  2. Even better are identities that hold for integrals involving the function in question over a range of values of x, such as http://functions.wolfram.com/GammaBetaErf/LogGamma/21/02/01/0001/MainEq1.gif .
  3. If we want arbitrary precision tests to compare to, then the gold standard is Arb rather than Mathematica, because the ball arithmetic has bounds that are correct. And Arb is already in C.
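In Mathematica terms, point 1 might look something like this minimal sketch (the test points are arbitrary):

(* Sketch: verify LogGamma[x] == LogGamma[x + 1] - Log[x] and the
   matching derivative identity at a few arbitrary points *)
diff[x_] = LogGamma[x] - (LogGamma[x + 1] - Log[x]);
ddiff[x_] = D[diff[x], x];
Table[{N[diff[x], 24], N[ddiff[x], 24]}, {x, {1/10, 1, 123/10, 10^6}}]
(* every entry should be zero to the printed precision *)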

@martinmodrak

I would be glad to help with Mathematica usage; do not hesitate to contact me.

I tried to put your example into shape here. Does it solve your problem? Feel free to reuse the code or to ask for refinement/clarification.


A possible alternative to WolframCloud:

As I have an MMA licence at work, I have not really tested it, but I have heard that you can now install the Wolfram Engine for free and use it through Jupyter notebooks.

Maybe this local solution is more convenient than the WolframCloud one.

Thanks @vincent-picaud for looking into this. Your code is very elegant, and I’ve obviously missed out on many things that could make my life easier. However, your code may also be a bit incomprehensible to people who do not know Mathematica, so I have to think about this tradeoff: to what extent do we need people to understand the “meta” code that generates the test values? Thanks a lot, there really is a lot of food for thought for me.

Also thanks @bgoodri for your view. I’ll first address the tone (as I perceived it), then the content. On first reading, your post came across as quite combative, and my impression was a mix of “tests against Mathematica are worse than no tests” and “don’t write any tests until you learn how to do it properly”. I believe this was not your intention, and that my reading was at least partially an artifact of my state of mind at the time. I nevertheless wanted to highlight that this is how some people might read it, which might (or might not) be a useful data point for you.

For the content:

I am trying to approach this from a pragmatic viewpoint. I think developer efficiency in writing tests is really important. My interest in this was sparked by the fact that many (most?) functions/codepaths in Stan math currently have no, or only very weak, tests for numerical precision.

IMHO you are making the perfect the enemy of the good here. I totally agree that tests against precomputed values shouldn’t be the only tests we do; that’s why I also wrote some tests against identities for the lbeta PR. But in my experience, formulas and precomputed values are complementary. In my recent struggle with neg_binomial_2 and surrounding functions, both approaches helped me identify some of the issues.

For me, the biggest advantage of precomputed values is that such tests are quite easy to create. Let’s say I fear that lbeta fails with one very large and one very small argument. So I precompute values for this case in high precision and check, as in the sketch below. Not bulletproof, but quick. I am sure a clever formula-based test would catch this as well, but I can’t come up with one quickly.
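For example, reusing the lbeta definition from above (the magnitudes here are arbitrary, just a sketch):

(* Sketch: a gold-standard value for one huge and one tiny argument;
   exact inputs keep all 24 printed digits trustworthy *)
N[lbeta[10^15, 10^-15], 24]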

Another issue with formulas is that I always have to worry about the numerics of the formula itself.
And there are functions for which good formulas are hard to come by, for example the lgamma_stirling_diff function I wrote (defined as the difference between lgamma and its Stirling approximation). Here, any test actually involving lgamma is weak, because the interesting use cases for lgamma_stirling_diff are when lgamma is large, and we would need more digits than a double has to capture the desired precision. You may well come up with some clever way around this, but precomputed values gave me a reasonably strong test quickly.
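For illustration, reference values for it are still easy to get in high precision (a sketch; I am assuming the usual Stirling approximation (1/2) Log[2 Pi] + (x - 1/2) Log[x] - x, which may not match the exact definition used in Stan math):

(* Sketch: lgamma minus its Stirling approximation; arbitrary
   precision sidesteps the catastrophic cancellation that a plain
   double computation would hit here *)
lgammaStirling[x_] := (1/2) Log[2 Pi] + (x - 1/2) Log[x] - x
lgammaStirlingDiff[x_] := LogGamma[x] - lgammaStirling[x]
N[lgammaStirlingDiff[10^8], 24]  (* roughly 1/(12 x), about 8.3*10^-10 *)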

Using Arb might well be a better way forward in the long run, but I personally lack the skill and patience to work with C/C++ more than I have to. The cloud Mathematica is a good enough tool for me in that it lets me iterate quickly (and I already use it to check my symbolic manipulations). If someone integrated Arb with the test code in Stan math, then that would be a big advantage for Arb, as the tests wouldn’t need to rely on external code. But I don’t think that has been done yet.

You also seem to imply that Mathematica can produce wrong results in cases where Arb would work. That would indeed be a good reason to prefer Arb over Mathematica, but I tried to quickly Google for evidence of this and didn’t find any.


The above really is just to clear my mind, as I had some negative emotions about your initial reply. I am not sure this thread should be turned into a discussion about best practices for tests in Stan math. If you think such a discussion is warranted, it might be better to start a new thread. I briefly considered starting one myself, but I am not sure you want to be dragged into such a discussion at this point, or that it is a useful topic to spend dev attention on right now, so I’ll let you make the call :-) :evading-responsibility-emoji:

Thank you @martinmodrak for your feedback.
Yes, MMA code looks weird (even though it is a tool I use regularly).
My code is only a suggestion, reflecting the solution I would have used myself in your situation.
If you decide to use the code, I can clarify the implementation and/or modify it to better fulfill your needs. Just ask.

If someone were to make an argument of the form “tests like X are worse than no tests”, I hope I would be the first person to post that their argument was wrong, and to do so in a way that upsets them. Hopefully, I won’t be responding to myself. The closest I have come to saying something like that was back in 2014, when I said

I think the benefit of the current function_signatures tests is not worth the time, electricity, wear on spinning hard disks, etc. I hate these tests more than I hate pennies, which I would probably value at no more than 1/10th of a cent.

but that was, IMO, a solid argument based on opportunity cost, although it later came to light that there were some bugs that could only be caught if we actually did a full compilation of the functions being called.

Where I think I would agree with you is that the unit tests of Stan Math emphasize whether the code compiles and whether it is correct in principle, as opposed to whether it produces accurate numbers for almost all inputs that are likely to arise when doing dynamic HMC.

that did what I thought was an admissible test for each of the univariate _lpdf functions, namely testing the identity that their antilog integrates to 1 over their entire support and that the derivatives of that integral with respect to all of the parameters are zero. Fortunately, most of our _lpdf functions satisfied this after some tweaking. I couldn’t do exactly that for the _lpmf functions, and I am not surprised that you encountered some issues with the negative binomial that should be addressed.
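To make the identity concrete, here is a sketch with the normal density standing in for an _lpdf (symbolic here; the actual tests work numerically on the C++ functions):

(* Sketch: the antilog of an _lpdf should integrate to 1 over its
   support, and the derivative of that integral w.r.t. each parameter
   should be 0 *)
dens[x_, mu_, sigma_] := PDF[NormalDistribution[mu, sigma], x]
assum = sigma > 0 && Element[mu, Reals];
Integrate[dens[x, mu, sigma], {x, -Infinity, Infinity},
  Assumptions -> assum]  (* 1 *)
Integrate[D[dens[x, mu, sigma], mu], {x, -Infinity, Infinity},
  Assumptions -> assum]  (* 0 *)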

But I am bitter that the success of this approach has not changed anyone’s mind about how to unit test functions in Stan Math more generally. I have made what I think are strong arguments about this for years, and they have mostly been ignored. I appreciate that you did take the time to make some good-faith arguments for your approach, although I am not that persuaded by them. I’m just going to let it drop for now. If you make a PR with tests like this, I won’t object and I haven’t objected to anyone else’s PRs on these grounds.

I think you may have misinterpreted my earlier comments. I think this is a good idea, just not always necessary if we can do thorough tests against well-defined base cases like lgamma from the standard library.

To really get traction on this, it should be automated to allow us to plug in two functions and a range of values and make sure all derivatives have the same value. I can help write that. It’d look something like this:

template <class F, class G>
void expect_same(const ad_tolerances& tol, const F& f, const G& g,
                 const Eigen::VectorXd& x);

where the vector packs all the arguments to f or g. We could get fancier and put in a parameter pack for x and try to use the serialization framework to reduce it to a vector to supply to our functionals.

As @martinmodrak points out,

That’s why I included tolerances.

Wouldn’t we need to write all our functions to deal with Arb’s data types? Or recode them?

I don’t get this comment. From my perspective, both @martinmodrak and @bgoodri are suggesting new, complementary tests that go beyond what we have now. If people want to put effort into making tests better, more power to them!

I don’t have any opinion about Arb vs. Mathematica in terms of coverage or accuracy. I’d never heard of Arb before @bgoodri brought it up. The problem I have is that I can’t read that Mathematica code. So the discussion we should have is whether everyone’s OK introducing one more language into the project after OCaml and C++.

But @martinmodrak, you’re the one who opened this can of worms by saying that our current tests were not best practice because they were low precision. I wasn’t offended by that, by the way—I’m totally open to having people improve things I did. That’s how we make the project better.


I was trying to make the narrow point that this thread is not a good place for a general discussion of tests, but I see that my sentence was ambiguous. Anyway, I’ve moved the discussion to Discussing tests in Stan math so that it is obvious people are discussing tests; please continue the general discussion of tests (if you wish) in the new thread. Discussion specific to using Mathematica for precomputed values should IMHO stay here.
