Separation of concerns in autodiff tests

I am currently trying to implement some helper methods that would make it easier to write strong tests for the accuracy of values and derivatives in Stan math. The most powerful test helper right now is the expect_ad function, which serves three purposes:

A) Tests that it is possible to instantiate all versions of the function (with all combinations of double, var, fvar, fvar<var>)
B) Tests whether the gradients and Hessians computed by finite diffs and by autodiff are similar (often with relatively weak tolerances, as finite diffs are finicky). These tests are run for each of the instantiations.
C) (implied consequence of A) and B)) Tests that all instantiations - which often don’t share code - compute the same values, derivatives, and Hessians where applicable.
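To make B) concrete, here is a minimal, self-contained sketch of the kind of finite-difference reference that autodiff gradients get compared against. This is not Stan's actual test code; `finite_diff`, `foo`, and `foo_grad` are illustrative names, and the hand-coded analytic gradient stands in for an autodiff-computed one.

```cpp
#include <cmath>

// Central finite difference: the reference that expect_ad-style tests
// compare autodiff gradients against (illustration only, not Stan code).
template <typename F>
double finite_diff(F f, double x, double h = 1e-6) {
  return (f(x + h) - f(x - h)) / (2 * h);
}

// Example function and its analytic gradient, standing in for an
// autodiff-computed gradient: d/dx exp(2x) = 2 exp(2x).
inline double foo(double x) { return std::exp(2 * x); }
inline double foo_grad(double x) { return 2 * std::exp(2 * x); }
```

Because the finite-difference error is only O(h^2), such comparisons need relatively loose tolerances, which is why the tolerances in B) are weak.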

I am trying to write a novel test helper with the working name expect_expression_identity that would help test that an identity holds for a function. @Bob_Carpenter mentioned it would be preferable to have this helper also test the identity across all instantiations. I don’t think this is optimal, but I was unable to state my concerns clearly in the GitHub issue, so I am writing at more length here.

Why care?

There are multiple test helpers that IMHO should be implemented, and those would pose the same question regarding instantiations, so it is better to resolve it once. I think we definitely want to have expect_precomputed for streamlining tests against precomputed values and expect_complex_step that would test the gradients from all modes against complex-step differentiation. I also remember @bgoodri mentioning tests of integrals of functions, which could also use some test helpers, and possibly a few more would come up.

Current approach

For a hypothetical two-parameter function foo, the current practice would have roughly this testing structure (I hope I don’t mess up the details, but the big picture is hopefully OK):

In expect_ad

  1. foo(var, double) is near foo(double, double)
  2. foo(double, var) is near foo(double, double)
  3. foo(var, var) is near foo(double, double)
  4. gradient of foo(var, double) is near finite diffs of foo(double, double)
  5. gradient of foo(double, var) is near finite diffs of foo(double, double)
  6. gradient of foo(var, var) is near finite diffs of foo(double, double)
  7. foo(fvar, double) is near foo(double, double)
  8. foo(double, fvar) is near foo(double, double)
  9. foo(fvar, fvar) is near foo(double, double)
  10. gradient of foo(fvar, double) is near finite diffs of foo(double, double)
  11. gradient of foo(double, fvar) is near finite diffs of foo(double, double)
  12. gradient of foo(fvar, fvar) is near finite diffs of foo(double, double)
  13. foo(fvar<var>, double) is near foo(double, double)
  14. foo(double, fvar<var>) is near foo(double, double)
  15. foo(fvar<var>, fvar<var>) is near foo(double, double)
  16. gradient and Hessian of foo(fvar<var>, double) is near finite diffs of foo(double, double)
  17. gradient and Hessian of foo(double, fvar<var>) is near finite diffs of foo(double, double)
  18. gradient and Hessian of foo(fvar<var>, fvar<var>) is near finite diffs of foo(double, double)

Current approach - extrapolation

For expect_expression_identity (not implemented yet), following the same approach would entail these tests:

  1. f(foo(double, double)) is near g(foo(double, double))
  2. f(foo(double, var)) is near g(foo(double, var)) (including derivatives)
  3. f(foo(var, double)) is near g(foo(var, double)) (including derivatives)
  4. f(foo(var, var)) is near g(foo(var, var)) (including derivatives)
  5. f(foo(double, fvar)) is near g(foo(double, fvar)) (including derivatives)
  6. f(foo(fvar, double)) is near g(foo(fvar, double)) (including derivatives)
  7. f(foo(fvar, fvar)) is near g(foo(fvar, fvar)) (including derivatives)
  8. f(foo(double, fvar<var>)) is near g(foo(double, fvar<var>)) (including derivatives and Hessian)
  9. f(foo(fvar<var>, double)) is near g(foo(fvar<var>, double)) (including derivatives and Hessian)
  10. f(foo(fvar<var>, fvar<var>)) is near g(foo(fvar<var>, fvar<var>)) (including derivatives and Hessian)
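For the double-only case 1., a minimal sketch of what such a check could look like. All names here are hypothetical, and the tolerance scheme (relative with an absolute fallback near zero) is my choice for illustration, not a settled design:

```cpp
#include <cmath>

// Hypothetical double-only core of expect_expression_identity: checks
// f(x) is near g(x) under a relative tolerance, with an absolute
// fallback near zero so tiny values don't blow up the relative error.
template <typename F, typename G>
bool expression_identity_holds(F f, G g, double x, double tol = 1e-8) {
  double fx = f(x);
  double gx = g(x);
  double scale = std::fmax(std::fmax(std::fabs(fx), std::fabs(gx)), 1.0);
  return std::fabs(fx - gx) <= tol * scale;
}
```

For example, the identity exp(log(x)) == x would be checked as `expression_identity_holds([](double v) { return std::exp(std::log(v)); }, [](double v) { return v; }, 2.5)`.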

I however believe it is impractical to require all test helpers to evaluate all instantiations (the required templating is tedious) and that the implied tests for C) are unnecessarily weak.

Proposed approach 1

We are IMHO better served by writing a separate test that directly tests A) and C) (called expect_instantiations below) and leaving expect_ad to handle B) as a separate problem. In other words, we should separate the concerns of the tests.

The proposed tests would then be:

In expect_instantiations

  1. foo(var, var) equals foo(double, double)
  2. foo(var, var) equals foo(double, var) (including shared derivatives)
  3. foo(var, var) equals foo(var, double) (including shared derivatives)
  4. foo(fvar, fvar) equals foo(double, double)
  5. foo(fvar, fvar) equals foo(double, fvar) (including shared derivatives)
  6. foo(fvar, fvar) equals foo(fvar, double) (including shared derivatives)
  7. foo(fvar<var>, fvar<var>) equals foo(double, double)
  8. foo(fvar<var>, fvar<var>) equals foo(double, fvar<var>) (including shared derivatives and Hessians)
  9. foo(fvar<var>, fvar<var>) equals foo(fvar<var>, double) (including shared derivatives and Hessians)

In expect_ad

  1. gradient of foo(var, var) matches finite diffs of foo(double, double)
  2. gradient of foo(fvar, fvar) matches finite diffs of foo(double, double)
  3. gradient and Hessian of foo(fvar<var>, fvar<var>) matches finite diffs of foo(double, double)

In expect_expression_identity

  1. f(foo(double, double)) is near g(foo(double, double))
  2. f(foo(var, var)) is near g(foo(var, var)) (including derivatives)
  3. f(foo(fvar, fvar)) is near g(foo(fvar, fvar)) (including derivatives)
  4. f(foo(fvar<var>, fvar<var>)) is near g(foo(fvar<var>, fvar<var>)) (including derivatives and Hessian)
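The strict equality in the first list could be checked at the level EXPECT_DOUBLE_EQ works at, i.e. a few ULPs. A sketch of what a ULP-distance check looks like (this assumes IEEE-754 doubles and a two's-complement int64; it is not Stan or googletest code):

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Map a double's bit pattern to an integer scale that is monotonic in
// the double's value, so ULP distance becomes an integer difference.
inline std::int64_t ordered_bits(double d) {
  std::int64_t i;
  std::memcpy(&i, &d, sizeof i);
  return i < 0 ? std::numeric_limits<std::int64_t>::min() - i : i;
}

// Number of representable doubles between a and b (assumes the values
// are close enough that the difference does not overflow).
inline std::int64_t ulp_distance(double a, double b) {
  std::int64_t oa = ordered_bits(a);
  std::int64_t ob = ordered_bits(b);
  return oa > ob ? oa - ob : ob - oa;
}
```

EXPECT_DOUBLE_EQ allows up to 4 ULPs; the same bound could apply to comparing, say, foo(var, var) against foo(double, double).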

The improvement here is not only that we restrict the scope of messy templating and make implementing new helpers easier, but also that we can now test C) very strictly - in expect_instantiations, equivalence up to a few ULPs (as implemented in EXPECT_DOUBLE_EQ) should IMHO be achievable. If this is the case, we can safely assume that if expect_instantiations passes, the much weaker tolerances in expect_ad and expect_expression_identity are also satisfied for all instantiations by transitivity.

Proposed approach 2

If we really want to maintain testing for all instantiations in all cases, some of the simplifications could also be achieved by writing a helper that just evaluates a functor and its derivatives and Hessians for all instantiations and stores the results in some manageable structure (say a vector of values, a matrix of derivatives with NaNs in places where the derivative is not computed for the given instantiation, and a similar 3D structure for the Hessian) so that other helpers just call this and then iterate over the results without templates. I however like this one a bit less, although I admit the reasons are more aesthetic than technical.
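A sketch of the kind of structure I have in mind for proposal 2 (all names hypothetical; NaN marks a derivative the given instantiation does not compute):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result container: evaluate a functor under every
// instantiation once, then let test helpers iterate over the rows
// without touching any templates themselves.
struct instantiation_results {
  std::vector<double> values;                  // one value per instantiation
  std::vector<std::vector<double>> gradients;  // [instantiation][argument],
                                               // NaN where not computed
  // a similar [instantiation][arg][arg] structure would hold Hessians
};

inline bool has_gradient(const instantiation_results& r, std::size_t inst,
                         std::size_t arg) {
  return !std::isnan(r.gradients[inst][arg]);
}
```

A helper like expect_expression_identity would then loop over the rows and compare values and gradients entry by entry, skipping the NaN slots.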

Tagging @Bob_Carpenter and @syclik for comments.
Thanks a lot for reading through this!


A few minor comments:

I like expect_values better, though I guess the point is it does all the double/T combinations to make sure everything is properly templated.

In terms of how Stan is organized into prim/rev/fwd/mix, it would be more convenient to have individual access to:


But those would be mostly for development I guess.

expect_expression to match the other stuff, or maybe expect_expression_near? This would be quite handy though, thanks.


I don’t think I can contribute in useful ways to changes in expect_ad, but if there is future work happening in that area, I’d like to point out a weakness of the current approach (at least as I perceive it). I hope that this adds a meaningful detail without derailing the conversation.

Specifically, if a function throws, then it’s expected that all versions (prim/rev/fwd) throw. This is good, and it has helped us find and fix inconsistencies when we switched to expect_ad. However, it would be clearer and safer to have to state whether we expect exceptions to be thrown (assuming a default setting that we do not).

Having that would clarify our expectations when writing tests and not allow bugs in the tests themselves to go unnoticed. This has come up recently, where the mix tests for quad_form_sym were always throwing (so no real code was being tested) because they were set up incorrectly.


It’s double, var, fvar<double>, fvar<fvar<double>>, fvar<var> and fvar<fvar<var>>.

It goes up to third order and tests both forward and reverse where both are available.

  • third order: fvar<fvar<var>>
  • second order (Hessians): fvar<fvar<double>> and fvar<var>
  • first order (gradients): fvar<double> and var

and it tests both `fvar<var>` and `fvar<fvar<var>>`. It also tests that the values are the same and that if one version throws an exception, so do all the other versions.

What’s the signature? Would that take in first, second, and third-order derivatives?

Lots more gets tested here. It tests foo(T, T), foo(T, double) and foo(double, T) for all autodiff types T, and for each one it tests that values or exceptions match, then, if there are no exceptions, it tests two forms of gradient, two forms of Hessian, and gradient of Hessian. Basically all the autodiff instantiations we need to do Riemannian HMC.

I think it’d be easier to summarize with templates. What I was proposing was testing all of f(T, T) vs. g(T, T), f(T, double) vs. g(T, double) and f(double, T) vs. g(double, T). Again for all the possible instantiations, checking values, derivatives, and exceptions.

It’s already done for expect_ad. And while it’s a bit tedious, it’s not that difficult.

Isn’t (B) just (C) done by finite diffs?

The finite diff tests are definitely low precision. Usually only three decimal places for Hessians. The reason I thought that would be reasonable is that as far as I know we don’t have a single specialization of fvar<var> or fvar<fvar<double>>. We could probably get away with not testing Hessians at all. They shouldn’t go wrong if they’re compilable.

The Hessian and gradient Hessian tests could be made more precise by using finite diffs of gradients rather than double finite diffs. I just didn’t think it’d be worth the effort.
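The idea can be sketched as follows (the hand-coded example gradient stands in for an autodiff one, and all names are illustrative): one central finite difference of a gradient in direction i yields the i-th Hessian column with single rather than double finite-difference error.

```cpp
#include <cstddef>
#include <vector>

// Finite difference of a gradient: each column of the Hessian comes from
// one O(h^2) central difference, instead of the larger error accumulated
// by double finite differences of the value itself.
template <typename Grad>
std::vector<double> hessian_column(Grad grad, std::vector<double> x,
                                   std::size_t i, double h = 1e-6) {
  x[i] += h;
  std::vector<double> gp = grad(x);
  x[i] -= 2 * h;
  std::vector<double> gm = grad(x);
  std::vector<double> col(gp.size());
  for (std::size_t j = 0; j < col.size(); ++j)
    col[j] = (gp[j] - gm[j]) / (2 * h);
  return col;
}

// Example: f(x, y) = x^2 * y has gradient {2xy, x^2}; in practice this
// would be Stan's autodiff gradient rather than a hand-coded one.
inline std::vector<double> example_grad(const std::vector<double>& v) {
  return {2 * v[0] * v[1], v[0] * v[0]};
}
```

At (1.5, 2.0) the first Hessian column of this example is {2y, 2x} = {4, 3}.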

That’s limited to functions we have complex implementations for, right?

I was expecting these to just to be used for double values and everything else to follow automatically from finite diffs. But if we need more precision, this is the only approach.

What signature are you suggesting? We can have expected gradients, Hessians, or gradients of Hessians.

I’m confused because (B) just looks like an approach to computing (C) with finite diffs. How are the goals of (B) and (C) different?

This is the one I was objecting to on the issue tracker for not covering the foo(T, double) and foo(double, T) cases, which almost always have different implementations for autodiff types T. I don’t see how only testing foo(T, T) is sufficient.

I don’t find this very messy compared to the templating in the actual math lib, but then I’ve been neck deep in templates for years now.

We somehow need to find a way to get the true values for all of our functions and 1st-, 2nd-, and 3rd-order derivatives. Consider a function like matrix multiplication. Even for a (2 x 2) matrix multiplied by a (2 x 2) matrix, the gradient is size 32. It would need to be flattened somehow or the templates get really hairy. But then for Hessians, it’s really big. Right now, the framework does this all automatically, though the exhaustiveness really limits the size of things we can test.

In the past, people have talked about doing reductions, like summing the output of the matrix multiply, so there’s only one output. This might drastically reduce compute time in some of our tests. The issue’s making sure it works.

Another thing to think about that is more important than any of this. Right now, we’re not testing boundaries very well because finite diff doesn’t work at a boundary (we step out both ways in our finite diff). What I did in some tests is compose a constraining transform. It’d be great to have a way to automate this kind of testing.
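A minimal example of the composition trick described above (log and exp here are just illustrative stand-ins): to test a function whose domain is x > 0 near its boundary, test the composition with a constraining transform instead, so a two-sided finite diff never leaves the domain.

```cpp
#include <cmath>

// foo has domain x > 0, so a two-sided finite diff at small x can step
// out of the domain. Composing with exp maps any unconstrained y into
// the domain; by the chain rule d/dy foo(exp(y)) = foo'(exp(y)) * exp(y).
inline double foo(double x) { return std::log(x); }
inline double composed(double y) { return foo(std::exp(y)); }

// Central finite difference of the composed function, safe for any y.
inline double composed_fd(double y, double h = 1e-6) {
  return (composed(y + h) - composed(y - h)) / (2 * h);
}
```

At y = -20 this probes foo at x near 2e-9, far closer to the boundary than a direct two-sided finite diff in x could safely go.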

And then there’s also the idea floating around that we should automate certain kinds of identity checks. As of now, we’re doing that by hand, too, in some places.

There’s an expect_all_throw function that does that. It probably assumed that everything’s already been flattened—I’d have to check.

The AD test framework assumes the primitive tests (not using the framework) have been thoroughly tested for value and exception conditions. So the primitive tests are where to put these and then the same inputs can be used to test autodiff.

I think this is literally just testing case (A), which is now being done implicitly through expect_ad().

I’d particularly like something that only tests reverse mode, because we have a bunch of functions that only work in reverse mode.

Yes, but if the function tested with expect_ad throws because of a typo or some other mistake, you won’t be told about that, and you may be left without rev/fwd tests for the non-exception cases, as in the bug linked above. I’d rather have the test itself telling me that I’m setting up the test wrongly. :)

Yes, just expect_ad and expect_ad_var would be sufficient for what I’ve needed. Which probably makes it better because it’s simpler.


There’s a lot of useful input on tests in general, thanks for that.

I however have to admit I am slightly frustrated that the discussion almost ignores what I believe are the core questions: How do we make writing test helpers simpler? Do we want test helpers to have clearly separate concerns or do we prefer them to test very broadly and overlap?

So far the only reaction that is IMHO directly relevant to my core concerns is:

Which I interpret as: “We want broad tests and we don’t care a lot for making test helpers simpler to write”. And I am open to this being the best solution, although I remain unconvinced at this point.

I also don’t want this discussion to take more time than it would take to actually implement the test helpers the current (IMHO more tedious and harder to maintain) way. I do care about this as currently it is me implementing those and I have limited time to devote to this. I hope you trust me that I spent some time thinking about this and my proposals are not completely frivolous. Please try to aim for the core of the issue. It is also possible I am just communicating this badly, but I don’t currently see how I can do much better, so I’d like to ask for a charitable re-reading of my proposals.

As I said, there are a lot of good points and I agree I missed a bunch of stuff expect_ad is doing, but I think those are minor details (if you disagree, please explain why you think a particular point is important for the big picture).

Below I react only to stuff I believe touches on the core:

B) and C) overlap only partially, as say the gradients from foo(T, double) and foo(T,T) are never directly compared to each other. Note that I could currently take for example log_sum_exp_vd_vari and multiply its gradient by 1 + 1e-8 (but not do this for the other versions) and expect_ad and the proposed expect_expression_identity would likely still pass for all instantiations as 1e-8 is below most tolerances we use for gradients. The advantage of testing C) by direct comparison between say foo(T, double) and foo(T,T) is that I can have very low tolerance. I agree few bugs are likely to manifest this way, so it is not a big deal. My main goal is in simplifying the testing code and making it more maintainable.

I believe that testing time/memory consumption is currently strongly dominated by compilation, so I wouldn’t worry about this a lot. But please correct me if I am wrong. Since, as noted, the amount of templating is limiting for some tests we have, I guess that reducing templating could actually be a net gain in test time even if we instead construct some large matrices during tests. Both my proposals reduce the number of templates instantiated per test.

If we can put very strict limits on differences between the results (values, gradients, Hessians, …) of foo(double, T), foo(T, double) and foo(T, T), then any strong test on foo(T, T) is also a strong test for foo(double, T) and foo(T, double). The proposed expect_instantiations would fail if the differences are anything but negligible. Further, please note that my Proposal 2 means all instantiations are tested in all cases, but still makes implementing new test helpers easier.

I’m convinced.

From an interface point of view, how about breaking the top level down into two different functions, such as:

expect_ad(f,  ...);

expect_ad_throw(f, ...);

where the former explicitly requires not throwing and the latter explicitly specifies throwing.


OK, so since it appears nobody has any opinions, I’ll default to (try to) implement the new test helpers respecting the status quo and will see where that leads.

Hmm, now that I’m looking back at this, can’t this be implemented by just wrapping the current stuff?

template <typename T1, typename T2, typename T3, typename T4, typename T5>
void expect_expression_identity(T1 f, T2 g, T3 foo, T4 arg1, T5 arg2) {
  expect_ad([&](const auto& x, const auto& y) {
    return f(foo(x, y)) - g(foo(x, y));
  }, arg1, arg2);
}
I think in general we want the helpers to be super broad and test everything.

Though when I’m writing code I’d like to break out just the var tests as well, especially for the higher order functions that don’t support anything else and during development. Right now the tests aren’t that modular (in that it’s not easy to pull a little of the functionality out).

Ofc. if tests started running slowly then I’d want to break them out. I think the way you proposed the split makes sense.

That’s my main motivation. It’d also be useful to test cases we know work differently for reverse mode.

They’re very modular in the code. So it’d be easy to add a flag structure parallel to tolerances that indicated which variable types to test. I was suggesting just making tolerances infinity to turn the tests off, but that’s a hack with not quite the right semantics.

Such an implementation would provide only very weak tests. First, the finite differences (used in expect_ad) are finicky, and it is hard to get high precision out of them. Second, computing the difference and comparing it to zero makes it impossible to enforce consistent precision requirements when the possible values span a large range. For example, let’s say we want to test the identity f(x) == 10^x. If f(0) computes 1.001, the difference is 1e-3 and we almost certainly want to mark this as a failure. OTOH if f(11) computes 100000000008, the difference is 8 but for most practical purposes we would probably want to mark this as passing the test.
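The two cases in numbers (the f values are the made-up ones from the example above): a raw-difference check gets one of them wrong no matter where the threshold sits, while a relative-error check handles both with a single tolerance.

```cpp
#include <cmath>

// Relative error against the exact value; comparing this to one
// tolerance treats f(0) and f(11) consistently, which a raw difference
// compared against zero cannot do.
inline double rel_err(double approx, double exact) {
  return std::fabs(approx - exact) / std::fabs(exact);
}
```

Here rel_err(1.001, 1.0) is 1e-3, a clear failure at any sensible tolerance, while rel_err(100000000008.0, 1e11) is 8e-11 even though the raw difference is 8.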

Got it, makes sense. I guess it’s not that easy.

That’s how we were gonna get gradients regardless though, right?

The relative error is on very different scales here.

But overall, I understand that we’re going to have to do more than finite differences if we want high precision tests, especially at boundaries/poles.

No, we are gonna get gradients by Stan’s autodiff.

Yes, that was precisely the point I was trying to make. Computing the difference and comparing to zero loses that information and that’s why I don’t want to do it.

Not sure how to best communicate this, but on the one hand, I am happy I get some feedback and that you care, but on the other hand it feels a bit like I am just repeating stuff I already wrote and that my ability to get a point across is very low. Not sure though how to improve this, but wanted to point this out.

I think Bob gets it each time but is just very busy. He and I had a similar error tolerance discussion in another thread, so I know he’s well aware.


Message boards are a challenging medium. Are you waiting on something from me or someone else here?

Yes, I think I get the concern. I do understand the testing framework doesn’t test everything one might want to test and can use improvement or an orthogonal framework to test other things.

In my opinion, if it behaves as you described in the other thread, where the tolerance is relative unless any small numbers are involved and then it becomes absolute… then you’ve basically covered everything.

Not waiting on anything except my own time to move it forward.

The challenging part for me was that I felt like nobody really tried to understand what I propose and almost all the discussion was about something else than the gist of the original post. And for the only questions that I believed were on target, I gave answers (at Separation of concerns in autodiff tests) and then got no further feedback while people continued to respond to what I believe were minute details raised by others. It felt as if my thoughts are invisible. It is well possible I didn’t explain myself well, but it would be great to at least hear something like “(quote) is where you lost me”.

@ChrisChiasson we talked about error tolerance at `expect_near_rel` behaves weirdly for values around 1e-8 and my proposals for change were merged in . Please note that the proposal I made here is completely orthogonal to how error tolerance is computed, the error tolerance discussion in this thread started from a tangential concern.
