Separation of concerns in autodiff tests

A few minor comments:

I like expect_values better, though I guess the point is it does all the double/T combinations to make sure everything is properly templated.

In terms of how Stan is written in terms of prim/rev/fwd/mix it would be more convenient to have individual access to:

expect_ad_var
expect_ad_fvar
expect_ad_fvar_var
expect_ad_fvar_fvar
expect_ad_fvar_fvar_var

But those would be mostly for development I guess.

expect_expression to match the other stuff, or maybe expect_expression_near? This would be quite handy though, thanks.

1 Like

I don’t think I can contribute in useful ways to changes in expect_ad, but if there is future work happening in that area, I’d like to point out a weakness of the current approach (at least as I perceive it). I hope that this adds a meaningful detail without derailing the conversation.

Specifically, if a function throws, then it’s expected that all versions (prim/rev/fwd) throw. This is good, and it has helped us finding and fixing inconsistencies when we switched to expect_ad. However, it would be clearer and safer to have to state whether we expect exceptions to be thrown (assuming a default setting that we do not).

Having that would clarify our expectations when writing tests and not allow bugs in tests themselves to go unnoticed. This has come up recently in https://github.com/stan-dev/math/pull/1698, where the mix tests for quad_form_sym were always throwing (so no real code was being tested) because they were setup incorrectly.

2 Likes

It’s double, var, fvar<double>, fvar<fvar<double>>, fvar<var> and fvar<fvar<var>>.

It goes up to third order and tests both forward and reverse where both are available.

  • third order: fvar<fvar<var>>
  • second order (Hessians): fvar<fvar<double>> and fvar<var>
  • first order (gradients): fvar<double> and var

and it tests both `fvar<varalso tests that the values are the same and that if one version throws an exception, so do all the other versions.

What’s the signature? Would that take in first, second, and third-order derivatives?

Lots more gets tested here. It tests foo(T, T), foo(T, double) and foo(double, T) for all autodiff types T, and for each one it tests values or exceptions match, then if no exceptiosn it tests two forms of gradient, two forms of Hessian, and gradient of Hessian. Basically all the autodiff instantiations we need to do Riemannian HMC.

I think it’d be easier to summarize with templates. What I was proposing was testing all of f(T, T) vs. g(T, T), f(T, double) vs. g(T, double) and f(double, T) vs. g(double, T). Again for all the possible instantiations, checking values, derivatives, and exceptions.

It’s already done for expect_ad. And while it’s a bit tedious, it’s not that difficult.

Isn’t (B) just © done by finite diffs?

The finite diff tests are definitely low precision. Usually only three decimal places for Hessians. The reason I thought that would be reasonable is that as far as I know we don’t have a single specialization of fvar<var> or fvar<fvar<double>>. We could probalby get away with not testing Hessians at all. They shouldn’t go wrong if they’re compilable.

The Hessian and gradient Hessian tests could be made more precise by using finite diffs of gradients rather than double finite diffs. I just didn’t think it’d be worth the effort.

That’s limited to functions we have complex implementations for, right?

I was expecting these to just to be used for double values and everything else to follow automatically from finite diffs. But if we need more precision, this is the only approach.

What signature are you suggesting? We can have expected gradients, Hessians, or gradients of Hessians.

I’m confused becuase (B) just looks like an approach to computing © with finite diffs. how are the goals of (B) and © different?

This is the one I was objecting to on the issue tracker for not covering the foo(T, double) and foo(double, T) cases, which almost always have different implementations for autodiff types T. I don’t see how only testing foo(T, T) is sufficient.

I don’t find this very messy compared to the templating in the actual math lib, but then I’ve been neck deep in templates for years now.

We somehow need to find a way to get the true values for all of our functions and 1st-, 2nd-, and 3rd-order derivatives. Consider a function like matrix multiplication. Even for a (2 x 2) matrix multiplied by a (2 x 2) matrix, the gradient is size 32. And would need to be flattened somehow or the templates get really hairy. But then for Hessians, it’s really big. Right now, the framework does this all automatically, where the exhaustiveness really limits the size of things we can test.

In the past, people have talked about doing reductions, like summing the output of the matrix multiply, so there’s only one output. This might drastically reduce compute time in some of our tests. The issue’s making sure it works.

Another thing to think about that is more important than any of this. Right now, we’re not testing boundaries very well because finite diff doesn’t work at a boundary (we step out both ways in our finite diff). What I did in some tests is compose a constraining transform. It’d be great to have a way to automate this kind of testing.

And then there’s also the idea floating around that we should automate certain kinds of identity checks. As of now, we’re doing that by hand, too, in some places.

There’s an expect_all_throw function that does that. It probably assumed that everything’s already been flattened—I’d have to check.

The AD test framework assumes the primitive tests (not using the framework) have been thoroughly tested for value and exception conditions. So the primitive tests are where to put these and then the same inputs can be used to test autodiff.

I think this is literally just testing case (A), which is now being done implicitly through expect_ad().

I’d particularly like something that only tests reverse mode, because we have a bunch of functions that only work in reverse mode.

Yes, but if the function tested with expect_ad throws because of a typo or else, you won’t be told about that and you may be left without rev/fwd tests for the non-exception cases, as in the bug linked above. I’d rather have the test itself telling me that I’m setting up the test wrongly. :)

Yes, just expect_ad and expect_ad_var would be sufficient for what I’ve needed. Which probably makes it better cause it’s simpler.

1 Like

There’s a lot of useful input on tests in general, thanks for that.

I however have to admit I am slightly frustrated that the discussion almost ignores what I believe are the core questions: How do we make writing test helpers simpler? Do we want test helpers to have clearly separate concerns or do we prefer them to test very broadly and overlap?

So far the only reaction that is IMHO directly relevant to my core concerns is:

Which I interpret as: “We want broad tests and we don’t care a lot for making test helpers simpler to write”. And I am open to this being the best solution, although I remain unconvinced at this point.

I also don’t want this discussion to take more time than it would take to actually implement the test helpers the current (IMHO more tedious and harder to maintain) way. I do care about this as currently it is me implementing those and I have limited time to devote to this. I hope you trust me that I spent some time thinking about this and my proposals are not completely frivolous. Please try to aim for the core of the issue. It is also possible I am just communicating this badly, but I don’t currently see how I can do much better, so I’d like to ask for a charitable re-reading of my proposals.

As I said there are a lot of good points and I agree I missed a bunch of stuff expect_ad is doing, but I think those are minor details (if you disagree, please explain why do you think a particular point is important for the big picture).

Below I react only to stuff I believe touches on the core:

B) and C) overlap only partially, as say the gradients from foo(T, double) and foo(T,T) are never directly compared to each other. Note that I could currently take for example log_sum_exp_vd_vari and multiply its gradient by 1 + 1e-8 (but not do this for the other versions) and expect_ad and the proposed expect_expression_identity would likely still pass for all instantiations as 1e-8 is below most tolerances we use for gradients. The advantage of testing C) by direct comparison between say foo(T, double) and foo(T,T) is that I can have very low tolerance. I agree few bugs are likely to manifest this way, so it is not a big deal. My main goal is in simplifying the testing code and making it more maintainable.

I believe that testing time/memory consumption is currently strongly dominated by compilation, so I wouldn’t worry about this a lot. But please correct me, if I am wrong. As I noted that amount of templates is limiting for some tests we have I guess that reducing templating could actually be a net gain in test time even if we instead construct some large matrices during tests. Both my proposals reduce the number of templates instantiated per test.

If we can put very strict limits on differences between the results (values, gradients, hessians, …) of foo(double, T), foo(T, double) and foo(T,T), then any strong test on foo(T,T) is also a strong test for foo(double, T) and foo(T, double). The proposed expect_instantiations would fail if the differences are anything but negligible. Fruther please note that my Proposal 2 means all instantiations are tested in all cases, but still makes implementing new test helpers easier.

I’m convinced.

From an interface point of view, how about breaking the top level down into two different functions, such as:

expect_ad(f,  ...);

expect_ad_throw(f, ...);

where the former explicitly requires not throwing and the latter explicitly specifies throwing.

1 Like

OK, so since it appears nobody has any opinions, I’ll default to (try to) implement the new test helpers respecting the status quo and will see where that leads.

Hmm, now that I’m looking back at this, can’t this be implemented by just wrapping the current stuff?

void expect_expression_identity(T1 f, T2 g, T3 foo, T4 arg1, T5 arg2) {
  expect_ad([&](auto& arg1, auto& arg2) {
    return f(foo(arg1, arg2)) - g(foo(arg1, arg2));
  }, arg1, arg2);
}

I think in general we want the helpers to be super broad and test everything.

Though when I’m writing code I’d like to break out just the var tests as well, especially for the higher order functions that don’t support anything else and during development. Right now the tests aren’t that modular (in that it’s not easy to pull a little of the functionality out).

Ofc. if tests started running slowly then I’d want to break them out. I think the way you proposed the spit makes sense.

That’s my main motivation. It’d also be useful to test cases we know work differently for reverse mode.

They’re very modular in the code. So it’d be easy to add a flag structure parallel to tolerances that indicated which variable types to test. I was suggesting just making tolerances infinity to turn the tests off, but that’s a hack with not quite the right semantics.

Such implementation would provide only very weak tests. First the finite differences (used in expect_ad) are finicky and it is hard to get big precision out of them. Second, computing the difference and comparing it to zero makes it impossible to enforce consistent precision requirements when the possible values span a large range. For example, let’s say we want to test identity f(x) == 10^x. If f(0) computes 1.001, the difference is 1e-3 and we almost certainly want to mark this as a failure. OTOH if f(11) computes 100000000008, the difference is 8 but for most practical purposes we would probably want to mark this is passing the test.

Got it, amkes sense. I guess it’s not that easy.

That’s how we were gonna get gradients regardless though, right?

The relative error is on very different scales here.

But overall, I understand that we’re going to have to do more than finite differences if we want high precision tests, especially at boundaries/poles.

No, we are gonna get gradients by Stan’s autodiff.

Yes, that was precisely the point I was trying to make. Computing the difference and comparing to zero loses that information and that’s why I don’t want to do it.

Not sure how to best communicate this, but on the one hand, I am happy I get some feedback and that you care, but on the other hand it feels a bit like I am just repeating stuff I already wrote and that my ability to get a point across is very low. Not sure though how to improve this, but wanted to point this out.

I think Bob gets it each time but is just very busy. He and I had a similar error tolerance discussion in another thread, so I know he’s well aware.

2 Likes

Message boards are a challenging medium. Are you waiting on something from me or someone else here?

Yes, I think I get the concern. I do understand the testing framework doesn’t test everything one might want to test and can use improvement or an orthogonal framework to test other things.

In my opinion, if it behaves as you described in the other thread, where the tolerance is relative unless any small numbers are involved and then it becomes absolute… then you’ve basically covered everything.

Not waiting on anything except my own time to move it forward.

The challenging part for me was that I felt like nobody really tried to understand what I propose and almost all the discussion was about something else than the gist of the original post. And for the only questions that I believed were on target, I gave answers (at Separation of concerns in autodiff tests) and then got no further feedback while people continued to respond to what I believe were minute details raised by others. It felt as if my thoughts are invisible. It is well possible I didn’t explain myself well, but it would be great to at least hear something like “(quote) is where you lost me”.

@ChrisChiasson we talked about error tolerance at `expect_near_rel` behaves weirdly for values around 1e-8 and my proposals for change were merged in https://github.com/stan-dev/math/pull/1657 . Please note that the proposal I made here is completely orthogonal to how error tolerance is computed, the error tolerance discussion in this thread started from a tangential concern.

1 Like

I feel that way all the time. One big problem for everyone is just the amount of traffic. But I did read it. I don’t know what else you wanted, which is why I was asking if you needed something else. I still don’t understand from reading this why you want to seperate (A), (B), and ( C) [curse the expansion of (C) into ©]. I completely agree that more of ( C) would be good or even that we could test Hessians vs. finite diffs of gradients. I still don’t know what you mean by the templating being too hard.

We can test expression identity, but it still seems redudnant to me given the other tests. I do think it’d often be useful to see if a function definition matches a simple reference definition, but that’s different.

What’s really helpful in cases like this that are proposing big changes is to write the spec from the user’s point of view. In this case, that’s the function developer’s point of view. That is, work through an example now and see what’s wrong with the current approach from a user’s perspective and what would be better with the new one.

When discussions get derailed, you may also want to go back and look at what you wrote from a user’s point of view. Keep in mind it’s a lot harder for someone else to understand what you’re getting at than it is for you.

The best thing to do for more controlled feedback is to create a design document in stan-dev/design-docs.

1 Like