Remove tests for row vectors (from the distribution tests within the math library)

Hi all,

There were a couple off-line conversations regarding removing tests for Eigen::Matrix<T, 1, -1> from the distribution tests in the math library. I wanted to bring the conversation all back and get some feedback from any developers that are interested.

The question: is it ok to remove testing for Eigen::Matrix<T, 1, -1> (what maps to Stan’s row_vector type) from the distribution tests?

(This was from an email between myself and @seantalts)

The overall goal is to test the distributions over every possible instantiation that is possible from the Stan language.

There are two different things it’s testing:

  1. ability to be compiled. We’ve run into some odd compiler behavior in not being able to resolve certain template instantiations.
  2. correctness of the evaluation of the log probability distribution function + gradients. Since we specialize the gradients, bugs can pop up in certain instantiations and not others.

Both of these two things aren’t just hypothetical concerns. We’ve had to do a lot of rewriting templated code to look a different way to get around the first. The second is truly maddening when it happens, and it’s happened. It’s really hard to trace through a bug that only manifests under specific instantiations.

The basic idea was to have an easy way to specify tests with values that should be ok, values that should throw exceptions, and then all we want would be tested: every possible instantiation. If you look at the first line of test/prob/bernoulli/bernoulli_test.hpp, it starts with:
// Arguments: Ints, Doubles

Using that information, we know that there are two arguments, the first can take on {int, std::vector<int>}, and the second can take on {double, std::vector<double>, Eigen::Matrix<double, -1, 1>, Eigen::Matrix<double, 1, -1>, var, std::vector<var>, Eigen::Matrix<var, -1, 1>, Eigen::Matrix<var, 1, -1>}. So here, we want 16 different instantiations. For something like student_t, we need 8^4 = 4096 instantiations.

(I think we’ve gotten good enough where we treat Eigen::Matrix<T, -1, 1> and Eigen::Matrix<T, 1, -1> the same, so maybe we can drop two instantiations (one for var and one for double). But we should instantiate std::vector and Eigen::Matrix; they flex the compiler slightly differently.)

In short, I think the compilation errors should be picked up by a combination of unit tests of the template metaprograms and the instantiation with Eigen::Matrix<T, -1, 1>. (C++ never ceases to amaze me, so perhaps someone could tell me that’s not true.)

Sorry about not keeping this discussion on-list. @Bob_Carpenter reached out and said:

It’s definitely good to get everyone’s opinion on
removing row vector tests. In the end, Daniel (cc-ed)
is in charge of the math lib. If he’s OK removing row
vector tests, I am—Daniel’s been the one most adamant
about test coverage.

I think the way we write the univariate distributions, we’re safe. (There was a point in Stan’s history where we wrote each template specialization by hand; that’s when we could have bugs that only manifest in one of the instantiations. Now that we use the template metaprograms, we should be ok as long as we use the metaprograms correctly.

Is there a way to make them optional (so they could run manually) without bit rot, rather clean them out in case we’re wrong here? I’m guessing no, so maybe just tag the last version they still exist so it’s easy to grab them later if things go south.

1 Like

Here’s the change - for now it’s easy to revert and run the old version if you want. I’m also hoping to rewrite these tests completely in the future to avoid code generation and ideally be faster. I’ll make it easy to change whether you’re testing against row or column vectors or both.

1 Like

Sounds like a win!

We had a discussion in the meeting today where we talked about a few things, including:

  1. Being smarter about what gets tested depending on what’s changed. Bazel was proposed as a potential solution, but we’re not sure it can save us given the way we don’t “include only what we use”
  2. Running the distribution tests only when we merge into develop and auto-reverting. This was rejected for fear of one day causing issues for a developer during the 10 hour window when develop would have something that causes distribution tests to fail but not other tests.
  3. Removing the row vectors from the tests. Everyone except for @betanalpha was on board with this. His objection was that we don’t have anything in place that prevents people from checking in code that doesn’t treat row vectors the same way it treats vectors, and he’s right especially in the cases where we’re introducing new vectorized regressions and other speed enhancements.

I proposed to go ahead and cut the row vectors out and work on adding a weekly test on develop that tests the full suite. I might just include this in my PR to remove the code generation from the distribution tests.

I agree that Bazel requires us to be better about iwyu.

I think it is also the case that the amount of time spent on the distribution tests can be reduced through better values of N_TEST. The default one was chosen to make things work on Windows and with mingw based on g+±4.6. You can go much higher otherwise.

1 Like

@seantalts, thanks for the summary .

I’m glad you wrote it down. I heard that as one of the proposals, but I heard us coming to an agreement with:

  • having an easy ability to test with and without row vectors
  • removing row vectors from testing on each pull request prior to merging
  • testing row vectors on develop as it gets merged, on each merge

That sounds a little different than what you proposed.

Just wanted to make it clear that this isn’t a fear, but a reality. There are plenty of pull requests that would have been merged and have tanked the build, even over the last year, if we didn’t check before it got merged. Not having develop in a good state is actually very disruptive. It’s not the 10 hour window that’s the issue. It’s really the window in which you grab develop. The assumption we have with develop is that it’s in a working, releasable state (with very few, known exceptions). Developers should not have to test the state of develop before working on things.

Relaxing that assumption on the state of develop isn’t really an option when there’s a distributed team.

Yup. We’re going to have to be even more vigilant in checking pull requests. Maybe we can include things in a template somewhere so we don’t forget to check?

I knew about this risk when I said I was ok with removing the tests for row vectors. I think if a developer starts adding specializations that are custom to particular types, it makes sense to require additional tests that flex those types.

Sorry we miscommunicated - we weren’t far off, but I think I failed to convey I’m not planning to do any more work on the legacy code generation framework in the meeting. There’s also the minor difference of whether it’s a weekly check or a on-merge check - I’d be fine with the latter for now.

Regarding these distribution tests more broadly, I agree with you that having develop in a bad state would be disruptive. However, we need to acknowledge that there’s a cost-benefit trade-off that needs to be made, which is I think what we need to work through. My thinking is that we need to weigh:
A. Frequency at which branches with issues that would be caught by the check are merged & then pulled by developers (before the check & revert)
B. Impact of catching & reverting.
C. The costs of these tests, not just in compute resources but in dev time, PR lag, dev engagement, etc.

In my estimation, A + B < C. I think the only way to find out is to run an experiment and measure the impact. We can do this on a small scale with the row vector thing but I would actually prefer to just change the entire full distribution test-suite to on-merge and measure that impact over a trial period of some months.

1 Like

Yup. We’re definitely not far off. And I believe this is the first I’m hearing about not planning to do more work on the legacy code thing.

I’m for merging, but with testing the develop branch with the reverted tests to address @betanalpha’s point.

Ok. Before merging the pull request, can we make sure to have the reverted tests run and in place?

Hmm. The short is, I agree with evaluating the cost-benefit trade-off, but I disagree with your estimation. I think there are two things at play that we shouldn’t conflate:

  1. row vector testing
  2. in general, changing the philosophy that develop should be in a good state to something where it usually is in a good state but it’s acceptable for it to be in a bad state

For row vector testing, I believe it’s ok to relax the tests for pull requests because we’ve reasoned through it. The places we want to catch with the testing framework are:

  • C++ instantiation (which will be covered by the unit tests for the meta traits and the Stan tests that instantiate all the distributions)
  • gradient evaluation for vectorized and non-vectorized instantiations; as long as we use the template metaprograms for building up the gradients (VectorView) for all the instantiations, we will be ok. There isn’t a way to write code that branches on Eigen vectors and row vectors using that construction.

If anyone does write a specialization just for Eigen row vectors, we will need to be vigilant and not allow the distribution test framework be the only test because at that point we would be missing tests for correctness. But I think that makes sense.

So… for the row vector testing, I think we’re good (and we don’t need to call it an experiment).

Here’s where I stand regarding allowing develop to be in a bad state. Here’s some reasoning with the hope that we can have a productive discussion. (I’m not digging in my heels; I’m hoping that by explaining some of the real concerns / problems, we come to a more central place.)

  • In my estimation, B >> C.
    • The impact of having develop not being in a trustworthy state (in order to expedite pull requests or some other reason) limits the ability to have a distributed development community. Currently, we’re all (implicitly or explicitly) responsible for keeping the math library in a dependable, working state. The assumption that when we check out develop, we don’t need to test it because it works. When we add things on top of it, the only places where it could be broken are by the new additions that the developer has added, and that by merging the next state of develop that it will continue to work. To not have that assumption in place is tough. And then there’s also the question of who then is responsible for fixing it?
    • I believe that developer engagement will drop off if the code base can’t be trusted. I think the impact of having a shaky code base is much larger than you believe it is. For the math library, the biggest hurdle is the complicated C++ templating coupled with our meta template library. The dev process is a hurdle, but the continuous integration isn’t the first hurdle people are coming up against. Just imagine the difficulty of getting started with our project if you checked out the codebase for the first time and it didn’t compile. Or it compiled, but gave you answers that you didn’t trust.

There was a time before we had continuous integration; it was just Matt, Bob, and myself in an office where someone would commit (SVN) and it would break the build. That commit would disrupt the other two, but it would get fixed immediately (within an hour, but no reverting). It’s really disruptive when the code base stops working because you’re essentially powerless until it gets fixed. We did that for a while until it killed our productivity. Then we added CI and got serious about having a stable code base. There are just too many points of failure for us not to trust the state of develop.

Anyway, I’m happy to continue this discussion about the general philosophy. We should go ahead with the tests.

Regarding section two in your post above (“allowing develop to be in a bad state”):
I recognize that your estimation is different than mine and think we should attempt to actually measure it, and the only way I can think of is with an experiment.

It’s a little bit ironic because there are a lot of things that can cause develop to break now and fixing them is pretty much incompatible with our current resources without relaxing what gets tested on develop vs. in a PR vs. weekly. Here are some example things that have allowed develop to be in a bad state:

  1. Upstream tests not executing the full upstream test suite
  2. PR tests running against the PR branch rather than against the result of a PR merged with the latest develop
  3. Not running tests against PRs by people not in the Stan github organization

Fixing these ends up running tests more often, but the 2nd one is especially pernicious - the new pipelines do this automatically, but what this implies is that every time develop is updated every outstanding pull request is re-tested. That’s ~150 hours of tests queued up for every merge to develop right now on Math with 15 open PRs.

This is all to sort of argue that AB<C again, but I think rather than argue we should just try moving the full distribution tests to a once-a-week cron job and see if that (plus fixes to the above) actually improves the state of develop.

I’ll suggest we separate the “allowing develop to be in a bad state” to a separate discussion with the appropriate gravitas it deserves. I don’t think we, as a community, should be changing our stance lightly on this point. (We could also experiment further down-stream, like CmdStan or Stan… or wait for a natural experiment because it will happen.)

Umm… I think we have different assumptions. Here’s one of mine that I guess I’ve never really written down or communicated anywhere:

  • pull requests are typically independent. Testing multiple pull requests against develop independently should work almost all of the time. There are times when we need to use common sense and test again after merging develop. And there are definitely times when we should kick off all outstanding pull requests, but that’s what I’ve believed.
  • the Math library develops pretty slowly. I just checked and it looks like we merged 105 pull requests in the last year. (search for is:pr is:merged updated:>=2016-10-24 in the issues). (Whoa… 2.13.0 was less than a year ago.)

I’d ask that you don’t change this yet.

But… if you’re really adamant about it, I’d ask that you don’t update Stan’s submodule until after the tests are passed. That would at least limit the damage.

I forgot! My suggestion:

  • remove row vector tests
  • add testing with full distribution tests on develop (by using the git revert trick)

Currently the Stan module is only updated after all the tests pass on develop and I’d definitely leave that in place.

Math right this second has 6 different pending pull requests on Jenkins, each of which takes something like 8-11 hours depending. There are other builds queued as well.

If you’re okay with not testing PRs as-merged (and testing them as they exist on their branch), I agree that would lighten the testing load considerably at the expense of some chance of develop getting into a bad state. This helps illustrate why I don’t think this is a philosophical stance that our community holds… it seems to me more like trade-offs we have to make under the hood (because no one else wants to touch this stuff).

What if I came up with a quick distribution test version that doesn’t do combinatoric testing but tests just one instantiation of everything. I’ve noticed in test runs in the past that a lot of errors in the distribution tests could have been caught hours earlier by a fast version, so this was on my own roadmap anyway. Would you be comfortable running a 3 month experiment with this in place for PR testing and then running the full combinatoric version on merge to develop? We can evaluate at the end of the period how many times develop got into a bad state from something the full tests would have caught and how many developers actually experienced this state and how bad it was for them at the end, and see if it’s worth the reduced PR lag (and/or literal $/mo if we start using AWS to deal with spiky testing load).

+1. Great. Thanks for confirming.

That’s fine on Math. Are the other builds waiting for Math to finish?

No, we should test as merged against develop. Now… we were testing a little lighter because we didn’t test against every update to develop, just the one at the time the tests are run. Our typical pull request did end up testing against develop because we usually have changes to make, etc. so in practice, a lot of stuff was tested against the right state of develop.

I think this is a much larger discussion we should have and we shouldn’t hold up the pull request on the resolution of this topic.

Can we start a new thread for this topic too? But quickly:

  • the math library needs to check if the distribution:
    • compiles (instantiates)
    • is correct
  • we can do something like that with what you’ve suggested
  • the alternative is to actually fix the dependency problem. Once upon a time, distribution tests used to use make to manage dependencies and we’d only run the things that needed to run based on changes to the code base. That used to take 30-60 minutes (and most of that was make combing through files on disk). And then when enough things on develop changed, it would need to run through the 8 hours to run the full tests.

If you’re really wanting to do the testing as you suggest, just make sure it’s parameterized so you can always run the full thing. The math library is what everything else is built on. I’m really hesitant to have it become less stable because errors here will propagate.

Anyway, can we end the discussion regarding removing tests for row vectors? Just add the reverted test to the develop branch and we’re good?

Yeah, by default builds queue in the order they were received across repos.

I want to wait until the pipelines are in before I add the full distribution tests (via git revert) to Math’s develop because the pipelines are a lot easier for me to test.

I really do want to do this broader experiment. Is that what the new thread should be for?

PS Here is an example of develop being broken because a PR was tested against an old version of develop and had failures when merged with develop:

(failed because we enabled cpplint on tests in develop after the 649’s initial commits were pushed up and tested).

Yep, but we should have expected it to have broken.

In my mind, this was a policy change: we went from not running cpplint on test files to running them. Once we made that change, all cpplint results needed to be validated again, but not necessarily all of the testing.

Hope that makes sense. I don’t see this as a “normal” pull request. (I had considered bringing up the point that this would affect all other open PRs, but it was already merged and I figured enough people had worked through the impact of the PR.)

I’m working towards an automated setup that runs on autopilot even when we make changes like this. Forcing a person to do PR synchronization like that doesn’t scale well.