Standard for reviewing pull requests in the Math library


#1

I’m going to write down some guidelines later this week for standards for both reviewing and writing pull requests for the Math library this week. This should manifest as an updated PR template and wiki in the Math library.

I haven’t written it out yet, but welcome any / all feedback as it gets written. I know that the quality of review seems (or is) arbitrary (for example, see comments or a recent pull request). Some things have to be open to interpretation, but hopefully we can settle on what should be done in every review and what things the reviewer should ask also.

I’ll try to structure it as general guidelines and a (minimal) checklist. Hopefully we can make things clearer for developers, at least at the Math library.


#2

cc @Bob_Carpenter, @wds15, @seantalts.


#3

Thanks—that was me :-)


#4

History

The Stan project started before pull requests were even a thing. We started out on subversion without thinking about branches or permissions on who could and couldn’t commit. We barely had any test coverage. We had no official process for the longest time–and this was ok. We all wanted to be kind to our other teammates and problems didn’t live in the main code base for long. It was Matt, Bob, and myself sharing the same office and if something broke, we could tap on someone’s shoulder and get it fixed immediately (within minutes).

As the project grew, the process evolved. We moved from our own managed subversion server in Andrew’s office to Google Code and finally to git and GitHub. We also split the single Stan project into many pieces, Math being one of them. As more developers joined (@bgoodri, @Marcus_Brubaker, @Maverick, Peter, @betanalpha, and others), we needed the state of the code to be in a trustworthy state so that our work wasn’t affected and our changes didn’t affect the other work. As we discovered new problems, we added new types of tests to prevent that particular problem from popping up again.

All of that’s just to say that our process is something that emerged, may not be as efficient as something designed, but most things are there for a reason.


#5

Goals

As I’m thinking about the standards for pull requests, I’ve had a chance to reflect on what we’re trying to accomplish. First, the Math library needs to be stable. Alone, it’s not much, but the rest of the Stan ecosystem depends on it working: correct and buildable.

Beyond the code, we want the community of developers that support the Math library to treat the project as an open-source project. Individuals contribute to the Math library so it gets better. There isn’t a sense of code ownership – once code makes it into the repository, we can all modify it and are a bit responsible for it. We want discussions to be in the open and in a place where others can follow along or comment. We’ve already had a lot of wonderful contributions from having open dialogs.

Prototypes

Prototypes are great! The default place for discussions on prototypes should be here on discourse. Sometimes there might be an issue already open and discussion can go there. The goal of a prototype is to show a proof of concept. It’s to show that something could possibly work.

Most prototypes aren’t pull request worthy; and that’s ok! There’s a bit more to getting something ready for inclusion in Stan.

Pull Requests

New features, bug fixes, refactors, and essentially any code change comes in the form of a pull request. We would really like it to follow this wiki: Developer process overview (we may adjust this for Math):

  1. Create an issue on GitHub
  2. Create a branch for the issue
  3. Fix the issue on the branch
  4. Create a pull request
  5. Code review
  6. Address code review (if necessary)
  7. Merge into develop

All development is done by individuals that may have different reasons for creating a pull request, but when a pull request is submitted, it’s really about the contribution to the Math library. Once the pull request is in the Math library, the responsibility lies on all the developers (present and future!).

This is why we want:

  • code to be styled consistently (it’s not about your preference, but about reducing the time it takes others to read your code)
  • contributions to be idiomatic to the language – that’s either C++, doxygen doc, make, C preprocessor macros, etc.
  • readability; the code is written once, but will be read by a reviewer and then again some time in the future. It’s not about how it’s written, but how it’s read.
  • documentation to accompany the pull request
  • tests; we don’t want to add technical debt to the Math library with every pull request
  • design to be reasonable. we want someone else in the future to look at the code and understand the design
  • it to build. This is tough. Right now, we support Mac, Windows, and Linux operating systems and g++ (4.6+) and clang++. We actually have users in Mac that use g++ 4.6, 4.9, clang++ 3.0 / 9.0, and some older version of clang++, Windows users that use g++ 4.6, 4.9, and Linux users that use g++ and clang++. As we move into GPU and MPI territory, we have to keep the math library building for all our existing users.

I know the test burden is high for the math library. I’ll try to give some motivation why:

  • We instantiate the code in all the ways that the Stan library could instantiate the Math library. We do this because we’ve encountered many compiler bugs where things that should work don’t. And this might only show up with a particular version or even on a particular OS. I don’t know of any other way to test it other than to instantiate it.
  • To be responsible to other developers and users. We really shouldn’t push code that’s never been exercised in our tests, at least once. You might be good enough to write code that always works, but I’m not. And if it needs to be changed, tests should be there.
  • For maintenance. Whenever we encounter a bug, we fix it by first recreating it. It’s a lot easier adding a test case when a function has been instantiated with the right types than trying to figure out how to instantiate it in the first place. It’s the difference between setting up something in a couple minutes vs taking a couple hours in the future.
  • For stability. We write a lot of unit tests, but we still don’t have much test coverage. More tests allow us to update the codebase quicker while trusting the results.
  • We also check output messages. These actually make it out to the user, so it helps to be consistent and make sure these are ok.

In short, we’re trying to contribute something to the Math library that’s good and will not incur much technical debt. It also helps if the contributor takes time to guide the reviewer by highlighting the design or the major changes.

Code Reviews

Code reviews should be taken seriously and is carried out by developers of the math library. One or more developers have to review pull requests; most often, they might review different parts of the pull request. For example, I might look at the changed build instructions while someone else looks at the template metaprogramming.

When evaluating the code, we should be evaluating it from the point of view of the Math library and defending it. I always assume the contributor has a good reason for putting effort into a pull request – the content of the pull request does something good for the project. The review is a chance for us to evaluate whether that’s true and whether it’s going to be maintainable.

We should be thinking about how easy it is to read the code. Once it’s in, the collective developers have the responsibility of maintaining it. We should think about how hard it would be to replace the code later on. We should be asking if there are easier ways to do things so that it’s less complex. We need to be looking at design. We should also ask if there are enough tests – at least to convince yourself that it’s right (enough) for users to push on all the corners of that code. The litmus test is whether you’d be comfortable hunting down a bug in that code if it were to break.

Code breaks. We’ll deal with it. It’s a matter of how easy or hard it is to deal with later on. I’m never a fan of “we’ll document it later” or “we’ll test it in a future pull request” – that technical debt almost never gets paid off. Even with the best intentions, the individual that contributes code isn’t always the one that fixes it. That’s part of the challenge of open-source software.


#6

P.S. I didn’t think though how this is all going to be stitched together. I just started writing. I’ll try to summarize and get it into wiki + pull request template form.


#7

Thanks for writing this up. I think it’s a good perspective, and I especially appreciate the goals of readability and maintainability.

I’d like to frame this discussion in terms of costs and benefits. All else being equal, more testing is better. But all else isn’t equal. Some of the costs we need to take into account are:

  • time it takes to write the tests delays release of features being tested
  • maintenance of tests as interfaces and other underlying patterns change take time to fix proportional to the size of the test suite (if tests grow linearly and time to fix at big events is proportional to the size of the test suite, overall maintenance cost is quadratic in the size of the test suite)
  • changing rigid tests impedes refactoring
  • user local and CI testing slows down the overall development cycle
  • opportunity cost of not developing other features instead of testing existing features
  • poor test code is hard to maintain, but good test code is more time consuming to write

The right balance for Stan is somewhere between the reliability of autopilot software for airplanes and the usual academic research software, but I’m a little perplexed about how to define where the optimum point is.


#8

Some other things I was thinking about:

Testing

There’s a lot of questions about how much testing. I think I have some clarity now.

We’re writing a lot of templated code. We can’t test every possible instantiation possible (that’s things like other primitives like float or other types like std::complex or even another autodiff library). What we do need to test is that it works for every instantiation that’s possibly generated from the Stan language.

I’m actually going to take that one step further and say we also need to test it with hopes of supporting forward mode. If not, we’ll never get there.

In practical terms, that means for any templated scalar argument, we should be testing for (I’m dropping the stan::math namespace for var and fvar:

  1. double
  2. var
  3. fvar<double>
  4. fvar<var>
  5. fvar<fvar<double>>
  6. fvar<fvar<var>>

It gets to be lots of testing on multiple argument functions. Fortunately, it’s not all possible combinations. If there are multiple arguments, it’s really a combination of double and one of the Stan types, but we don’t mix types. These are the instantiations we should test for two arguments:

  1. (double, double)
  2. (var, double)
  3. (double, var)
  4. (fvar<double>, double)
  5. (double, fvar<double>)
  6. (fvar<var>, double)
  7. (double, fvar<var>)
  8. (fvar<fvar<double>>, double)
  9. (double, fvar<fvar<double>>)
  10. (fvar<fvar<var>>, double)
  11. (double, fvar<fvar<var>>)

Once again, it’s a lot of testing. There are a few reasons we test against all the instantiations when we can:

  • compiler bugs. I really wish this weren’t a problem, but it is. I don’t know of any other way we can reproduce those bugs other than instantiating it. We’ve seen some crazy stuff over the years. The scary bugs are the ones that compile and run, but return non-sensical results. That happens.
  • different behavior based on types. In a lot of our fast autodiff code, a lot of the computation of both the function value and the gradients use double (or really, using the scalar_type of the variable). This means that different instantiations have different behavior and testing under one instantiation doesn’t actually test another.
  • protecting the behavior of something that works. We don’t have a full integration test suite within the Math library. Just a few tests prevents implementations in other parts of the code base from propagating bugs elsewhere.
  • maintenance. It’s a lot easier to track bugs when there is one unit test that’s already there. That can be extended very easily and a new test case can be added. It makes it much easier to refactor code. Code without tests are very hard to change because it’s unclear what behavior needs to be maintained.
  • common courtesy. As a common courtesy to your fellow developers (and that could be yourself in the future), it helps us, as a community, to continue testing everything. As a common courtesy to the users, we shouldn’t ship code that hasn’t been instantiated. Time and time again, I’ve seen code submitted as a pull request that hasn’t been run, even once. As soon as we run the code, it fails. We don’t want it to get out to the user when it fails.

Here’s when I think we don’t have to test all the instantiations (even though it’d be nice to have): when we rely on autodiff for the implementation. That means no switching based on types. We should have tested the underlying operators and functions well enough that we don’t need to do the tests for all the instantiations. Of course, as soon as we start introducing faster code that takes advantage of template metaprograms, we need tests to cover all the instantiations.

Bob’s been working on a test suites that make this a lot easier for the developer to cover these instantiations. At some point, we should design something that serves our purposes. The distribution tests, the way they are now, are more clunky than it needs to be and it’s very difficult to maintain.

Some miscellaneous thoughts on testing:

  • for any function that does non-trivial things to the autodiff stack (nested autodiff, preallocating memory for matrix operators), we should test that the stack sizes are as expected
  • we don’t have a good way to performance test math functions. We need to do that.
  • functions that don’t rely on autodiff should never leave things on the stack.
  • we haven’t made everything thread-safe. I doubt we could just multi-thread everything and it’ll work.
  • we should list what C++ compilers we test on. And what C++ compilers we will try to support (as in, if a user reports a bug with a compiler and it only affects that compiler, we’ll work around the issue).

#9

Thanks—that’s a very helpful list. I agree with pretty much everything on it, but wanted to clarify a couple things.

Should we step back and evaluate whether we have the cycles to ever get there? We haven’t even tried for the ODE solvers.

That’s exactly what I’ve done in the testing framework, with exactly the six types you layout. But so far, I only see three types that’ll see use in Stan with our current algorithms:

  • double: for generated quantities evaluations
  • var: for Euclidean HMC
  • fvar<fvar<var>> for Riemannian HMC

I don’t anticipate using other types.

The one place we got burned with this is with the gamma functions. We didn’t have a differentiable digamma or something like that so it failed to instantiate.

Right—the single autodiff stack would get corrupted. What we’d need to do is make the global storage for reverse mode thread local. It worked the last time I tried it.