[edited as this describes June 5]
Hi everyone,
As you are all likely aware, sometimes I complain about how my life is made more difficult by our current 3-repo structure (or sometimes I point it out when it affects others) w.r.t. issue tracking, wikis, releases, end-to-end testing of various kinds, etc. After talking with Bob, I have a solid monorepo proposal I want to get comments on (hence this Request For Comments thread).
Options
I think there are two options for reorganizing CmdStan, Stan, and Math into a single repo. One is to keep something like the existing CmdStan, Math, and Stan modules intact (i.e. separate makefiles, Jenkins, Travis). The other is to do a proper merge and re-use the pieces. I [now] think we should do the first one as much as possible.
We can reorganize into the following top-level directory structure:
math
language
algorithms
services
cmdstan
Bob points out that dependencies flow downwards in this tree, and there is an operating theory that tacking this restructuring onto a move to a monorepo benefits from economies of scale so we might as well do a code-level directory reorg at the same time as our repo reorg.
Operationalized
Specifically, this would mean replacing the lib/stan_math submodule with all of the files from math, ideally with all of their historical commits, moving it and the other Stan directories into the appropriate places (perhaps all under a src directory?), and adding a new top-level directory for CmdStan with the same git fu.
We will also need to run or write scripts to migrate over issues, PRs, and wikis from other repos into the monorepo. We will use this technique to merge git repos while keeping history, though this will rewrite the repo and force people to force pull.
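For concreteness, one well-known way to bring another repo in under a subdirectory with its commits intact is git's subtree merge. This may well not be the technique linked above (a subtree merge adds history rather than rewriting it), so treat it as a sketch only. The helper just builds the command list without running anything; the remote name, URL, and branch in the usage part are placeholders.

```python
import shlex


def subtree_merge_commands(name, url, branch, prefix):
    """Return the git commands (as argv lists) for a subtree merge.

    Nothing is executed here; the caller can inspect or run them.
    """
    ref = f"{name}/{branch}"
    return [
        # Register the other repo as a remote and fetch its history.
        ["git", "remote", "add", "-f", name, url],
        # Record a merge of the two histories without touching any files yet.
        ["git", "merge", "-s", "ours", "--no-commit",
         "--allow-unrelated-histories", ref],
        # Read the other repo's tree into the chosen subdirectory.
        ["git", "read-tree", f"--prefix={prefix}/", "-u", ref],
        # Commit the merge, which now carries both histories.
        ["git", "commit", "-m", f"Merge {name} into {prefix}/"],
    ]


if __name__ == "__main__":
    # Hypothetical values: the URL and branch are placeholders.
    for cmd in subtree_merge_commands(
            "math", "https://example.com/math.git", "develop", "math"):
        print(shlex.join(cmd))
```

In the Stan repo the existing lib/stan_math submodule would have to be removed before anything like this could be run.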
We will keep the build system (makefiles) and Jenkinsfiles separate, though we’ll have to merge the Travis files (see below).
Places where this will affect workflow
Permissioning / approved reviewers
We’ve been wanting to be able to assign Michael as algorithms Tsar for a while now but no one has taken the time to dive into GitHub’s directory-based permissioning[1], which should support this use case and our monorepo use case quite nicely. We may even get to fine-tune some additional pain points we’ve been having (around wanting multiple reviewers for a PR that touches multiple distinct parts of the code).
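To make the CODEOWNERS idea concrete, here is a minimal sketch that renders one owner line per top-level directory in the proposed layout. The handles are placeholders I made up, not real assignments, and the real file could be written by hand just as easily.

```python
# Hypothetical owners: these handles are placeholders, not real assignments.
OWNERS = {
    "/math/": ["@math-reviewers"],
    "/language/": ["@language-reviewers"],
    "/algorithms/": ["@algorithms-tsar"],
    "/services/": ["@services-reviewers"],
    "/cmdstan/": ["@cmdstan-reviewers"],
}


def render_codeowners(owners):
    """Render a {pattern: [owner, ...]} mapping as CODEOWNERS file text."""
    lines = ["# Each line: a path pattern followed by its owners."]
    for pattern, handles in owners.items():
        lines.append(f"{pattern} {' '.join(handles)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_codeowners(OWNERS), end="")
```

With a file like this, a PR that touches both math and algorithms would get reviewers requested from both sets of owners.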
Developer environment
The current situation for most people is to check out cmdstan recursively, giving us stan checked out under cmdstan and math under stan/lib/stan_math. We then often muck about with make stan-revert, make math-update, and their ilk, but these scripts have some rough edges and I think we’ve all been burned by what turned out to be a partial clean or checkout. And obviously any PRs that do legitimately span these 3 repos will now be much, much easier: we should be able to avoid the very mild catastrophe that was the cvodes -> sundials merge, and to test much more easily that a cross-repo PR works.
Github issues
We’ll need to copy all of the issues over from the other repos to the Stan repo, and we’ll probably want new tags like math and cmdstan automatically applied to the correct incoming issues. Henceforth we can use those categories (and existing ones like language or algorithms) from an integrated system instead of linking out to other repos and messing around with Chrome extensions to move issues from one repo to another.
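As a rough sketch of the tagging step, a migration script could build each copied issue like this before uploading it. The function name and the "Migrated from" note are my own invention, and the actual upload (and any handling of comments or assignees) is left out.

```python
# Label to apply to issues copied over from each source repo (names assumed).
SOURCE_LABELS = {"math": "math", "cmdstan": "cmdstan"}


def migrated_issue(source_repo, number, title, body, labels=()):
    """Build the fields for a copy of an issue in the monorepo.

    Keeps the original labels, adds the source-repo label, and notes
    where the issue came from so old links can still be traced.
    """
    new_labels = list(labels)
    tag = SOURCE_LABELS.get(source_repo)
    if tag and tag not in new_labels:
        new_labels.append(tag)
    note = f"Migrated from {source_repo}#{number}."
    return {"title": title, "body": f"{note}\n\n{body}", "labels": new_labels}


if __name__ == "__main__":
    # Hypothetical issue, for illustration only.
    print(migrated_issue("math", 12, "Add foo", "Details", ["feature"]))
```

Issues that already live in the Stan repo get no extra tag here, since they don't move.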
New developers
One of the original motivations for splitting out the repos was (reportedly) to make it easier for newcomers to just check out and develop on the part of Stan they wanted to change. But it turns out that most people’s first task involves adding something to the language, which usually means adding something to both Math and Stan/lang. This benefit did not materialize and I think a monorepo structure is actually much easier to deal with for new people (think of the number of people who commented to say our manuals and wikis all needed to mention that a recursive git checkout is required, for example).
End-to-end testing
Whenever you want to test Stan end-to-end, you need to spend a lot of time fiddling with your submodules and cleaning in order to make sure you are actually testing and comparing the right versions, and if you check out two copies this is now 6 git pointers to manage. This should become much easier now and has the nice side effect of making our existing performance testing tools much easier to use.
Builds and CI systems
We’ll keep things federated in separate modules for now: CmdStan, Stan, and Math will retain their sets of makefiles and make/local files as well as their Jenkinsfiles. Travis doesn’t support this so we’ll need to create a merged .travis.yml that triggers different tests for each subdirectory using this trick. It might be pretty annoying to code up so it might be time to finally abandon Travis…
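Whatever CI we end up with, the per-subdirectory logic could look roughly like the sketch below. It is not the trick linked above. It assumes the five-directory layout proposed earlier and, conservatively, treats the dependency order as a straight chain, so a change in one directory re-tests that directory and everything after it.

```python
# Top-level directories in dependency order: each may depend on those before it.
MODULES = ["math", "language", "algorithms", "services", "cmdstan"]


def modules_to_test(changed_paths):
    """Return the modules whose tests should run for a set of changed files.

    A change in one module re-tests that module and everything after it in
    MODULES. Files outside any module (e.g. top-level docs) trigger nothing.
    """
    first = None
    for path in changed_paths:
        top = path.split("/", 1)[0]
        if top in MODULES:
            idx = MODULES.index(top)
            first = idx if first is None else min(first, idx)
    return [] if first is None else MODULES[first:]


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(modules_to_test(["services/foo.hpp"]))
```

In CI the changed paths could come from something like git diff --name-only against the target branch.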
Releases
For those of you unfamiliar with the release process, this should make releases ~3x faster since most of the work there involves scouring through the three repos for issues and PRs that were closed and included in the release, as well as point and click uploading of artifacts, release notes, and doc.
Anyone have thoughts on Option 1 vs 2 (more independent modules vs a proper merge)? Other thoughts on this idea? Do you think we should double down and go to 5 repos instead (this was at one point a proposal before we knew about per-directory GitHub permissioning)?
[1] directory-based permissioning - https://help.github.com/articles/about-codeowners/