RFC: Return of the Monorepo

[edited as this describes June 5]

Hi everyone,
As you are all likely aware, sometimes I complain about how my life is made more difficult by our current 3-repo structure (or sometimes I point it out when it affects others) w.r.t. issue tracking, wikis, releases, end-to-end testing of various kinds, etc. After talking with Bob, I have a solid monorepo proposal I want to get comments on (hence this Request For Comments thread).

Options

I think there are two options for reorganizing CmdStan, Stan, and Math into a single repo - One is to essentially have something like the existing CmdStan, Math, and Stan modules intact (i.e. separate makefiles, Jenkins, Travis). The other is to do a proper merge and re-use the pieces. I [now] think we should do the first one as much as possible.

We can reorganize into the following top-level directory structure:

math
language
algorithms
services
cmdstan

Bob points out that dependencies flow downwards in this tree, and there is an operating theory that tacking this restructuring onto a move to a monorepo benefits from economies of scale so we might as well do a code-level directory reorg at the same time as our repo reorg.

Operationalized

Specifically, this would mean replacing the lib/stan_math submodule with all of the files from math, ideally with all of their historical commits, moving it and the other Stan directories into the appropriate places (perhaps all under a src directory?), and add a new directory for CmdStan with the same git fu in the top-level.

We will also need to run or write scripts to migrate over issues, PRs, and wikis from other repos into the monorepo. We will use this technique to merge git repos while keeping history, though this will rewrite the repo and force people to force pull.

We will keep the build system (makefiles) and Jenkinsfiles separate, though we’ll have to merge the travis files (see below).

Places where this will affect workflow

Permissioning / approved reviewers

We’ve been wanting to be able to assign Michael as algorithms Tsar for a while now but no one has taken the time to dive into github’s directory-based permissioning[1], which should support this use case and our monorepo use case quite nicely. We may even get to fine tune some additional painpoints we’ve been having (around wanting multiple reviewers for a PR that touches multiple distinct parts of the code).

Developer environment

The current situation for most people is to check out cmdstan recursively, giving us stan checked out under cmdstan and math under stan/lib/stan_math. We then often muck about with make stan-revert, make math-update, and related ilk but these scripts have some rough edges and I think we’ve all been burned by what turned out to be a partial clean or checkout. And obviously any PRs that do legitimately span these 3 repos will now be much, much easier - we should be able to avoid the very mild catastrophe that was the cvodes -> sundials merge as well as letting us much more easily test that a cross-repo PR works.

Github issues

We’ll need to copy all of the issues over from the other repos to the Stan repo, and we’ll probably want new tags like math and cmdstan automatically applied to the correct incoming issues. Henceforth we can use those categories (and existing ones like language or algorithms) from an integrated system instead of linking out to other repos and messing around with Chrome extensions to move issues from one repo to another.

New developers

One of the original motivations for splitting out the repos was (reportedly) to make it easier for newcomers to just checkout and develop on the part of Stan they wanted to change. But it turns out that most people’s first task involves adding something to the language, which usually involves adding something to both Math and Stan/lang. This benefit did not materialize and I think a monorepo structure is actually much easier to deal with for new people (think of the number of people who commented to say our manuals and wikis all needed to mention that a recursive git checkout is required, for example).

End-to-end testing

Whenever you want to test Stan end-to-end, you need to spend a lot of time fiddling with your submodules and cleaning in order to make sure you actually testing and comparing the right versions, and if you check out two copies this is now 6 git pointers to manage. This should become much easier now and has the nice side effect of making our existing performance testing tools much easier to use.

Builds and CI systems

We’ll keep things federated in separate modules for now - CmdStan, Stan, and Math will retain their sets of makefiles and make/local files as well as their Jenkinsfiles. Travis doesn’t support this so we’ll need to create a merged .travis.yml that triggers different tests for each subdirectory using this trick. It might be pretty annoying to code up so it might be time to finally abandon Travis…

Releases

For those of you unfamiliar with the release process, this should make releases ~3x faster since most of the work there involves scouring through the three repos for issues and PRs that were closed and included in the release, as well as point and click uploading of artifacts, release notes, and doc.

Anyone have thoughts on Option 1 vs 2 (proper merge vs more independent modules?) Other thoughts on this idea? Do you think we should double down and go to 5 repos instead (this was at one point a proposal before we knew about per-directory github permissioning)?

[1] directory-based permissioning - https://help.github.com/articles/about-codeowners/

1 Like

As someone who has been diving into Stan development relatively recently, I would like having monorepo (I had to delete and re-clone repositories multiple times to resolve submodule pains I couldn’t get my head around). The obvious question is: why to stop with CmdStan? Shouldn’t all other interfaces (at least those maintained in stan-dev) be included as well? The discussion of cost and benefits seems almost identical, but maybe I am missing something.

2 Likes

I’m in favor of either option.

The main motivation for the split was requests for having smaller packages. I think the benefit of the simplicity of downloading a complete tarball greatly outweighs the split now.

Can you walk through the plan of how this would happen? It’s going to affect every open branch and pull request across multiple repos.

And the motivation for only including CmdStan as opposed to all the interfaces?

Btw, awesome. Code owners was announced almost a year ago: https://blog.github.com/2017-07-06-introducing-code-owners/

This proposal definitely couldn’t have happen as easily without it.

1 Like

It’s mostly a social reason - CmdStan, Stan, and Math are always released at the same time in lock step by the same person. I think they make more sense as a unit than they do separately, and I think there’s a stronger argument for just these 3 than for all interfaces (especially ones that aren’t part of the stan-dev organization on github - we want to allow people to do whatever they like with Stan!). I think it could make sense to include whichever interfaces are interested in being included in the main repo, but it’s by no means necessary and I don’t think it would actually improve the day-to-day operations since the development workflows and policies (and people) are different between the interfaces and the main three.

I think the plan is to move all of those things over as written about with some vagueness in “Operationalized” above. We can tell people when this cutover will happen with the knowledge that any work done in github during the cutover could get lost.

Thanks. I’d like to see the “Operationalized” section much more flushed out.

I think in the recent months, I’ve seen a lot more developers want to have an end-to-end test than before. Previously, Math improvements were done independently of Stan and CmdStan. In short, you have my support for the idea. Just want to know how this will actually happen.

Btw, if we’re going to be keeping history, I have one request. When we transferred from Google Code, we lost the history. It’s there, but the first commits are not attributed to the correct user due to the committer email being mangled. It’d be really nice to fix this while we’re at it.

I can give this a shot while I’m at it. Do you know who they should be attributed to?

Do you have specific concerns? Like which stuff do you want detail on and how much? I think some of it will be stuff the implementer will discover on the fly and it might be enough to align on goals and constraints rather than specific steps that could be subject to change… but definitely want to address any concerns beforehand especially as they might help flush out our priorities and constraints.

Absolutely! Running git log --reverse in the Stan repo shows emails like this:
bearlee@alum.mit.edu@9a304bd9-dce1-f7c0-8d5c-bb1642157d4e

The part after my email is the original svn hash, I think.

I’m thinking about stuff like:

  • how are pull requests that span multiple repos handled?
  • how do we tell committers and reviewers that they have to be more careful about scoping the contributions more carefully since it can actually affect other repos?
  • who merges?
  • how are PRs triaged?
  • what labels will there be on issues and pull requests?
  • what do we do about all the existing PRs and open issues?
  • will we have one common standard across the mono-repo? If not, then why wouldn’t the other interfaces be part of this?

Not all of that has to be answered. I’d just like to know that the effect of this change has been thought through reasonably before embarking on it. I’m still supportive of it.

You mean migrating PRs that currently span multiple repos into the monorepo? I think a decent answer here would be to just blindly automatically copy and then ask authors of any PRs that span to merge them themselves. I don’t think we have any open right now so that’s good.

We can still use the wiki to communicate with contributors and potential contributors about our processes and guidelines. I’m not sure what you mean here by “affect other repos” - currently they affect other repos but they actually won’t really anymore once we move to a single repo.

I meant the permissions section to address this - we can use CODEOWNERS to say who has permission to merge for a given directory.

How are they triaged now? It shouldn’t change - the only thing changing is that we’re going from 3 versions of history, issues, releases, and wikis to a single one.

That’s in the doc above - we can have labels for the major components (“language,” “algorithms,” etc) we expect the issue or PR to touch. If we’re okay with that I’d propose batch tagging the existing ones with their appropriate repo tag so we retain the original repo designation in case we need it later, if nothing else.

Migrate all branches and issues (open and closed) in a scripted fashion over to the new repo. This might require writing a new script with the web API (or using something existing). I think this will likely end up collapsing the discussion into the text of the original issue rather than as individual comments (since we can’t comment as other people).

I think we pretty much have a common standard across the 3 repos now, yeah (to the extent that we don’t it’s been more because they are in separated repos and tooling like autoformatting needs to be modified for each repo). Do you disagree with that? I’m thinking here of standards like “nearly all pull requests should be reviewed by someone,” “Google style guide and autoformatting is good,” etc. I think this is one reason I view the interfaces as being independent and don’t think we need them in. I think this makes more sense as you try to generalize this to interfaces like ScalaStan, which is developed mostly for and by one company (so far) to their own standards. I think this federated nature is pretty common for wrappers and that they naturally should have less coupling than e.g. Stan and Math naturally have. CmdStan is sort of a default example interface for now, until we make it into ServerStan or whatever, in which case it also goes inside the API dotted line and interfaces just interact with that.

Thanks for thinking this through and writing it up. I’d completely forgotten about things like the issue trackers and history in only thinking about the target workflow.

First, I’m in favor of combining math, stan, and cmdstan into a single repo as you suggest.

Second, I like the proposed directory structure (and would propose not combining the scal/arr/mat flattening into the same move) for the source.

Wiki and README

These need to move, but most of them are already in stan. Those need a housecleaning, too.

License

This is all consistent across these three repos.

Unit tests

Are we going again with something like

src/stan/{math, language, algorithms, services, cmdstan}
test/unit/{math, language, algorithms, services, cmdstan}

I was never particularly happy with that because of lack of parallelism.

Upstream tests

Are there any upstream tests of RStan and PyStan from within Stan at the moment?

Makefiles

Will there be one top-level makefile or multiple ones? Where will they live?

Doc

Same question about where it goes.

What do you mean about parallelism? Any suggestions for alternatives?

I hate make; it doesn’t really have tools for this as far as I know. I propose a top-level Makefile and make folder because I think we can re-use a lot of code. We’ll need new variables for any C++ flags that are CmdStan, Stan, or Math specific. If anyone (esp @syclik) feels like we should keep the Makefiles modular and not have a top-level, I would also be fine with that especially since it’s much easier to implement and makes a logical first step.

Doc I will propose goes in doc/{math, language, algorithms, services, cmdstan} but again, I don’t have a strong belief.

In some sense I am the wrong person to propose the reorganization of the source code and other directories - I really care most about the git repos, issues, releases, and wiki being consolidated and not much about alternative directory structures within that.

I mean that ione has src/stan and the other test/unit whereas they’re both source and they’re both in the Stan hierarchy in some sense.

Do you want to hold this up while figuring out a replacement?

Could you roll this back into a unified proposal? I just think it’ll then be easier to bring other people in and tell them this is what’s happening. I don’t think anyone will object.

I’m not that picky as long as it’s consistent.

I suspect it will impinge on how easy it will be to release just the math library or manage permissions.

If we have src/{math, stan}, test/{math, stan}, doc/{math, stan} etc., then we’re spreading what we’re thinking of as an independent module, namely the math library, throughout the whole repo.

If we have math/{src, doc, test}, stan/{src, doc, test}, etc., then we keep the modules together, but not the functionality of doc, src, testing etc. The goal shold perhaps be to minimize all the complicated relative paths back and forth.

I see about parallelism.

A lot of proposals get screwed by letting perfect be the enemy of good. I bet I could find a circular dependency chain here (and in most of them) pretty easily, too. I actually want to revise my proposal to be more minimal - restructure into the directories you listed and maintain doc, build system, test, etc modularity within each directory. I think I will edit the original post to reflect this and add a note that I did that. This will be easier and makes more sense as a first step - we can always refactor the makefiles later on. We have enough work cut out for us in just merging the history, issues, wiki, and releases without requiring that we totally refactor and fix our build system too. I think this should even extend to Jenkins, though travis doesn’t support a modular system like this so we’ll have to work to combine them into a single travis file.

1 Like

@Bob_Carpenter, we can fix this easily. Let’s open up an issue for this separately. (There were always two considerations: 1) specifying exactly how everything gets laid out and 2) this would wreak havok on your navigation of the tests for a while. I tried not to move files and invocation of tests lightly.)

Clarification on what tools are needed?

I don’t like make. Please replace. But verify that it works before swapping it out. There have been a number of incomplete attempts and dropping in something broken is worse than having make.

I support your updated proposal. The first thing should be about putting together the monorepo with the transition being as smooth as possible for the developers and users. Things in the repos should change minimally and only to support this action.

1 Like

I don’t want to tie monorepo to replacing our build system, and yeah I want to do that carefully if we ever attempt it. There’s a cmake branch on Math that Dan Luu estimated needed another week of work, but it doesn’t seem like a high priority(?)

I was just referring to make’s lack of tools for subprojects. cmake has its own devils but has this built in.

1 Like

Yes, it doesn’t have that capability (at least not built-in). I’m weary of stretching make beyond what it’s good for – that’s caused us a lot of trouble in the past (not make itself).

Exactly why I asked :-) This is really the reason to write functional specs out—just to scope out what’s being done and prioritize it all.

Sounds good. That’s halfway to one of the original extremes of just copying over directories exactly where everything’s at now. We could also just do that and do this whole thing in stages. I think you and @syclik should have a better handle than me on how hard all the build stuff would be to move.

I don’t think it’s even ready for an issue because I don’t know what the best way to do it is. I’m totally OK leaving things as they are now and refactoring into one big repo in stages that are as small as possible.

I think that’s the right decision. I was just trying to clarify the implications of “I hate make” on all of this.

It’s already pretty stretched, but I completely agree that not stretching it further is the prudent course forward.

I’m glad you’ve narrowed down the scope here. My only (not strongly held) suggestion is that the tests should be under src/test 'cause I keep looking for them there.

If we just move current structure, then they’ll be a bit of a mess for a while, because stan puts things under src, but it doesn’t do that for the lib, because it’s brought in like other header libs, just under stan. So the math lib is just stan/math/... and test/math/..., whereas the stan lib is src/stan/math and src/test/unit/math.

I’m just clarifying. I’m all for minimal scope and doing this in the smallest stages that won’t leave us with a hosed process.

I like the idea! As we are in the process to throw things in the air… how about splitting out our libraries into a git submodule (the stuff under math/lib)? If you think that defeats the point of a mono-repo, then I understand. However, I have the impression that checking out and downloading stan would be much faster if we would not have to carry around all the libs all the time.