RFC: Return of the Monorepo

Absolutely! Running git log --reverse in the Stan repo shows emails like this:
bearlee@alum.mit.edu@9a304bd9-dce1-f7c0-8d5c-bb1642157d4e

The part after my email is the original svn hash, I think.
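For anyone who wants to clean these up during the migration, here is a minimal sketch of splitting such an author string. It assumes the history came through git-svn, which conventionally records authors as `user@<SVN repository UUID>`, so the suffix may be a repository UUID rather than a per-commit hash. The function name is made up for illustration.

```python
import uuid


def split_git_svn_author(raw: str):
    """Split 'user@<uuid>' into (user, UUID); return (raw, None) if no UUID suffix.

    Assumption: the trailing part is a UUID appended by the svn import.
    """
    email, sep, suffix = raw.rpartition("@")
    if not sep:
        return raw, None
    try:
        return email, uuid.UUID(suffix)
    except ValueError:
        # The part after the last '@' is an ordinary domain, not a UUID.
        return raw, None


if __name__ == "__main__":
    author = "bearlee@alum.mit.edu@9a304bd9-dce1-f7c0-8d5c-bb1642157d4e"
    print(split_git_svn_author(author))
```

A plain address such as `someone@example.com` comes back unchanged, so the helper should be safe to run over every author in the log.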

I’m thinking about stuff like:

  • how are pull requests that span multiple repos handled?
  • how do we tell committers and reviewers that they have to scope contributions more carefully, since a change can actually affect other repos?
  • who merges?
  • how are PRs triaged?
  • what labels will there be on issues and pull requests?
  • what do we do about all the existing PRs and open issues?
  • will we have one common standard across the mono-repo? If not, then why wouldn’t the other interfaces be part of this?

Not all of that has to be answered. I’d just like to know that the effect of this change has been thought through reasonably before embarking on it. I’m still supportive of it.

You mean migrating PRs that currently span multiple repos into the monorepo? I think a decent answer here would be to just blindly automatically copy and then ask authors of any PRs that span to merge them themselves. I don’t think we have any open right now so that’s good.

We can still use the wiki to communicate with contributors and potential contributors about our processes and guidelines. I’m not sure what you mean here by “affect other repos” - currently they affect other repos but they actually won’t really anymore once we move to a single repo.

I meant the permissions section to address this - we can use CODEOWNERS to say who has permission to merge for a given directory.
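To make the CODEOWNERS idea concrete, here is a rough sketch that renders such a file from a directory-to-owners mapping. The directory names follow the src/stan/{...} layout floated in this thread, and the team handles are purely hypothetical placeholders, not real teams. The only format detail relied on is that each CODEOWNERS line is a path pattern followed by one or more owners.

```python
def render_codeowners(owners: dict) -> str:
    """Render a CODEOWNERS file: one 'pattern owner [owner ...]' line per entry."""
    width = max(len(path) for path in owners)
    lines = [
        f"{path.ljust(width)} {' '.join(handles)}"
        for path, handles in owners.items()
    ]
    return "\n".join(lines) + "\n"


# Hypothetical mapping; both the paths and the handles are assumptions.
EXAMPLE_OWNERS = {
    "/src/stan/math/": ["@example-org/math-reviewers"],
    "/src/stan/language/": ["@example-org/language-reviewers"],
    "/src/stan/cmdstan/": ["@example-org/cmdstan-reviewers"],
}

if __name__ == "__main__":
    print(render_codeowners(EXAMPLE_OWNERS), end="")
```

Keeping the mapping in one place like this would also make it easy to review who owns what before the move.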

How are they triaged now? It shouldn’t change - the only thing changing is that we’re going from 3 versions of history, issues, releases, and wikis to a single one.

That’s in the doc above - we can have labels for the major components (“language,” “algorithms,” etc) we expect the issue or PR to touch. If we’re okay with that I’d propose batch tagging the existing ones with their appropriate repo tag so we retain the original repo designation in case we need it later, if nothing else.

Migrate all branches and issues (open and closed) in a scripted fashion over to the new repo. This might require writing a new script with the web API (or using something existing). I think this will likely end up collapsing the discussion into the text of the original issue rather than as individual comments (since we can’t comment as other people).
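As a sketch of what the collapsing and batch tagging could look like, the snippet below folds an issue's comments into a single body and adds a label recording the original repo. The data classes, the `repo: <name>` label scheme, and the function name are all assumptions for illustration. The returned keys (`title`, `body`, `labels`) are meant to line up with the fields accepted when creating an issue through the GitHub REST API, but no network call is made here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    author: str
    created: str
    body: str


@dataclass
class Issue:
    repo: str
    number: int
    title: str
    author: str
    created: str
    body: str
    labels: List[str] = field(default_factory=list)
    comments: List[Comment] = field(default_factory=list)


def migrated_payload(issue: Issue) -> dict:
    """Collapse the discussion into one body and tag the original repo."""
    # Names are written without '@' so the migration does not notify anyone.
    header = (
        f"*Originally opened by {issue.author} on {issue.created} "
        f"as {issue.repo}#{issue.number}.*"
    )
    parts = [header, "", issue.body]
    for c in issue.comments:
        parts += ["", "---", f"**{c.author}** commented on {c.created}:", "", c.body]
    labels = sorted(set(issue.labels) | {f"repo: {issue.repo}"})
    return {"title": issue.title, "body": "\n".join(parts), "labels": labels}
```

Because the function is pure, it can be checked against exported issues before anything is written to the new repo.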

I think we pretty much have a common standard across the 3 repos now, yeah (to the extent that we don’t it’s been more because they are in separated repos and tooling like autoformatting needs to be modified for each repo). Do you disagree with that? I’m thinking here of standards like “nearly all pull requests should be reviewed by someone,” “Google style guide and autoformatting is good,” etc. I think this is one reason I view the interfaces as being independent and don’t think we need them in. I think this makes more sense as you try to generalize this to interfaces like ScalaStan, which is developed mostly for and by one company (so far) to their own standards. I think this federated nature is pretty common for wrappers and that they naturally should have less coupling than e.g. Stan and Math naturally have. CmdStan is sort of a default example interface for now, until we make it into ServerStan or whatever, in which case it also goes inside the API dotted line and interfaces just interact with that.

Thanks for thinking this through and writing it up. I’d completely forgotten about things like the issue trackers and history in only thinking about the target workflow.

First, I’m in favor of combining math, stan, and cmdstan into a single repo as you suggest.

Second, I like the proposed directory structure (and would propose not combining the scal/arr/mat flattening into the same move) for the source.

Wiki and README

These need to move, but most of them are already in stan. Those need a housecleaning, too.

License

This is all consistent across these three repos.

Unit tests

Are we going again with something like

src/stan/{math, language, algorithms, services, cmdstan}
test/unit/{math, language, algorithms, services, cmdstan}

I was never particularly happy with that because of lack of parallelism.

Upstream tests

Are there any upstream tests of RStan and PyStan from within Stan at the moment?

Makefiles

Will there be one top-level makefile or multiple ones? Where will they live?

Doc

Same question about where it goes.

What do you mean about parallelism? Any suggestions for alternatives?

I hate make; it doesn’t really have tools for this as far as I know. I propose a top-level Makefile and make folder because I think we can re-use a lot of code. We’ll need new variables for any C++ flags that are CmdStan, Stan, or Math specific. If anyone (esp @syclik) feels like we should keep the Makefiles modular and not have a top-level, I would also be fine with that especially since it’s much easier to implement and makes a logical first step.

Doc I will propose goes in doc/{math, language, algorithms, services, cmdstan} but again, I don’t have a strong belief.

In some sense I am the wrong person to propose the reorganization of the source code and other directories - I really care most about the git repos, issues, releases, and wiki being consolidated and not much about alternative directory structures within that.

I mean that one has src/stan and the other test/unit, whereas they’re both source and they’re both in the Stan hierarchy in some sense.

Do you want to hold this up while figuring out a replacement?

Could you roll this back into a unified proposal? I just think it’ll then be easier to bring other people in and tell them this is what’s happening. I don’t think anyone will object.

I’m not that picky as long as it’s consistent.

I suspect it will impinge on how easy it will be to release just the math library or manage permissions.

If we have src/{math, stan}, test/{math, stan}, doc/{math, stan} etc., then we’re spreading what we’re thinking of as an independent module, namely the math library, throughout the whole repo.

If we have math/{src, doc, test}, stan/{src, doc, test}, etc., then we keep the modules together, but not the functionality of doc, src, testing etc. The goal should perhaps be to minimize all the complicated relative paths back and forth.
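The relative-path trade-off between the two layouts can be sketched with a few lines of Python. The directory names are the ones from the two options above; the helper names are made up for illustration.

```python
import posixpath


def rel(from_dir: str, to_dir: str) -> str:
    """Relative path needed to reach to_dir when standing in from_dir."""
    return posixpath.relpath(to_dir, start=from_dir)


LAYOUTS = {
    # src/{math, stan}, test/{math, stan}
    "by kind": {"math_src": "src/math", "math_test": "test/math", "stan_src": "src/stan"},
    # math/{src, test}, stan/{src, test}
    "by module": {"math_src": "math/src", "math_test": "math/test", "stan_src": "stan/src"},
}


def report(layouts=LAYOUTS) -> dict:
    """For each layout, compute two representative hops back to the math sources."""
    rows = {}
    for name, dirs in layouts.items():
        rows[name] = {
            "math test -> math src": rel(dirs["math_test"], dirs["math_src"]),
            "stan src -> math src": rel(dirs["stan_src"], dirs["math_src"]),
        }
    return rows


if __name__ == "__main__":
    for name, hops in report().items():
        for hop, path in hops.items():
            print(f"{name:10s} {hop:22s} {path}")
```

Grouping by kind gives the shorter hop across modules (`../math` versus `../../math/src`), while grouping by module gives the shorter hop from a module's tests to its own sources (`../src` versus `../../src/math`). So neither layout wins on path length alone.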

I see about parallelism.

A lot of proposals get screwed by letting perfect be the enemy of good. I bet I could find a circular dependency chain here (and in most of them) pretty easily, too. I actually want to revise my proposal to be more minimal - restructure into the directories you listed and maintain doc, build system, test, etc modularity within each directory. I think I will edit the original post to reflect this and add a note that I did that. This will be easier and makes more sense as a first step - we can always refactor the makefiles later on. We have enough work cut out for us in just merging the history, issues, wiki, and releases without requiring that we totally refactor and fix our build system too. I think this should even extend to Jenkins, though travis doesn’t support a modular system like this so we’ll have to work to combine them into a single travis file.

@Bob_Carpenter, we can fix this easily. Let’s open up an issue for this separately. (There were always two considerations: 1) specifying exactly how everything gets laid out and 2) the fact that this would wreak havoc on your navigation of the tests for a while. I tried not to move files and invocation of tests lightly.)

Clarification on what tools are needed?

I don’t like make. Please replace. But verify that it works before swapping it out. There have been a number of incomplete attempts and dropping in something broken is worse than having make.

I support your updated proposal. The first thing should be about putting together the monorepo with the transition being as smooth as possible for the developers and users. Things in the repos should change minimally and only to support this action.

I don’t want to tie monorepo to replacing our build system, and yeah I want to do that carefully if we ever attempt it. There’s a cmake branch on Math that Dan Luu estimated needed another week of work, but it doesn’t seem like a high priority(?)

I was just referring to make’s lack of tools for subprojects. cmake has its own devils but has this built in.

Yes, it doesn’t have that capability (at least not built-in). I’m wary of stretching make beyond what it’s good for – that’s caused us a lot of trouble in the past (not make itself).

Exactly why I asked :-) This is really the reason to write functional specs out—just to scope out what’s being done and prioritize it all.

Sounds good. That’s halfway to one of the original extremes of just copying over directories exactly where everything’s at now. We could also just do that and do this whole thing in stages. I think you and @syclik should have a better handle than me on how hard all the build stuff would be to move.

I don’t think it’s even ready for an issue because I don’t know what the best way to do it is. I’m totally OK leaving things as they are now and refactoring into one big repo in stages that are as small as possible.

I think that’s the right decision. I was just trying to clarify the implications of “I hate make” on all of this.

It’s already pretty stretched, but I completely agree that not stretching it further is the prudent course forward.

I’m glad you’ve narrowed down the scope here. My only (not strongly held) suggestion is that the tests should be under src/test 'cause I keep looking for them there.

If we just move current structure, then they’ll be a bit of a mess for a while, because stan puts things under src, but it doesn’t do that for the lib, because it’s brought in like other header libs, just under stan. So the math lib is just stan/math/... and test/math/..., whereas the stan lib is src/stan/math and src/test/unit/math.

I’m just clarifying. I’m all for minimal scope and doing this in the smallest stages that won’t leave us with a hosed process.

I like the idea! As we are in the process of throwing things in the air… how about splitting out our libraries into a git submodule (the stuff under math/lib)? If you think that defeats the point of a mono-repo, then I understand. However, I have the impression that checking out and downloading stan would be much faster if we did not have to carry around all the libs all the time.

@wds15 — is checkout time an issue? There’s one big up front cost then a cost whenever there’s a big update, but I don’t find it slowing anything down myself.

re: math’s libraries as submodules - I think we should actually use git subtrees if we do this, because they’re supposed to be a lot easier to work with and supposedly naturally support things like updating the library but keep some custom local commits applied through time. I think that just becomes a git subtree pull --prefix=lib/boost/ --squash boost v1.66.0. How do people feel about tacking that on to this proposal? It adds some work and thus increases the chance of stalling but it should be nicer going forward. I’m not sure if it really needs to happen at the same time though.

[edit] subtrees help a little with checkout time in that there’s no history to pull, but they still make a user check out the code. I’m not really sure when someone would check out the Math repo and not need its libraries, though?

[edit2] For what it’s worth I suspect most of the checkout time is the commits in the current math repo that add and remove millions of lines of generated doc with each release :P

It’s probably a good idea to sort the lib issue out in another round, yes. Subtrees sound good to me.

Here are the current sizes of the git repos:

154M	cmdstan.git
261M	math.git
265M	stan.git
680M	total

I do think that limiting the size is a worthwhile goal as this limits resource use… and given how much our codebase is thrown around (testing!!) this is worthwhile to consider.

Looking at the above sizes, I think that the libs add substantially to it (and stan probably has leftovers from being the main repo in the past, when it contained the libs at some point). What I have in mind is to do mini-releases of the libs whenever these change. This way the tagged releases can be deployed once to the test systems and then reused each time.

If we could avoid checking in the doc each time we release, then that would also be a good improvement. Any auto-generated content should not blow up our git sizes. That won’t scale well.

… but back to the original post…

Math is only 183M in actual files, meaning about 80M is git history. Boost is 152M of the actual files. I’m not sure how you’d work with the Math library without boost, though? Like I don’t see how a submodule would address that.
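The arithmetic behind those numbers, as a quick sketch. It assumes the sizes listed earlier are for full clones (working files plus history), which is how the 183M figure is being compared here; the variable names are just for illustration.

```python
# Sizes in MB, taken from the figures quoted in this thread.
repo_mb = {"cmdstan": 154, "math": 261, "stan": 265}
math_worktree_mb = 183  # actual files in Math
boost_mb = 152          # Boost's share of those files

total_mb = sum(repo_mb.values())
math_history_mb = repo_mb["math"] - math_worktree_mb
boost_share = boost_mb / math_worktree_mb

if __name__ == "__main__":
    print(f"total across repos: {total_mb}M")
    print(f"math history:       {math_history_mb}M")
    print(f"boost share of math files: {boost_share:.0%}")
```

So the history is 78M (the "about 80M" above), and Boost alone accounts for roughly 83% of the files in Math.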

I already changed the release process to avoid checking in generated files and a lot of other extraneous movement :D But there’s a lot of history of that still that we can get rid of when we switch.

If we have the libs in a submodule (or whatever else mechanism to separate it), then I would have a structure like

~/work/stan-math-base-libs
~/work/stan-mono-1
~/work/stan-mono-2
… etc.

Moreover, we can deploy the stan-math-base-libs a single time to the test-environments (hopefully) and then just point to where it is installed. We should be able to recycle the libraries, no? The libraries are very stable over time in comparison to the rest.

However, I don’t intend to complicate things here, but I am raising it as I think it’s more efficient.

+1

Oh, I see what you mean. It wouldn’t be a submodule in that case; that’s a specific git term referring to something else. I’m not in favor of separating it like that because I think most people will only have one clone of Stan and for those of us with 2 or 3, we can afford the extra 150-300MB.

On Linux there’s often a “close enough” boost already installed, so I’ve used that before.
