RFC: Return of the Monorepo

@wds15 — is checkout time an issue? There’s one big up-front cost, then a cost whenever there’s a big update, but I don’t find it slowing anything down myself.

re: math’s libraries as submodules - I think we should actually use git subtrees if we do this, because they’re supposed to be a lot easier to work with and supposedly support, out of the box, things like updating the library while keeping some custom local commits applied over time. I think that just becomes git subtree pull --prefix=lib/boost/ --squash boost v1.66.0. How do people feel about tacking that on to this proposal? It adds some work and thus increases the chance of stalling, but it should be nicer going forward. I’m not sure it really needs to happen at the same time, though.
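For anyone who hasn't used subtrees, here is a minimal sketch of what the import and the later update might look like. It only prints the commands unless DRY_RUN=0 is set. The prefix, remote name, and tag are the ones from the command above; the function names and the `run` helper are made up for illustration, and the remote called boost is assumed to already be configured (its URL is not specified in this thread).

```shell
#!/bin/sh
# Sketch only: prints the git commands by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

PREFIX="lib/boost/"   # directory inside the Math repo, as in the post
REMOTE="boost"        # remote name from the post; assumed to be configured already
TAG="v1.66.0"         # tag from the post

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# One-time import of the library at a tag, squashed into a single commit.
subtree_add() {
  run git subtree add --prefix="$PREFIX" --squash "$REMOTE" "$TAG"
}

# Later upgrade to a newer tag; local commits under $PREFIX are merged with it.
subtree_update() {
  run git subtree pull --prefix="$PREFIX" --squash "$REMOTE" "$1"
}

subtree_update "$TAG"
```

With --squash the upstream history is collapsed into one commit per import, which is why a subtree doesn't drag the library's full history into the repo.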

[edit] subtrees help a little with checkout time in that there’s no history to pull, but they still make a user check out the code. I’m not really sure when someone would check out the Math repo and not need its libraries, though?

[edit2] For what it’s worth I suspect most of the checkout time is the commits in the current math repo that add and remove millions of lines of generated doc with each release :P

It’s probably a good idea to sort the lib issue out in another round, yes. Subtrees sound good to me.

Here are the current sizes of the git repos:

154M	cmdstan.git
261M	math.git
265M	stan.git
680M	total

I do think that limiting the size is a worthwhile goal as this limits resource use… and given how much our codebase is thrown around (testing!!) this is worthwhile to consider.

Looking at the above sizes, I think that the libs add substantially to it (and stan probably has leftovers from being the former main repo, which at some point contained the libs). What I have in mind is to do mini-releases of the libs whenever these change. This way the tagged release can be deployed once to the test systems and then reused each time.

If we could avoid checking in the doc each time we release, that would also be a good improvement. Auto-generated content should not blow up our git sizes. That won’t scale well.

… but back to the original post…

Math is only 183M in actual files, meaning about 80M is git history. Boost is 152M of the actual files. I’m not sure how you’d work with the Math library without boost, though? Like I don’t see how a submodule would address that.
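To make the arithmetic explicit, here is a small sketch using only the numbers quoted in this thread. It assumes the 261M figure for math.git covers both the checked-out files and the history, and that Boost's 152M is part of the 183M of files; the variable names are just for illustration.

```shell
#!/bin/sh
# Sizes in MB as reported in this thread.
cmdstan_git=154
math_git=261
stan_git=265
math_files=183    # Math working tree (assumed to include the bundled libraries)
boost_files=152   # Boost's share of those files

total=$((cmdstan_git + math_git + stan_git))
math_history=$((math_git - math_files))
math_without_boost=$((math_files - boost_files))

echo "total of the three repos: ${total}M"
echo "Math git history: ${math_history}M"
echo "Math files excluding Boost: ${math_without_boost}M"
```

Under those assumptions the history is 78M (the "about 80M" above) and everything in Math other than Boost comes to roughly 31M.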

I already changed the release process to avoid checking in generated files and a lot of other extraneous movement :D But there’s a lot of history of that still that we can get rid of when we switch.

If we have the libs in a submodule (or whatever else mechanism to separate it), then I would have a structure like

~/work/stan-math-base-libs
~/work/stan-mono-1
~/work/stan-mono-2
… etc.

Moreover, we can deploy the stan-math-base-libs a single time to the test-environments (hopefully) and then just point to where it is installed. We should be able to recycle the libraries, no? The libraries are very stable over time in comparison to the rest.
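As a rough sketch of that layout (not an existing mechanism in the Stan build), each checkout could see the single shared copy through a symlink. The lib directory name and the use of symlinks are assumptions for illustration, and a temporary directory stands in for ~/work.

```shell
#!/bin/sh
# Sketch only: builds the proposed layout in a scratch directory.
WORK="$(mktemp -d)" || exit 1   # stand-in for ~/work

# One shared copy of the third-party libraries (empty placeholders here).
mkdir -p "$WORK/stan-math-base-libs/boost" "$WORK/stan-math-base-libs/eigen"

# Each checkout gets a hypothetical lib symlink pointing at the shared copy.
for clone in stan-mono-1 stan-mono-2; do
  mkdir -p "$WORK/$clone"
  ln -s "$WORK/stan-math-base-libs" "$WORK/$clone/lib"
done

ls -l "$WORK/stan-mono-1" "$WORK/stan-mono-2"
```

A test environment could do the same thing: install stan-math-base-libs once, then link it into every fresh checkout.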

I don’t intend to complicate things here, but I am raising it because I think it’s more efficient.

+1

Oh, I see what you mean. It wouldn’t be a submodule in that case; that’s a specific git term referring to something else. I’m not in favor of separating it like that because I think most people will only have one clone of Stan and for those of us with 2 or 3, we can afford the extra 150-300MB.

On Linux there’s often a “close enough” Boost already installed, so I’ve used that before.

We’ve had to post-patch Boost to remove spurious errors. And we have to keep up with updates to Boost and Eigen.

We use submodules for the math lib within stan and for stan within cmdstan. We’ve just been including Boost and Eigen directly.

We haven’t patched Boost recently. Some spurious errors are getting all the way through to R users. Some we can control with compiler flags because they’re lint type. There’s another discourse thread on that where Ben showed people how to config them away in the R Makevars file.

Just to be clear - the libraries are currently included as git submodules.

Just to clarify:
The Stan libraries (stan-dev/stan and stan-dev/math) are submodules. Other libraries like Boost, Eigen, and Sundials are just included in the source repository (not submodules).
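In practice the nesting looks something like the sketch below: cloning cmdstan and initializing submodules recursively brings in stan, and math inside it. It prints the commands unless DRY_RUN=0 is set; the `run` and `fetch_cmdstan` names are made up for illustration, and the GitHub URL is assumed to follow the same stan-dev organization as the repos named above.

```shell
#!/bin/sh
# Sketch only: prints the git commands by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Clone cmdstan into the given directory, then fetch stan and, inside it, math.
fetch_cmdstan() {
  run git clone https://github.com/stan-dev/cmdstan.git "$1"
  run git -C "$1" submodule update --init --recursive
}

fetch_cmdstan cmdstan
```

Boost, Eigen, and Sundials need no extra step here because they are ordinary files inside the math repository.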

Oh, right! My bad. Getting confused, sorry about that. This makes more sense now. :P

Too many uses of the term “library”! I think if all our dependencies were on GitHub, we would have gone with submodules all the way down (at the time).

Really no reason why we can’t have the Stan-compatible trimmed version of Boost in stan-dev/boost as a submodule.

We could, yeah. From what I can tell, the only pro-side of that setup is that someone who has multiple checkouts of Math can point them all to the same Stan-specific boost and therefore save about 150MB per checkout. We shouldn’t, for example, support someone pointing to a local Linux distro-installed Boost that doesn’t have our changes and might not be the right version.

To me the disk space savings on the 2nd+ repo just doesn’t seem worth literally any development overhead - I’d rather mail people USB thumb drives or something…

This often works fine…

I think this only matters if you’re on connections like 10/5 or worse… which realistically is pretty common. The math library is pretty lightweight other than these deps.

That said I haven’t been involved in dealing with Boost/Eigen dev. overhead so I won’t push against the current state there.

Stan’s tied to the BH package in R. We use whichever version of Boost they’re up to. They’ve been good about adding extra components, but I think we’re on our own with the non-header-only ones we’re using for MPI.

What we’ve done in the past is put our own version of a boost header file on the include path ahead of theirs so our header guard prevents theirs from being read. That lets us do things like remove the pesky error messages. I don’t think we’ve been doing that recently, though.

Are there specific problems in the std libs we need to address?

It’s particularly painful around the places where C99 added something but C++03 did not—the namespaces vary a lot with respect to which includes you use and whether things go into the top-level :: or std:: namespaces.

rstan is still doing it, but it has only been 9 commits in four years. Not a big deal, but hopefully can be shrunk when C++11 is turned on.

Okay, I think this deserves to be a separate thread and not tied to the monorepo proposal or implementation. Happy to let whoever feels strongly about this make the case there and propose their own plan for switching to git submodules for Boost or whatever other libraries people care about becoming submodules.