RFC: Return of the Monorepo


#21

@wds15 — is checkout time an issue? There’s one big up front cost then a cost whenever there’s a big update, but I don’t find it slowing anything down myself.


#22

re: math’s libraries as submodules - I think we should actually use git subtrees if we do this, because they’re supposed to be a lot easier to work with and supposedly naturally support things like updating the library while keeping some custom local commits applied over time. I think that just becomes a git subtree pull --prefix=lib/boost/ --squash boost v1.66.0. How do people feel about tacking that on to this proposal? It adds some work and thus increases the chance of stalling, but it should be nicer going forward. I’m not sure if it really needs to happen at the same time though.
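For anyone who wants to try the subtree route, here is a minimal Python sketch that assembles the two git subtree invocations involved: a one-time add and the later pull quoted above. The remote name boost, the prefix lib/boost/, and the tag v1.66.0 are copied from the post as-is and have not been checked against the real Boost repository. The helper names subtree_cmd and run are made up for illustration.

```python
import subprocess


def subtree_cmd(action, prefix, remote, ref, squash=True):
    """Build the argv for `git subtree add` or `git subtree pull`.

    `subtree_cmd` is a hypothetical helper, not part of git or Stan tooling.
    """
    if action not in ("add", "pull"):
        raise ValueError(f"unsupported subtree action: {action}")
    cmd = ["git", "subtree", action, f"--prefix={prefix}"]
    if squash:
        # --squash brings in the upstream content as one squashed commit
        # instead of the library's full history.
        cmd.append("--squash")
    cmd += [remote, ref]
    return cmd


def run(cmd, dry_run=True):
    """Print the command; only execute it when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Remote name, prefix, and tag are taken from the post above (assumptions).
first_import = subtree_cmd("add", "lib/boost/", "boost", "v1.66.0")
later_update = subtree_cmd("pull", "lib/boost/", "boost", "v1.66.0")
run(first_import)
run(later_update)
```

Both calls default to a dry run, so nothing touches a repository until dry_run=False is passed from inside a clone that has a remote named boost.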

[edit] subtrees help a little with checkout time in that there’s no history to pull, but they still make a user check out the code. I’m not really sure when someone would check out the Math repo and not need its libraries, though?

[edit2] For what it’s worth I suspect most of the checkout time is the commits in the current math repo that add and remove millions of lines of generated doc with each release :P


#23

It’s probably a good idea to sort the lib issue out in another round, yes. Subtrees sound good to me.

Here are the current sizes of the git repos:

154M	cmdstan.git
261M	math.git
265M	stan.git
680M	total

I do think that limiting the size is a worthwhile goal as this limits resource use… and given how much our codebase is thrown around (testing!!) this is worthwhile to consider.

Looking at the above sizes, I think that the libs add substantially to it (and stan probably has leftovers from having been the main repo in the past, which at some point contained the libs). What I have in mind is to do mini-releases of the libs whenever these change. This way the tagged release can be deployed once to the test systems and then reused each time.

If we could avoid checking in the doc each time we release, then that would also be a good improvement. Any auto-generated content should not blow up our git sizes. That won’t scale well.

… but back to the original post…


#24

Math is only 183M in actual files, meaning about 80M is git history. Boost is 152M of the actual files. I’m not sure how you’d work with the Math library without boost, though? Like I don’t see how a submodule would address that.
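The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch using the numbers quoted in this thread, and it follows the post's reading that the 261M figure for Math covers both files and history. The variable names are illustrative.

```python
# Sizes in MB as reported in the thread ("M" figures).
repo_sizes = {"cmdstan.git": 154, "math.git": 261, "stan.git": 265}
math_working_files = 183  # "actual files" in Math, per the post above
boost_files = 152         # Boost's part of those files

total = sum(repo_sizes.values())
# Assumption from the post: reported size = files + git history.
math_history = repo_sizes["math.git"] - math_working_files
boost_share = boost_files / math_working_files
math_without_boost = math_working_files - boost_files

print(f"total across the three repos: {total}M")
print(f"Math git history: {math_history}M")
print(f"Boost share of Math's files: {boost_share:.0%}")
print(f"Math files without Boost: {math_without_boost}M")
```

On these numbers, Boost accounts for most of Math's working files, which is the point being made about separating it out.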

I already changed the release process to avoid checking in generated files and a lot of other extraneous movement :D But there’s a lot of history of that still that we can get rid of when we switch.


#25

If we have the libs in a submodule (or whatever other mechanism to separate them), then I would have a structure like

~/work/stan-math-base-libs
~/work/stan-mono-1
~/work/stan-mono-2
… etc.

Moreover, we can deploy the stan-math-base-libs a single time to the test-environments (hopefully) and then just point to where it is installed. We should be able to recycle the libraries, no? The libraries are very stable over time in comparison to the rest.

However, I don’t intend to complicate things here, but I am raising it as I think it’s more efficient.

+1


#26

Oh, I see what you mean. It wouldn’t be a submodule in that case; that’s a specific git term referring to something else. I’m not in favor of separating it like that because I think most people will only have one clone of Stan and for those of us with 2 or 3, we can afford the extra 150-300MB.


#27

On Linux there’s often a “close enough” boost already installed, so I’ve used that before.


#28

Aside from the question of where the libraries end up in the tree, having the libraries as a submodule is appealing because that will likely stop thousands of boost or eigen results from showing up when I search the repo from github.com. This is confusing for new contributors who are trying to orient themselves, because the relevant stan results are often buried. From reading old threads, I hear rumblings about patches to boost (and possibly Eigen), but I wonder whether those patches are actually made (or required to be made) within the boost/Eigen sources, or whether they can just add appropriately namespaced declarations and template specializations from stan’s own sources?


#29

We’ve had to post-patch Boost to remove spurious errors. And we have to keep up with updates to Boost and Eigen.


#30

submodules allow pointing to specific versions of remote repos, so updates can be controlled from stan’s side. I noticed you used past tense in “we’ve had to post-patch”. Does that mean it isn’t currently done?


#31

We use submodules for the math lib within stan and for stan within cmdstan. We’ve just been including Boost and Eigen directly.

We haven’t patched Boost recently. Some spurious errors are getting all the way through to R users. Some we can control with compiler flags because they’re lint-type warnings. There’s another Discourse thread on that where Ben showed people how to config them away in the R Makevars file.


#32

Understood; thank you.


#33

Just to be clear - the libraries are currently included as git submodules.


#34

Just to clarify:
The Stan libraries (stan-dev/stan and stan-dev/math) are submodules. Other libraries like Boost, Eigen, and Sundials are just included in the source repository (not submodules).


#35

Oh, right! My bad. Getting confused, sorry about that. This makes more sense now. :P


#36

Too many uses of the term “library”! I think if all our dependencies were on GitHub, we would have gone with submodules all the way down (at the time).


#37

Really no reason why we can’t have the Stan-compatible trimmed version of Boost in stan-dev/boost as a submodule.


#38

We could, yeah. From what I can tell, the only pro-side of that setup is that someone who has multiple checkouts of Math can point them all to the same Stan-specific boost and therefore save about 150MB per checkout. We shouldn’t, for example, support someone pointing to a local Linux distro-installed Boost that doesn’t have our changes and might not be the right version.

To me the disk space savings on the 2nd+ repo just don’t seem worth literally any development overhead - I’d rather mail people USB thumb drives or something…


#39

This often works fine…

I think this only matters if you’re on connections like 10/5 or worse… which realistically is pretty common. The math library is pretty lightweight other than these deps.
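To put a rough number on that, here is a small Python estimate of transfer time. It assumes "10/5" means about 10 Mbit/s down, treats the sizes quoted earlier in the thread as plain megabytes, and ignores protocol overhead and git's compression, so treat it as a ballpark only. The function name download_seconds is made up for this sketch.

```python
def download_seconds(size_mb, link_mbit_per_s):
    """Rough transfer time: megabytes -> megabits, divided by link speed."""
    return size_mb * 8 / link_mbit_per_s


LINK_MBIT = 10  # assumed downstream speed for a "10/5" connection

# Sizes are the figures quoted earlier in this thread.
for label, size_mb in [("all three repos", 680), ("Boost files alone", 152)]:
    secs = download_seconds(size_mb, LINK_MBIT)
    print(f"{label}: about {secs / 60:.1f} min")
```

Under those assumptions the full set of repos is a matter of minutes on such a link, with Boost alone making up a noticeable slice of that.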

That said I haven’t been involved in dealing with Boost/Eigen dev. overhead so I won’t push against the current state there.


#40

BTW if I read what Bob said correctly, regular boost with a version = the supported one is the boost stan uses (because stan hasn’t patched boost recently). Also, it makes sense that usung the appropriate macro defines, namespace declarations, and template specializations that stan can have a very large influence on Boost without patching Boost’s source… (Besides, stan doesn’t source patch those meddlesome compiler standard library versions, which are a source of problems at least for me, lol)