Schedule for Splitting Apart The Stan Repos?

betanalpha · February 14, 2017, 12:48am

With the talk of new dev projects, we should discuss when we want to split apart the stan-dev/stan repo into services (perhaps to be left named ‘stan’), algorithms, and language. Should we wait until after the ongoing work to clean up the language? It seems like a good project to do while @seantalts is going through the dev-ops process, as it might motivate cleanup of the upstream/integration testing process.

Bob_Carpenter · February 14, 2017, 6:21pm

Yes, please. I think we should move the existing stan-dev/stan code into three new repos:

stan-dev/language
stan-dev/algorithms
stan-dev/services

We can reserve the stan-dev/stan repo for top-level project stuff or get rid of it altogether in favor of one or more new admin/doc/feature-planning repos. We need a landing place for all the top-level wikis and for doc.

There’s a question of where the doc goes in all this. I’d like to add repo-specific doc and move it all to bookdown format:

language: user-facing language spec
algorithms: user-facing algorithm specs
services: I don’t know how much of this to merge with the algorithms specs—it’s really about what the command-line arguments mean
user’s guide: this will be the book (which we will be allowed to distribute online)

It’d also be nice if we had a global Stan bib repo that we all keep consistent for all of our papers, grants, and all this doc (bookdown uses BibTeX).

I think we also need dev-facing top-level doc for all of these projects, but that’s easy to add through a wiki.

I’d like to be involved in the doc re-org. Mitzi should be able to help with dev ops for the doc chains.

I don’t think all these longer-term goals (specifically all the doc into bookdown and figuring out where to put residual content from stan-dev/stan) should stop us from splitting into multiple repos.

betanalpha · February 16, 2017, 2:26am

We’re going to need one main repo that we can really consider as the C++ API to Stan (especially for the interfaces to leverage). Moreover, the “services” repo would have to include the algorithms as a submodule, so it really wouldn’t make sense on its own; we should keep that one as stan-dev/stan.

But this also raises a tricky problem. Right now lots of tests in algorithms require C++ Stan models which are automatically parsed when the tests are built. If we separate the language from algorithms then we can no longer leverage this useful feature. At the same time the algorithms can’t even use the C++ models without the log_prob implementations.

So perhaps we really need something like

stan <- algorithms <- language (parser + model spec + autodiff wrappers)?

ariddell · February 16, 2017, 7:59pm

Does the repo need to be split apart to divide up responsibility for components? Is the main attraction that there would be clear separation in the issue and PR management?

There’s a lot to be said for being able to refer to one git commit hash (well, now two with math) and completely specify a distribution of Stan.

betanalpha · February 16, 2017, 8:24pm

Splitting things out does help identify management responsibilities, but it also reduces testing burdens. At the very least the language code really should be in its own repository as it could be used by itself for anyone wanting to parse Stan programs down into C++ for use with their own algorithms.

betanalpha · February 16, 2017, 8:24pm

Also, doesn’t a stan-dev/stan hash include the hashes of its submodules?

seantalts · February 16, 2017, 8:58pm

We can split things out into separate logical modules without creating separate repos, so I think we should treat the two discussions orthogonally. Is the main motivation for a separate repo that there could exist users who would use JUST the Stan compiler without any of the algorithms or math that its generated code relies on?

In my experience (having worked at a bunch of places, some with monorepos, some with tiny repos, and some in between) it’s extremely nice to have a single historical record vs. trying to collate amongst multiple historical records and piece things together. This comes up most often when debugging. I’ve had this conversation in quite a few different contexts and I think Dan Luu summarizes the other advantages pretty well and links to a bunch of other discussion of the issue here: http://danluu.com/monorepo/

betanalpha · February 16, 2017, 9:31pm

Separate use but also separate testing. For example, splitting out the math repo drastically reduced the testing burden on the other repositories. If we want people to use the Stan modeling language without our algorithms then they’d need to be able to grab the language and math code without having to pull in the algorithms and the services that support them.

seantalts · February 16, 2017, 9:39pm

We could have separate testing without separate repos… I think I’m missing something - why do stanc-only users need to avoid pulling down, e.g., the algorithms code when they try to use just stanc? I get that there could be a few KB saved on GitHub’s behalf, but otherwise?

ariddell · February 16, 2017, 9:53pm

I’m sold on the monorepo. It would, however, be nice if one could delegate control over specific trees (i.e., subdirectories) to specific people. I gather this is how development on the monorepo at Google is done, for instance (see https://arxiv.org/abs/1702.01715)

seantalts · February 17, 2017, 3:34pm

I think we’re maybe small enough now that we could have that be a manual rule; we could record it as they do in a file in the subdirectory and then just socially require that one of those subdir owners approve pull requests relating to that subtree.
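A minimal sketch of how such a manual rule could be checked, assuming a per-subdirectory ownership table. The directory names, usernames, and function names below are all hypothetical and only illustrate the idea; nothing here reflects an existing Stan or GitHub tool.

```python
from pathlib import PurePosixPath

# Hypothetical ownership table: subdirectory -> set of owner usernames.
# In practice this could be read from an owners file kept in each subdirectory.
OWNERS = {
    "language": {"alice"},
    "algorithms": {"bob"},
    "services": {"carol", "dave"},
}


def owners_for(path, owners):
    """Return the owners of the deepest owned directory containing `path`."""
    # `parents` yields the closest enclosing directory first.
    for parent in PurePosixPath(path).parents:
        key = parent.as_posix()
        if key in owners:
            return owners[key]
    return set()


def unapproved_files(changed_files, approvers, owners):
    """List changed files whose owning subtree has no approving owner."""
    missing = []
    for path in changed_files:
        required = owners_for(path, owners)
        # Files outside any owned subtree need no owner approval in this sketch.
        if required and not (required & set(approvers)):
            missing.append(path)
    return missing
```

For example, `unapproved_files(["language/a.hpp", "services/b.hpp"], {"alice"}, OWNERS)` flags only `services/b.hpp`, since no owner of `services` has approved.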

Bob_Carpenter · February 17, 2017, 8:19pm

The over-arching goal is the standard goal in all of software engineering: modularity. The reason we want modularity is to reduce the complexity of what people have to deal with in terms of coding, design and doc by enforcing clean boundaries.

I can believe submodules aren’t the only way to do this.

Working back from our user-facing distribution goals, we need separate releases for all of:

Stan math (includes Boost and Eigen)
Stan language (includes math)
RStan (includes services)
CmdStan (includes services)
… other interfaces …

The other logical modules we have are the following.

algorithms (includes language)
services/commands (includes algorithms)

The interfaces, in their role as clients of Stan infrastructure, need to call the services layer as well as the language compiler.

From a developer perspective, I want to be able to work on just one of these listed items and test it independently. The problems we wind up having arise when we need to make synchronized changes to stan-dev/math and stan-dev/stan or to stan-dev/stan and stan-dev/rstan.
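The “includes” relationships listed above can be written down as a small dependency graph. The sketch below is only an illustration, assuming Python 3.9+ for `graphlib`; the lowercase module keys and the function names are made up for the example. It computes an order in which the modules could be built, and which modules sit downstream of a change and so might need retesting.

```python
from graphlib import TopologicalSorter

# Direct "includes" relationships as listed in the post.
# Each key depends on the modules in its value set.
INCLUDES = {
    "math": {"boost", "eigen"},
    "language": {"math"},
    "algorithms": {"language"},
    "services": {"algorithms"},
    "rstan": {"services"},
    "cmdstan": {"services"},
}


def build_order(includes):
    """Return modules ordered so every dependency precedes its dependents."""
    return list(TopologicalSorter(includes).static_order())


def affected_by(changed, includes):
    """Return every module that directly or transitively includes `changed`."""
    affected = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for module, deps in includes.items():
            if current in deps and module not in affected:
                affected.add(module)
                frontier.append(module)
    return affected
```

Under these assumptions, `affected_by("language", INCLUDES)` returns the algorithms, services, RStan, and CmdStan entries, while a change to `rstan` affects nothing else in the graph.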

I looked at the article Sean linked, and it left me curious about just what the scope is of the monorepos they talked about. I have a hard time believing that all of Google’s code base is in one repo, unless they mean something else by the word. So where do they decide to break in terms of project scale?

I take the point about having to duplicate code across repos. It’s terrible having all that makefile and Jenkins config duplicated (though I don’t know how easy it would be to have a master make for all of Stan).

seantalts · November 7, 2017, 3:40pm

Apparently GitHub has subdirectory-level permissions built in: https://help.github.com/articles/about-codeowners/

syclik · November 7, 2017, 4:49pm

I read through that, but didn’t see anything about permissions. By permissions, I’m talking about the ones listed in the Repository permission level for an organization help page:

Owner
Admin
Write
Read

Codeowners doesn’t seem to lock permissions down to a subdirectory, based on the doc. By that, I mean having certain members be Admin on a subdirectory, Write on some other subdirectory, and Read on another. The way it’s set up, we’d have to enforce by convention: anyone with Write permissions on the repo can merge any pull request on that repo.

Was there something in the doc that I missed?

seantalts · November 7, 2017, 7:18pm

The part about required reviewers is basically write permissions on a per-subdirectory basis: https://help.github.com/articles/enabling-required-reviews-for-pull-requests/

syclik · November 7, 2017, 7:58pm

Ah. I see. I missed that link the first time through. But it’s not locking down write permissions. In essence, the set of all people that would work on any of the pieces would have Write access to the repo, but then we’d limit that with the CODEOWNERS file and requiring code review from code owners. I think that effectively has the behavior we’d want. Still seems simpler to lock down the repo?

seantalts · November 7, 2017, 8:13pm

I’m not sure if I think full repos are simpler even if everything else were equal… But luckily there are a lot of other factors that weigh more heavily than that difference to me.
