Schedule for Splitting Apart The Stan Repos?

With the talk of new dev projects we should discuss when we want to split apart the stan-dev/stan repo into services (perhaps to be left named ‘stan’), algorithms, and language. Should we wait until after the ongoing work to clean up the language? It seems like a good project to do while @seantalts is going through the dev-ops process as it might motivate clean up of the upstream/integration testing process.

Yes, please. I think we should move the existing stan-dev/stan code into three new repos:

  • stan-dev/language
  • stan-dev/algorithms
  • stan-dev/services

We can reserve the stan-dev/stan repo for top-level project stuff or get rid of it altogether in favor of one or more new admin/doc/feature-planning repos. We need a landing place for all the top-level wikis and for doc.

There’s a question of where the doc goes in all this. I’d like to add repo-specific doc and move it all to bookdown format:

  • language: user-facing language spec

  • algorithms: user-facing algorithm specs

  • services: I don’t know how much of this to merge with the algorithms specs—it’s really about what the command-line arguments mean

  • user’s guide: this will be the book (which we will be allowed to distribute online)

It’d also be nice if we had a global Stan bib repo that we all keep consistent for all of our papers, grants, and all this doc (bookdown uses BibTeX).

I think we also need dev-facing top-level doc for all of these projects, but that’s easy to add through a wiki.

I’d like to be involved in the doc re-org. Mitzi should be able to help with dev ops for the doc chains.

I don’t think all these longer-term goals (specifically all the doc into bookdown and figuring out where to put residual content from stan-dev/stan) should stop us from splitting into multiple repos.

We’re going to need one main repo that we can really consider as the C++ API to Stan (especially for the interfaces to leverage). Moreover, the “services” repo would have to include the algorithms as a submodule, so it really wouldn’t make sense on its own. We should therefore keep that one as stan-dev/stan.

But this also raises a tricky problem. Right now lots of tests in algorithms require C++ Stan models which are automatically parsed when the tests are built. If we separate the language from algorithms then we can no longer leverage this useful feature. At the same time the algorithms can’t even use the C++ models without the log_prob implementations.

So perhaps we really need something like

stan <- algorithms <- language (parser + model spec + autodiff wrappers)?

Does the repo need to be split apart to divide up responsibility for components? Is the main attraction that there would be clear separation in the issue and PR management?

There’s a lot to be said for being able to refer to one git commit hash (well, now two with math) and completely specify a distribution of Stan.

Splitting things out does help identify management responsibilities, but it also reduces testing burdens. At the very least the language code really should be in its own repository as it could be used by itself for anyone wanting to parse Stan programs down into C++ for use with their own algorithms.

Also, doesn’t a stan-dev/stan commit hash already include the commit hashes of its submodules?

We can split things out into separate logical modules without creating separate repos, so I think we should treat the two discussions orthogonally. Is the main motivation for a separate repo that there could exist users who would use JUST the Stan compiler without any of the algorithms or math that its generated code relies on?

In my experience (having worked at a bunch of places, some with mono, some with tiny, and some in between) it’s extremely nice to have a single historical record vs. trying to collate amongst multiple historical records and piece things together. This comes up most often when debugging. I’ve had this conversation in quite a few different contexts and I think Dan Luu summarizes the other advantages pretty well and links to a bunch of other discussion of the issue here: http://danluu.com/monorepo/

Separate use but also separate testing. For example, splitting out the math repo drastically reduced the testing burden on the other repositories. If we want people to use the Stan modeling language without our algorithms then they’d need to be able to grab the language and math code without having to pull in the algorithms and the services that support them.

We could have separate testing without separate repos… I think I’m missing something: why would stanc-only users need to avoid pulling down, e.g., the algorithms code when they just want to use stanc? I get that there could be a few KB saved on GitHub’s behalf, but otherwise?

I’m sold on the monorepo. It would, however, be nice if one could delegate control over specific trees (i.e., subdirectories) to specific people. I gather this is how development on the monorepo at Google is done, for instance (see https://arxiv.org/abs/1702.01715)

I think we’re maybe small enough now that we could have that be a manual rule; we could record it as they do in a file in the subdirectory and then just socially require that one of those subdir owners approve pull requests relating to that subtree.
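As a rough illustration of that manual rule, here is a minimal Python sketch of the check a reviewer (or a script) would perform. Everything named here is an assumption for illustration only: the subdirectory names, the owner handles, and the idea of loading the per-subdirectory owner files into a single table are all hypothetical, not an existing Stan or GitHub convention.

```python
from pathlib import PurePosixPath

# Hypothetical owners table: subdirectory -> set of owner handles.
# In the scheme described above, each entry would be recorded in a
# file inside the corresponding subdirectory.
OWNERS = {
    "language": {"alice"},
    "algorithms": {"bob"},
    "services": {"bob", "carol"},
}


def owning_dir(path, owners):
    """Return the deepest owned directory that contains `path`, or None."""
    for parent in PurePosixPath(path).parents:  # deepest directory first
        key = str(parent)
        if key in owners:
            return key
    return None


def missing_approvals(changed_files, approvers, owners=OWNERS):
    """List owned subdirectories touched by a change that no owner approved."""
    approvers = set(approvers)
    missing = set()
    for changed in changed_files:
        subdir = owning_dir(changed, owners)
        if subdir is not None and not (owners[subdir] & approvers):
            missing.add(subdir)
    return sorted(missing)
```

For example, a pull request touching `language/parser.hpp` that was approved only by the hypothetical user `bob` would be reported as still needing an owner of `language`. Files outside any owned subdirectory are not restricted in this sketch.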

The over-arching goal is the standard goal in all of software engineering: modularity. The reason we want modularity is to reduce the complexity of what people have to deal with in terms of coding, design and doc by enforcing clean boundaries.

I can believe submodules aren’t the only way to do this.

Working back from our user-facing distribution goals, we need separate releases for all of:

  • Stan math (includes Boost and Eigen)

  • Stan language (includes math)

  • RStan (includes services)

  • CmdStan (includes services)

  • … other interfaces …

The other logical modules we have are the following.

  • algorithms (includes language)

  • services/commands (includes algorithms)

The interfaces, in their role as clients of Stan infrastructure, need to call the services layer as well as the language compiler.
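The "includes" relationships in the two lists above form a small dependency graph. The following Python sketch encodes them and derives a build order and the full set of upstream modules for any one piece. The lowercase module names are labels chosen for this illustration, and third-party code bundled with math (Boost, Eigen) is left out; it uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Each module maps to the modules it directly includes, as listed above.
# Names are illustrative labels, not actual repository names.
DEPENDS_ON = {
    "math": set(),
    "language": {"math"},
    "algorithms": {"language"},
    "services": {"algorithms"},
    "rstan": {"services"},
    "cmdstan": {"services"},
}


def transitive_deps(module, graph=DEPENDS_ON):
    """Return every module that `module` depends on, directly or indirectly."""
    seen = set()
    stack = [module]
    while stack:
        for dep in graph[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def build_order(graph=DEPENDS_ON):
    """Return the modules ordered so each comes after everything it includes."""
    return list(TopologicalSorter(graph).static_order())
```

Under this encoding, `transitive_deps("language")` contains only `math`, which is the case made earlier in the thread for using the language without the algorithms, while either interface pulls in the whole chain.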

From a developer perspective, I want to be able to work on just one of these bulleted items and test it independently. The problems we wind up having come when we need to make synchronized changes to stan-dev/math and stan-dev/stan or to stan-dev/stan and stan-dev/rstan.

I looked at the linked article Sean sent, and it left me curious about the scope of the monorepos it discusses. I have a hard time believing that all of Google’s code base is in one repo, unless they mean something else by the word. So where do they decide to break in terms of project scale?

I take the point about having to duplicate code across repos. It’s terrible having all that makefile and Jenkins config duplicated (though I don’t know how easy it would be to have a master make for all of Stan).

Apparently GitHub has subdirectory-level permissions built in: https://help.github.com/articles/about-codeowners/

I read through that, but didn’t see anything about permissions. By permissions, I’m talking about the ones listed in the Repository permission level for an organization help page:

  • Owner
  • Admin
  • Write
  • Read

Based on the doc, CODEOWNERS doesn’t seem to lock permissions down to a subdirectory. By that, I mean having certain members be Admin on one subdirectory, Write on another, and Read on a third. The way it’s set up, we’d have to enforce by convention: anyone with Write permissions on the repo can merge any pull request on that repo.

Was there something in the doc that I missed?

The part about required reviewers is basically write permissions on a per-subdirectory basis: https://help.github.com/articles/enabling-required-reviews-for-pull-requests/

Ah. I see. I missed that link the first time through. But it’s not locking down write permissions. In essence, the set of all people that would work on any of the pieces would have Write access to the repo, but then we’d limit that with the CODEOWNERS file and requiring code review from code owners. I think that effectively has the behavior we’d want. Still seems simpler to lock down the repo?

I’m not sure if I think full repos are simpler even if everything else were equal… But luckily there are a lot of other factors that weigh more heavily than that difference to me.
