We discussed this in the last Stan meeting, but I just wanted to see if there are any extra thoughts on this.
One problem with posteriordb right now is that installing the R package from GitHub essentially clones the whole repository. This will become infeasible as posteriordb grows, something that has been identified by @avehtari.
My suggestion is to move the R package and the python lib to a separate repository called posteriordbtools.
One suggestion that came up was that we might just want to reduce the size of posteriordb by moving out large files to a separate database. I think that is a good idea in the long term, but short term I think it is important that we do not break anyone’s code (that is not reinstalling the R package or Py lib).
Very good point. I have not thought much about this. I guess it is better to split them into two different repos right away? Or what do you think?
I’m not sure we need to agree on what to support for both? I would expect the R package to be more developed in the short term, but I guess that is OK since the main content is the posteriordb repo itself. The packages are just for convenience?
I’m not opposed to having them in separate repos, but if you want to keep them together it’s possible to install from github and avoid cloning the whole repo by uploading built versions of the package to stan-dev/r-packages. This is how we recommend people install cmdstanr, so we know it works well. You would just have to change the installation instructions to
I think this is a good idea anyway. This will also solve the issue of installing the package. Although it comes with some additional maintenance, I guess? Or can I include this automatically in github actions?
An argument for separating out the packages is also that I now run a lot of test suites on the content of the database even if I just make a small change in the package. So separating the packages would also reduce the test runs. This might be possible to do anyway somehow, but I’m not sure how to do it in an R CI workflow. It would also simplify installing a dev package version from GitHub directly…
I do not really have a big opinion on this, but I’m still leaning toward separate repos for separate packages (and adding the package to stan-dev/r-packages). Any thoughts @avehtari or @jonah ?
I think separating them is a good thing, as the database part changes rarely and many users don’t need to download it at all since they can just access the specific data they are querying. If they do clone, we could also recommend a shallow clone with depth 1 (git clone --depth 1). Installing the code part of posteriordb directly from git is probably still going to be used by those who want to test development versions.
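For reference, a depth-1 clone of the kind suggested above could be sketched from R roughly as follows. This is only an illustration: the repository URL, the destination folder name, and the helper `shallow_clone_args()` are assumptions made for the example, not anything posteriordb provides. Note that a shallow clone trims the commit history, not the size of the files in the latest commit.

```r
# Assumption: the database repository is stan-dev/posteriordb on GitHub.
pdb_url <- "https://github.com/stan-dev/posteriordb"

# Hypothetical helper: build the git arguments for a shallow clone.
# --depth 1 fetches only the most recent commit instead of the full history.
shallow_clone_args <- function(url, dest, depth = 1L) {
  c("clone", "--depth", as.character(depth), url, dest)
}

args <- shallow_clone_args(pdb_url, "posteriordb")

# Uncomment to actually clone (requires git on the PATH and network access):
# system2("git", args)
```

The clone itself is left commented out so the snippet can be sourced without touching the network.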
I agree that splitting it now is the best idea. Not as much for the short term, where Jonah’s idea would probably be the best, but eventually the actual database will be too big for Github, at least I think that is your plan? It could be moved off of Github or to Github LFS or some other actual online database system. And if we split it now, that makes that step in the future much easier.
Yes, that would be easy. Adding or upgrading a package in r-packages is as easy as:
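# clone the r-packages repo locally (with git), then from the package directory:
devtools::build() # build the package tarball
drat::insertPackage("path/to/package_tarball.tar.gz", "path/to/r-packages/") # add the tarball to the local drat repo
# and then push the changes to the r-packages repo with git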
We will also add a Github Action with which you will be able to upgrade or add and build a package in the stan-dev organization and push it in the r-packages repo with just a few clicks in the browser.
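Once a tarball has been inserted this way, users could install from the drat repository instead of from the git source. Here is a minimal sketch, assuming the r-packages repository is served at https://mc-stan.org/r-packages/ and that a package named posteriordb has been published there; both are assumptions for illustration, and `install_posteriordb()` is a hypothetical helper.

```r
# Assumption: URL at which the stan-dev/r-packages drat repo is served.
stan_repo <- "https://mc-stan.org/r-packages/"

# Put the Stan repository first so it takes precedence, and keep the user's
# configured repositories (e.g. CRAN) so dependencies can still be resolved.
repos <- c(stan_repo, getOption("repos"))

# Hypothetical helper: assumes "posteriordb" has been added to r-packages
# with drat::insertPackage() as described above.
install_posteriordb <- function() {
  install.packages("posteriordb", repos = repos)
}

# Call install_posteriordb() to run the installation (needs network access).
```

Installing this way downloads only the built package tarball, so it avoids cloning the database repository.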