Promoting posteriordb into an official Stan project

The posteriordb project has matured and @avehtari would like to make it an official Stan project, which implies that:

  • The repository will move to the stan-dev organization
  • The project will be governed as part of Stan, following the same processes, CoC etc. as other Stan projects

Links to info about posteriordb:

We’ll be happy to hear everybody’s opinion on the move or any questions you have.

While we hope to easily achieve consensus on the move, we would - in the end - like to record the consensus in an official vote (after discussion ends and potential concerns were addressed). Part of the motivation for the vote is to just have a track of this decision and second part of the motivation is to test the new voting process in a (hopefully) low-stakes setting. The period for voting will start once the discussion here ends.

Tagging @Stan_Development_Team as this cannot be constrained to a single module and is thus a developer-wide decision.

20 Likes

Thanks for setting this up Martin. I’m excited to test out the new voting process.

And thanks @avehtari for proposing the vote and @mans_magnusson for leading the posteriordb effort. I’m definitely in favor of including this in the stan-dev organization.

And we’ll make sure everyone is notified when it comes time to vote.

6 Likes

Sounds like a good idea to me!

3 Likes

Also think it’s a good idea!

3 Likes

I would go farther and say that it is essential for things like posteriordb (that work with Stan output but are not necessarily limited to Stan output) to become official Stan projects in order to expand the universe of Stan developers to more people like @mans_magnusson . I think that should be a third implication of approving posteriordb but if it requires a separate vote, then we should do that right afterward.

10 Likes

Yeah I agree with all of that and I think adding Måns as a developer would be a natural implication of making posteriordb an official Stan project. So I don’t think we need a separate vote for that if this vote passes.

5 Likes

Given the aligned goal of principled Bayesian computation this is a natural project to officially include within the Stan project. That said I personally think that a few changes would be necessary to establish the right precedents.

  • Mild refactoring of the code into common resources (models, gold runs, etc) and parallel R and python interfaces. This would help emphasize the language agnostic role of the database while also facilitating separate contributions to the base database and the interfaces. We need to do better at emphasizing that the Stan project is not just RStan and the accompanying R environment.

  • The summarize_draws summary needs to change as it does not properly reflect the structure of the MCMC CLT. In particular while ess_bulk and ess_tail are fine extras (although ideally they would be optional given how expensive they are for non-cached fits) the summary needs ess and mcmc_se, ideally in prominent locations near the mean/MCMC estimator.

  • We really shouldn’t include explicit references to other packages to avoid the maintenance overhead, especially if posteriordb becomes more popular with other projects. Plugin functionality or something similar would be an appropriate alternative.
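As a rough illustration of the ess and mcmc_se quantities mentioned in the second bullet, here is a minimal Python sketch that estimates both for a single chain using batch means. This is a swapped-in textbook technique, not the estimator used by posterior, ArviZ, or Stan, and the names `batch_means_summary` and `ar1_chain` are hypothetical helpers invented for this example.

```python
import math
import random
import statistics


def batch_means_summary(draws, n_batches=20):
    """Mean, sd, MCSE and ESS for one chain of scalar draws (batch means)."""
    batch_size = len(draws) // n_batches
    if batch_size < 2:
        raise ValueError("need at least 2 draws per batch")
    # Drop the remainder so that all batches have equal size.
    used = draws[: batch_size * n_batches]
    batch_means = [
        statistics.fmean(used[i * batch_size : (i + 1) * batch_size])
        for i in range(n_batches)
    ]
    mean = statistics.fmean(used)
    variance = statistics.variance(used)
    # The spread of the batch means estimates the variance of the overall mean.
    mcse = statistics.stdev(batch_means) / math.sqrt(n_batches)
    # Under the MCMC CLT, mcse^2 ~ variance / ess, so solve for ess.
    ess = variance / mcse**2
    return {"mean": mean, "sd": math.sqrt(variance), "mcse": mcse, "ess": ess}


def ar1_chain(n, rho, seed=1):
    """Hypothetical stand-in for MCMC output: a correlated AR(1) chain."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


summary = batch_means_summary(ar1_chain(20000, rho=0.9))
print(summary)
```

With strongly correlated draws the reported ess is far smaller than the number of draws, which is why the mean is only meaningful alongside its mcse.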

4 Likes

Thanks for the feedback.

posteriordb already has parallel R and Python interfaces (Python information at https://github.com/MansMeg/posteriordb/tree/master/python).

In addition to supporting R, RStan and CmdStanR, posteriordb already supports PyStan, CmdStanPy and PyMC3, and we are in the process of adding TensorFlow Probability, Pyro and Turing.jl. We have started discussions with the developers of the recently published PPL Bench and Inference Gym to collaborate to further speed up the language-agnostic progress, improve compatibility, and share best practices for the use cases.

For modularity, summarize_draws comes from the posterior package, and in Python and Julia the corresponding functionality comes from ArviZ. Thus it would be useful to separate the discussion of the specific functions imported from other packages from the discussion of what computations would be good in different use cases.

As the discussion of the posterior and ArviZ packages and their default outputs would deserve its own thread, I just briefly mention that posterior::summarize_draws computes MCSE for all of mean, median, sd, mad, q5, and q95 and they are readily available. You can easily choose with posterior::summarize_draws arguments which summary quantities to show (we have also considered allowing ggplot2-style themes, so that everyone could globally set their favorite default output). ArviZ has similar functionality. I’m not sure what you mean by non-cached fits, but I guess you mean that some summary statistics could be cached. I don’t think the speed would be an issue in posteriordb, but maybe in some use cases we could do caching of summary statistics. ess_bulk has the benefit that it is also defined when the CLT doesn’t hold due to infinite variance. ess_tail is useful as people are often interested in tail quantities, too. For specific quantities of interest we recommend eventually looking at the corresponding MCSEs.

Reduction of maintenance overhead has been a guiding principle in using a modular approach, and thus we try to put all convergence, MCSE, etc. computations in one place, i.e. the posterior package in R and ArviZ in Python. ArviZ has automated tests checking the computation against the posterior package to make certain that the results should be the same whether using posteriordb in R or Python.

5 Likes

But they are not parallel within the organization of the project. Right now posteriordb is organized as an R project with an extra Python interface, which as I understand it matches its development history. But to really be inclusive I recommend a literal reorganization into something like

> common_assets
> r_interface
> python_interface

That’s all well and good but it also introduced dependencies that aren’t appropriate for Stan. If the goal is to produce a test bed for all probabilistic programming languages and not just Stan then why include this as an official Stan project?

Modularity is a concern independent of dependencies. In this case you’re introducing lots of external dependencies which would become Stan dependencies, and hence maintenance concerns.

As I understand it posteriordb is meant to provide a testbed for proper probabilistic computation, which is limited to estimating expectation values. From that perspective the estimation of expectation values should be front and center – any summary compatible with this perspective should present well-defined expectands (i.e. functions from the ambient space to the real numbers/random variables) and information about their expectation (estimate, standard error, and if anything else the variance and ESS that go into the standard error). If the expectation and/or variance don’t exist then the expectand shouldn’t be displayed, or should be displayed separately.

Any general summary of the sample ensemble, including summaries of the pushforward of the target along the expectand such as quantiles and the heuristic bulk/tail esses, isn’t relevant to the stated computational validation goal. Indeed as you note that summarizing functionality can be provided by other packages and for proper modularity posteriordb would just need to provide API access to the latent samples.

As I noted above modularity is not the same thing as dependencies. Right now many of these other packages depend on Stan projects but Stan projects largely don’t depend on the other packages, which minimizes development overhead for Stan.

Hi all,

This post is 10 days old and I want to get a sense of whether all the positions have been expressed before moving this issue to vote. I’ll be wrapping this thread in the next four days (Nov 6th) as the SGB thinks that two weeks of discussion might be a reasonable timeframe in this topic. In the meantime, I will ask you to please state clearly your agreements/disagreements. If items are still pending after the fourth day, let’s handle them during voting, unless you all feel you need additional days for discussion.

Thank you everyone for all your contributions/feedback.

4 Likes

I think some of the issues are valid and there are good suggestions for further development of posteriordb, but I would really like to separate two questions:
a) Should posteriordb become an official Stan project
b) How should posteriordb be implemented/structured/…

I don’t think it makes a lot of sense to make a) depend on answer to b), unless there are some extreme issues. I believe the answer to a) should depend mostly on whether we as a community care for the project, whether the project is in a state that we think it is sufficiently likely to succeed in its goals and whether the project is sustainable without being a burden for current devs.

Since Aki (if I understand it correctly) is basically bringing the developer-power with the project I think sustainability is at least mid-term not a big issue. I also believe that Stan doesn’t make a very strong commitment to maintain the project indefinitely by bringing it under its umbrella. The stan-dev org already has several projects that are not actively maintained, e.g. MathematicaStan, and I don’t think it brings any noticeable negative externalities.

I agree that the scoping issue is relevant: should the Stan community support projects that are not directly/closely/exclusively bound to Stan as a tool?

My current answer to this question would be: The project needs to serve some needs of Stan proper, but if the same functionality is useful for other projects, I think it can be a win-win situation for Stan to handle it (e.g. because we can get help with the project from outside the Stan community). The situation would be different if there was another platform/organization to which the project would fall more naturally.

Currently we already have projects that fall into this category (especially bayesplot, projpred, posterior).

I am not sure I follow the arguments: since Stan (as a tool) does not (and I presume never will) depend on posteriordb, why is it important that dependencies for posteriordb are appropriate for Stan? Or are you using “Stan” in broader sense than “Stan as a tool”?

2 Likes

Last day to voice your opinions everyone! I’ll be wrapping this thread at around 10pm Eastern Time.

S.

1 Like

Hello everyone,

The SGB had a meeting today and we all agreed on giving this thread a bit more time. The main reason for the extra time is that this is the first time we will test the voting mechanism and we want to ensure that as many comments as possible have been addressed before moving to vote. Plus it was an eventful weekend in the US.

During voting time, people can still provide comments, disagreements & feedback but it would be ideal if people have enough information about the project in advance so the vote can move relatively quickly.

Once again, thank you all!

2 Likes

I want to comment on the question of adding projects into the Stan ecosystem more generally before going into specifics for posteriordb.

Official Stan Projects

I think that we need to be very careful with the intent of introducing projects into Stan, especially as Stan grows in popularity. In my opinion projects should be added to Stan if and only if they support the fundamental goal of Stan – providing an expressive probabilistic programming language for users to specify sophisticated, bespoke models, algorithms capable of accurately fitting those models, and interfaces to both.

Projects that build upon these tools but do not directly support them do not strictly need to be part of the Stan ecosystem; that is not a value judgement of those projects but rather an appropriate compartmentalization. If anything including tools that build upon Stan introduces implicit valuations, as inclusion lends an implicit, if not explicit, authority to those projects. This is especially true when there are multiple packages that accomplish the same goal. See for instance the Python ecosystem where there are no official numerical, statistical, or graphical libraries but instead many packages that build upon core Python to implement these features.

For a long time the stan-dev repository was a proxy for the projects, including more speculative research projects, coming out of the Columbia team and their collaborators and I think that there has been lots of confusion/bad feelings about what has been included and what hasn’t. Moving forwards, especially as the Stan project itself grows, I think that a more general inclination towards including fewer projects rather than more will avoid the most problems both culturally and technically. I especially hope that as we move towards a more general Stan contributor designation the prestige of being an official Stan developer becomes less important and we can recognize developers of packages that build off of the core Stan tools.

For example contrast MatlabStan to MathematicaStan. Both are useful interfaces to the core Stan code with limited developer support, but only one is included in stan-dev. Both are relatively simple interfaces to CmdStan – is there any major difference between the two? Is there any technical reason why one is an “official” project versus the other? Again this is a separate question from whether or not the contributions of Brian and Vincent should be acknowledged, as they obviously should.

posteriordb

As I wrote in my first post I do think that the general goals of posteriordb are aligned with the goals of the core Stan tools. The more empirical testing of probabilistic computation the better, at least if that empirical testing is supported by enough theory to ensure rigorous results. And that, to me, is the critical detail.

There are many heuristics for diagnosing bad probabilistic computation, but those heuristics have a wide range of theoretical foundations. Fundamentally Bayesian computation, and hence Stan, is about computing expectation values and any empirical benchmark tool should be framed around the estimation of expectation values. For MCMC in particular that requires the various terms that help support the existence of a MCMC central limit theorem and then the terms that the CLT uses to construct expectation value estimators and standard errors.

This is a poorly taught but critical point. In particular many use MCMC samples to look at empirical pushforward distributions which are only subtly related to expectation values. Formally if the mean of the pushforward distribution along f : X \rightarrow \mathbb{R} exists then it is equal to the expectation value E_{\pi}[f]. Other features of the pushforward distribution can be related to the expectation of compositions of f with other functions, for example the variance with E_{\pi}[f^{2}], etc. In other words the behaviors of the pushforward distribution only take on a formal, testable meaning when framed as expectation values.
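To make the relationships above concrete, here is a short summary in the notation of this post. The symbol f_{\#}\pi for the pushforward distribution, the variable y, the threshold t, and the indicator \mathbb{I} are notation introduced here for illustration only.

```latex
% Mean of the pushforward along f, when it exists:
\mathbb{E}_{f_{\#}\pi}[y] = E_{\pi}[f]

% Variance of the pushforward, via the composition f^{2}:
\mathrm{Var}_{f_{\#}\pi}[y] = E_{\pi}[f^{2}] - \left( E_{\pi}[f] \right)^{2}

% Cumulative distribution function at t, via composition with an indicator:
F_{f_{\#}\pi}(t) = E_{\pi}\!\left[ \mathbb{I}[f \le t] \right]
```

Each right-hand side is an expectation value under \pi, so each can be estimated with an MCMC estimator and given a standard error whenever the corresponding central limit theorem holds.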

Consequently if the functionality of posteriordb is limited to verifying the estimation of expectation values in a high-level independent way then it is a great fit for the core goals of Stan in my opinion. If, however, the posteriordb developers want to include more features that support tools beyond Stan’s then it’s less appropriate. I cannot emphasize this enough – whether or not posteriordb is a good fit for inclusion into Stan says almost nothing about the quality of the code or its overall utility to the projects! Again inclusion should not be taken as a quality judgement.

I am being especially pedantic here because I have been around the many probabilistic programming communities for a long time and know how little overlap there is between the goals of Stan and other projects. I have seen people trash Stan as being too limited and too slow compared to other projects – basically called the opposite of every buzzword that’s required in NeurIPS abstracts ;-) – only to see those projects retired for the next new hotness while Stan continues to push forward serving its applied community. To be clear there’s nothing necessarily wrong with the volatile nature of research and hence evolving goals that motivate benchmarks for other packages; rather I believe that there is insufficient overlap between the goals of Stan and the goals of other, more machine-learning oriented projects for a benchmark that serves all to be particularly useful for Stan in isolation.

To be precise there has been no general discussion of the inclusion of these projects into Stan and hence their appropriateness. As I mentioned above I think that we need to be very careful to separate out project policy from historical precedent, the latter of which was often influenced more by the evolving nature of Stan and internal politics than any coherent project-wide decision making.

Inclusion formalizes posteriordb as the official benchmarking used and recommended by the Stan project. If that benchmarking uses diagnostic code provided by external projects then we’ve either ceded responsibility to those projects or have to push on them to implement things in a certain way. Neither is necessary given the C++ implementations of diagnostics already available for all interfaces to use.

3 Likes

I understand the concerns and I don’t think I am currently in a good position to comment on them further and hope others do so. I am however a bit unclear what your stance is on the inclusion of posteriordb in particular, as I don’t think you’ve stated that explicitly (or I misunderstood you). Would it be fair to summarize your position as “Provided posteriordb is better aligned with the goals of Stan than it currently seems to be, I would support its inclusion”? Or am I misunderstanding?

Yes, and in my opinion that alignment is tied to the implementation details I brought up as well as some of Aki’s comments about serving the general probabilistic programming community.

@betanalpha Thanks for following up with more comments!

Yeah, that’s a good question and I have no clue! The repositories for the Julia interface are also not currently in stan-dev but I don’t know why that’s the case. It could just be an oversight or perhaps there was another reason in the past that we don’t have a record of anymore. It’s definitely true that historically we weren’t very consistent about these things and I agree we should be more formal about this going forward.

Yeah I think that’s fair. I also think the circumstances are a bit different for different packages. For example, the bayesplot and posterior packages are different than the loo and projpred packages in the sense that bayesplot and posterior are just splitting out functionality (plotting, manipulating draws, and diagnostics) that already existed in worse form inside of RStan. When splitting them out into separate packages we drastically improved the code and made them more maintainable and usable. There can of course be disagreements about which diagnostics and plots to emphasize, but those packages didn’t really introduce much new to Stan that wasn’t already included in RStan (whether they should have been in RStan in the first place is a fair but different question). So bayesplot and posterior just gave us more maintainable versions of things that already existed. On the other hand, packages like loo and projpred really did bring in substantially new functionality and methodology and I think that if we were starting them today they would have to go through this same approval process.

I definitely see what you mean about ceding responsibility to those other projects. @avehtari @mans_magnusson can you comment on to what extent posteriordb is relying on diagnostic code from other projects?

1 Like

I’m a bit confused about what the list of implementation details is. Also “some of Aki’s comments” is not clear.

I think it would be useful to first discuss the different goals of the project and how they align with the needs of the Stan project. The current use case scenarios are at https://github.com/MansMeg/posteriordb/blob/master/doc/use_cases.md. These use case descriptions have taken into account the Stan developer comments we asked for on Discourse. I’m listing just the titles here

  • Testing
    • Testing implementations of inference algorithms with asymptotically decreasing bias and variance (such as MCMC)
    • Testing implementations of inference algorithms with asymptotic bias (such as Variational inference)
    • System testing
    • Performance testing
  • Efficiency comparisons of inference algorithms with asymptotically decreasing bias and variance
  • Explorative analysis of algorithms
  • Developing new algorithms for interesting models
  • Code examples

Based on this I would assume that the above use case list is ok?

But then this indicates that the above use case list is not ok?

I agree that verifying the estimation of expectation values in a high-level independent way is one important use case, but we do list also other use cases. For example, regularly made performance testing of new Stan releases to check that there is no regression in the performance fits the core goals of Stan in my opinion. Also I think it’s useful that we include difficult posteriors for which we can’t currently get verified expectation values to push the algorithm development forward. We are tagging different posteriors, so that different use cases can easily pick a set of posteriors that are suitable for the specific use case (e.g. whether they have reference expectation values).

We list also explorative analysis that doesn’t need to be formal.

I’m not certain but I guess that “tools” mean other probabilistic programming frameworks. If so, that discussion is orthogonal to the list of goals above.

This is one of the reasons why the posteriordb project started, to make it more explicit what we think is relevant and required for good comparisons. We have seriously considered making posteriordb useful for other communities to improve how things are done, but we have also thought that the support for other frameworks should be limited and the focus would be on Stan. Do you object to any collaboration with other probabilistic programming communities? If 95% of posteriordb is for Stan, is 5% a show stopper to call it a Stan project? Would you like that 5% to be moved to another package?

We are currently talking with others to align goals. The goals will not be exactly the same, but there is useful overlap. By talking to others we are influencing them to see also our point of view.

Yes. Although I would prefer to say: The official benchmarking used and recommended by the Stan project will be documented and implemented in posteriordb. This was the reason we started work on this. Currently Stan project recommendations are scattered, and recommendations are less likely to be used if there is no easy-to-use software.

We’re happy to get help getting more details written of Stan project recommendations for each use case. The use cases we list have come up in discussions with Stan developers. One of the use cases is making stan-dev/stat_comp_benchmarks (Benchmark Models for Evaluating Algorithm Accuracy) easier to expand and checking whether the current “preliminary empirical results” can be upgraded from preliminary (the models/posteriors in stat_comp_benchmarks have also been used in other use cases, but in those use cases it was even more important to have a wider set of models).

Can you specify for which of the listed use cases you are worried about the use of diagnostic code provided by external projects? We do mention MCMC diagnostics in the reference posterior page https://github.com/MansMeg/posteriordb/blob/master/doc/REFERENCE_POSTERIOR_DEFINITION.md which is also based on your comments, but there we don’t define how the diagnostics should be computed. The GitHub repo main page illustrates the database part using posterior package diagnostics. posterior is a Stan project used by CmdStanR. For Python the related package is ArviZ, which is an external project used by CmdStanPy. ArviZ developers include some Stan developers and they are willing to have the same diagnostics in R and Python. Discussion of why CmdStanR and CmdStanPy are not using C++ implementations is worth its own thread.

Same reliance as for CmdStanR, CmdStanPy and PyStan, that is either posterior or ArviZ. So no extra dependencies.

4 Likes

Hi everyone,

In addition to @avehtari’s previous comments, I can also add that I would be able to take a leadership role in this repo/project together with Aki, even though I have now left Aalto for Uppsala University.

As Aki mentioned, as of now we do not rely on other packages for any statistical inference in the posteriordb package. Also, I think we should probably avoid that as much as possible, as has been noted by Michael.

That said, I think that this project has quite a lot of use cases, so it is also important not to define all the details right now nor to limit the scope. Our focus will be on Stan, but adding other PPLs will not hurt the use cases for the Stan project. Getting more people from other PPLs interested will probably also improve the quality of posteriordb in general, which in turn would also benefit the Stan project.

/Måns

3 Likes

The only thing I can think to add is that the focus seems to be on the pros and cons of including posteriordb in Stan. However, you can also invert the question to consider the pros and cons of not including posteriordb.

In my opinion there is a real need for a formal, large-scale testing and benchmarking framework for Stan. This would provide defined “goalposts” for future contributions, i.e. to either match or improve on the previous test results. The testing framework for adding a new function to the Stan Math library already does this and I think the nature of having a PASS vs. FAIL test gives confidence that all contributions have been objectively validated. In contrast, adding new algorithms and other types of functionality to Stan, while obviously still evidence-based, is a less well-defined process. This is where having something like posteriordb helps a lot - so if not posteriordb, what is the alternative?
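To illustrate what a PASS vs. FAIL check against reference results could look like, here is a minimal Python sketch that compares an estimated expectation value with a reference value using their Monte Carlo standard errors. It is not the posteriordb API or the Stan Math test framework; the function name `check_against_reference` and the default threshold of 4 standard errors are assumptions made for this example.

```python
import math


def check_against_reference(estimate, mcse, reference,
                            reference_mcse=0.0, z_threshold=4.0):
    """Compare an estimate with a reference value on the standard-error scale.

    The threshold is an illustrative choice, not a Stan recommendation.
    """
    # Combine the two independent sources of Monte Carlo error.
    combined_se = math.sqrt(mcse**2 + reference_mcse**2)
    if combined_se == 0.0:
        raise ValueError("at least one standard error must be positive")
    z = (estimate - reference) / combined_se
    return {"z": z, "passed": abs(z) <= z_threshold}


# Made-up numbers purely for demonstration.
ok = check_against_reference(estimate=1.02, mcse=0.01,
                             reference=1.00, reference_mcse=0.005)
bad = check_against_reference(estimate=1.10, mcse=0.01,
                              reference=1.00, reference_mcse=0.005)
print(ok["passed"], bad["passed"])
```

A check like this only makes sense for expectands whose standard errors are well defined, and across many models and quantities the threshold would need to account for multiple comparisons.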

2 Likes