I want to comment on the question of adding projects into the Stan ecosystem more generally before going into specifics for posteriordb.
Official Stan Projects
I think that we need to be very careful with the intent of introducing projects into Stan, especially as Stan grows in popularity. In my opinion, projects should be added to Stan if and only if they support the fundamental goal of Stan – providing an expressive probabilistic programming language for users to specify sophisticated, bespoke models, algorithms capable of accurately fitting those models, and interfaces to both.
Projects that build upon these tools but do not directly support them do not strictly need to be part of the Stan ecosystem; that is not a value judgement of those projects but rather an appropriate compartmentalization. If anything, including tools that build upon Stan introduces implicit valuations, as inclusion lends an implicit, if not explicit, authority to those projects. This is especially true when there are multiple packages that accomplish the same goal. See, for instance, the Python ecosystem, where there are no official numerical, statistical, or graphical libraries but instead many packages that build upon core Python to implement those features.
For a long time the stan-dev repository was a proxy for the projects, including more speculative research projects, coming out of the Columbia team and their collaborators, and I think that there has been a lot of confusion and bad feeling about what has been included and what hasn't. Moving forwards, especially as the Stan project itself grows, I think that a general inclination towards including fewer projects rather than more will avoid the most problems, both culturally and technically. I especially hope that as we move towards a more general Stan contributor designation the prestige of being an official Stan developer becomes less important and we can recognize developers of packages that build off of the core Stan tools.
For example, contrast MatlabStan to MathematicaStan. Both are useful interfaces to the core Stan code with limited developer support, but only one is included in stan-dev. Both are relatively simple interfaces to CmdStan – is there any major difference between the two? Is there any technical reason why one is an “official” project versus the other? Again, this is a separate question from whether or not the contributions of Brian and Vincent should be acknowledged, as they obviously should be.
posteriordb
As I wrote in my first post, I do think that the general goals of posteriordb are aligned with the goals of the core Stan tools. The more empirical testing of probabilistic computation the better, at least if that empirical testing is supported by enough theory to ensure rigorous results. And that, to me, is the critical detail.
There are many heuristics for diagnosing bad probabilistic computation, but those heuristics have a wide range of theoretical foundations. Fundamentally, Bayesian computation, and hence Stan, is about computing expectation values, and any empirical benchmark tool should be framed around the estimation of expectation values. For MCMC in particular that requires verifying the conditions that support the existence of an MCMC central limit theorem and then the quantities that the CLT uses to construct expectation value estimators and their standard errors.
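To make that framing concrete, here is the standard construction in minimal form: given draws x_{1}, \ldots, x_{N} from a Markov chain targeting \pi, the MCMC estimator of E_{\pi}[f] and its approximate central limit theorem behavior are

\hat{f}_{N} = \frac{1}{N} \sum_{n = 1}^{N} f(x_{n}), \qquad \hat{f}_{N} \sim \mathcal{N} \left( E_{\pi}[f], \frac{\mathrm{Var}_{\pi}[f]}{\mathrm{ESS}[f]} \right),

where \mathrm{ESS}[f] is the effective sample size for f. The standard error \sqrt{\mathrm{Var}_{\pi}[f] / \mathrm{ESS}[f]} is what gives any empirical comparison between estimates and reference values a quantifiable scale.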
This is a poorly taught but critical point. In particular, many use MCMC samples to look at empirical pushforward distributions, which are only subtly related to expectation values. Formally, if the mean of the pushforward distribution along f : X \rightarrow \mathbb{R} exists then it is equal to the expectation value E_{\pi}[f]. Other features of the pushforward distribution can be related to expectation values of compositions of f with other functions, for example the variance with E_{\pi}[f^{2}] - (E_{\pi}[f])^{2}, and so on. In other words, the behavior of the pushforward distribution only takes on a formal, testable meaning when framed in terms of expectation values.
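For instance, writing \pi_{f} for the pushforward distribution, even a tail probability reduces to the expectation value of an indicator function composed with f,

\pi_{f} \left( (-\infty, q] \right) = E_{\pi} \left[ I_{(-\infty, q]} \circ f \right],

so empirical checks of quantiles or histogram shapes are rigorous only insofar as they are framed as estimates of these indicator expectation values.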
Consequently, if the functionality of posteriordb is limited to verifying the estimation of expectation values in a high-level, independent way then it is a great fit for the core goals of Stan in my opinion. If, however, the posteriordb developers want to include more features that support tools beyond Stan's then it's less appropriate. I cannot emphasize this enough – whether or not posteriordb is a good fit for inclusion into Stan says almost nothing about the quality of the code or its overall utility to the projects! Again, inclusion should not be taken as a quality judgement.
I am being especially pedantic here because I have been around the many probabilistic programming communities for a long time and know how little overlap there is between the goals of Stan and those of other projects. I have seen people trash Stan as too limited and too slow compared to other projects – basically the opposite of every buzzword that's required in NeurIPS abstracts ;-) – only to see those projects retired for the next new hotness while Stan continues to push forward, serving its applied community. To be clear, there's nothing necessarily wrong with the volatile nature of research and hence the evolving goals that motivate benchmarks for other packages; rather, I believe that there is insufficient overlap between the goals of Stan and the goals of other, more machine-learning oriented projects for a benchmark that serves all of them to be particularly useful for Stan in isolation.
To be precise, there has been no general discussion of the inclusion of these projects into Stan and hence of their appropriateness. As I mentioned above, I think that we need to be very careful to separate project policy from historical precedent, the latter of which was often influenced more by the evolving nature of Stan and internal politics than by any coherent project-wide decision making.
Inclusion formalizes posteriordb as the official benchmark used and recommended by the Stan project. If that benchmarking uses diagnostic code provided by external projects then we've either ceded responsibility to those projects or have to push on them to implement things in a certain way. Neither is necessary given the C++ implementations of the diagnostics already available for all interfaces to use.