Let me make a few general comments before the more precise inline comments below.
Firstly I think that we want to be really careful about feature bloat. Many projects decay because they incorporate too many features that are only loosely related to their original goal, diluting that goal while also increasing the maintenance burden to the point where nothing can get done. I believe that we should be conscious of these issues as we grow and be careful to keep the goals of Stan focused, ensuring as consistent and rigorous a workflow as we can for our users without also forcing false defaults upon them. In my opinion we want to tell users how to fit their models, not what models to fit or how to communicate them to their collaborators.
Secondly I think that we need to be careful about the precision of the stated goals for potential Stan projects, and about how those goals are realized in the design. It's one thing to talk about various uses of models, but what does that actually mean for the design of `posteriordb`? Is `posteriordb` just a collection of models for arbitrary use, or are the models designed for precise uses? Are those uses specific to Stan or are they more far reaching?
For example a precise goal is validating probabilistic computation for Stan, in which case each entry would contain a model (a mathematical and/or Stan specification), data, and validated expectation values for various variables (where they exist). Critically the entries would cover only the models targeted by Stan; in particular the database would not include models with discrete parameters.
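To make that concrete, here is a rough sketch of what one such validated entry might contain; the field names and the R representation are purely illustrative assumptions on my part, not a proposal for `posteriordb`'s actual schema.

```r
# Purely illustrative sketch of one validated entry (hypothetical field names).
entry <- list(
  name = "eight_schools_noncentered",
  model = list(
    stan_file = "eight_schools_noncentered.stan",  # Stan specification
    math_doc  = "eight_schools_noncentered.tex"    # mathematical specification
  ),
  data = list(
    J     = 8,
    y     = c(28, 8, -3, 7, -1, 1, 18, 12),
    sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
  ),
  # Validated expectation values for selected variables, where they exist,
  # each with its own error estimate and a record of how it was obtained.
  reference_expectations = list(
    mu  = list(mean = NA_real_, mcse = NA_real_, method = "long reference run"),
    tau = list(mean = NA_real_, mcse = NA_real_, method = "long reference run")
  )
)
```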
Another precise goal is covering the functionality of the Stan Math Library, which would motivate Stan programs (no data required) that make heavy, repeated use of particular functions/operations in order to facilitate empirical performance regression testing.
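As a sketch of what that kind of entry might look like in practice, here is one possible harness; the specific program, the focus on `log_sum_exp`, and the `rstan` calls are all just illustrative assumptions, not a proposed design.

```r
library(rstan)

# A data-free Stan program that hammers one function (log_sum_exp here, purely
# as an example) so that timing it is sensitive to changes in that function.
stress_code <- "
parameters {
  vector[1000] x;
}
model {
  x ~ normal(0, 1);
  for (i in 1:100)
    target += log_sum_exp(x);
}
"

sm  <- stan_model(model_code = stress_code)
fit <- sampling(sm, chains = 1, iter = 2, refresh = 0)  # only needed to get a stanfit object

# Time repeated gradient evaluations at a fixed unconstrained point.
upars  <- rnorm(1000)
timing <- system.time(
  for (r in 1:200) grad_log_prob(fit, upars)
)
timing
```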
Note that these designs could be useful for many of the other vague goals that have been discussed, and that's awesome (for example a large database of validated models would be useful for teaching Stan, for teaching modeling, for teaching basic algorithms, for performance comparison between algorithms, etc.). But the key is that the design is pegged to the precise goals relevant to Stan and not influenced by anything else.
In my opinion if `posteriordb` focused on one of those precise goals then I would have no problem with its inclusion. I start to get a little more hesitant when `posteriordb` is presented more as a grab bag of models (and model outputs!), some of which are useful to Stan and others that aren't. In that case it seems more natural to me for `posteriordb` to be its own project that serves a much wider community than Stan, which is great. Critically Stan could always use the parts of `posteriordb` useful for Stan tasks, but it wouldn't restrict the scope of `posteriordb` this way.
I'm pretty much in complete agreement here @jonah, although I would like to comment on some of the subtleties.
I think one of the benefits of refactoring `RStan` into core `RStan`, `posterior`, and `bayesplot` is that it illuminates what is part of the core functionality of "Stan" that we want consistent across all of the interfaces and what is more open-ended.
My concern with `posterior`, which as you note was really an issue inherited from the `RStan` functionality, is a fragmentation of the core Stan workflow. While the interfaces all call the same algorithm code they don't call the same analysis/diagnostic code, instead implementing it themselves in increasingly divergent ways; `cmdstanpy` using external analysis/diagnostic code is particularly troubling to me. Were `posterior` and other such REPL-local packages just wrappers around the implementations in https://github.com/stan-dev/stan/tree/develop/src/stan/analyze, or at least direct reimplementations of that code, then we would be able to ensure a much more consistent workflow across interfaces.
With `bayesplot` there is a similar question of interface consistency: if we have one official visualization package for one interface should we have the same for the others? Perhaps a more important question is whether it makes sense to talk about any "official" Stan visualization, and consequently whether any visualization packages should be part of Stan.
Another way of stating the question: how do we communicate to users which parts of Stan are officially highly recommended and which parts are optional? In particular if the Stan ecosystem is to be more inclusive then it should probably be open to many more projects, and we need to clearly communicate to users what is core and what is auxiliary.
A similar governance question is how we decide exactly what is core and what is auxiliary. For better or worse this discussion is bringing to the fore these higher-level issues and a lot of the organizational debt that we've been accruing as we try to establish a sustainable open source governance.
To avoid confusion I want to reorganize this a bit. I find many of these to be vague and overlapping.
For example from a probabilistic computation perspective there's not a huge difference between testing stochastic algorithms, testing deterministic algorithms, and performance testing. All of these reduce to quantifying estimator error (the last just normalizes that error by computational cost).
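To illustrate that reduction (the function names and structure here are my own, hypothetical):

```r
# Hypothetical sketch: accuracy testing and performance testing both reduce to
# quantifying estimator error; the latter just normalizes that error by cost.
standardized_error <- function(estimate, true_value, mcse) {
  # error measured in units of the reported Monte Carlo standard error
  (estimate - true_value) / mcse
}

error_per_cost <- function(estimate, true_value, cost) {
  # same error, normalized by computational cost (seconds, gradient evaluations, ...)
  abs(estimate - true_value) / cost
}
```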
System testing is orthogonal, as it is focused more on coverage of the autodiff library than on probabilistic computation. In particular the scope of models one would use to test autodiff performance/regressions is very different from the scope of models one would use to benchmark inference algorithms.
I don't see how the exploration and development of algorithms go beyond the first testing goal unless they simply imply a different scope of models of interest (for example, to be somewhat facetious, a separate section for logistic regression on the UCI datasets for academic applications).
Ultimately I don't think that these goals are going to be all that helpful to the conversation anyway, since it's the actual code that is more important. For example if the code is designed for one and only one goal but happens to be useful for other goals then great! But that doesn't mean that the accidental goals should necessarily be part of the package objectives.
Regression testing is somewhat ill-defined. Currently model fits are being used in Stan for performance regression testing of the math library, but this is a bit of a kludge to get some testing in place. The models don't really cover the scope of the math library well and are far too small to be sensitive to small performance changes. Then there is a single end-to-end logistic regression test that checks that exact inference outputs don't change.
Note that none of those are directly related to the accuracy of any algorithm. They're not testing the faithfulness of Stan, just the raw speed and whether the outputs change at all. Consequently this defines a completely different motivation and hence a different design.
And then how, exactly, are these models realized in the package? Do they include the model specification in math or in any particular probabilistic programming language? Do they include arbitrary Stan output without any guarantees on how that output relates to anything mathematically?
You list that as a vague goal. What are the implications for the package design? What functionality would be offered to support exploration?
The question here isn't one of collaboration, it's one of design constraints. In particular, exactly what are the relevant and good comparisons? For general Stan use that means something very precise; for many other communities it's a vague and ill-defined concept that's often redefined for every new application. If `posteriordb` supports the more general community then it may provide some functionality to Stan but it would not, in my opinion, be an appropriate Stan project. Again that doesn't mean that it's not a worthwhile or useful project, just one that really doesn't fit entirely within the Stan ecosystem.
Without explicit definitions of how the diagnostics are computed we haven't really defined the diagnostics. For example "effective sample size" is abused to mean any of the "effective sample size estimators" regardless of their properties, which causes no end of confusion amongst users working with different packages. This is one of the key issues with the package fragmentation: the terms being used are too vague and too easily interpreted differently across the various packages. But as you also note, yes, this is something worth its own thread.
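As a toy illustration of why the choice of estimator matters (neither of these is the estimator used by `posterior`, `rstan`, or any other package; both are naive single-chain sketches), two different truncation conventions applied to the same draws already give noticeably different "effective sample sizes":

```r
# Naive single-chain ESS sketches, for illustration only.
ess_initial_positive <- function(x) {
  n   <- length(x)
  rho <- acf(x, lag.max = n - 1, plot = FALSE)$acf[-1]  # drop lag 0
  k   <- which(rho < 0)[1]        # truncate at the first negative autocorrelation
  if (is.na(k)) k <- length(rho) + 1
  n / (1 + 2 * sum(rho[seq_len(k - 1)]))
}

ess_fixed_cutoff <- function(x, max_lag = 50) {
  n   <- length(x)
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]
  n / (1 + 2 * sum(rho))          # fixed-lag truncation: a different convention
}

x <- as.numeric(arima.sim(model = list(ar = 0.9), n = 5000))
c(ess_initial_positive(x), ess_fixed_cutoff(x))  # two different answers for the same draws
```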