Promoting posteriordb into an official Stan project

Let me make a few general comments before the more precise inline comments below.

Firstly, I think we want to be really careful about feature bloat. Many projects decay because they incorporate too many features that are only loosely related to their original goal, diluting that goal while increasing maintenance burdens to the point where nothing can get done. I believe that we should be mindful of these issues as we grow and be careful to keep the goals of Stan focused, ensuring as consistent and rigorous a workflow as we can for our users without forcing false defaults upon them. In my opinion we want to tell users how to fit their models, not what models to fit or how to communicate them to their collaborators.

Secondly, I think that we need to be careful about the precision of the stated goals for potential Stan projects, and how those goals are realized in the design. It’s one thing to talk about various uses of models, but what does that actually mean for the design of posteriordb? Is posteriordb just a collection of models for arbitrary use, or are they models designed for precise uses? Are those uses specific to Stan or are they more far-reaching?

For example, a precise goal is validating probabilistic computation for Stan, in which case each entry would contain a model (a mathematical and/or Stan specification), data, and validated expectation values for various variables (when they exist). Critically, those models would cover only the scope targeted by Stan; in particular the database would not include models with discrete parameters.

Another precise goal is covering the functionality of the Stan Math Library, which would motivate Stan programs (no data required) that exercise certain expensive functions/operations over and over again to facilitate empirical performance regression testing.

Note that these designs could be useful for many of the other, vaguer goals that have been discussed, and that’s awesome (for example, a large database of validated models would be useful for teaching Stan, for teaching modeling, for teaching basic algorithms, for performance comparison between algorithms, etc.). But the key is that the design is pegged to the precise goals relevant to Stan and not influenced by anything else.

In my opinion, if posteriordb focused on one of those precise goals then I would have no problem with its inclusion. I start to get a little more hesitant when posteriordb is presented more as a grab bag of models (and model outputs!), some of which are useful to Stan and others that aren’t. In that case it seems more natural to me for posteriordb to be its own project that serves a much wider community than Stan, which is great. Critically, Stan could always use the parts of posteriordb useful for Stan tasks without restricting the scope of posteriordb.

I’m pretty much in complete agreement here @jonah, although I would like to comment on some of the subtleties.

I think one of the benefits of refactoring RStan into core RStan, posterior, and bayesplot is that it illuminates what is part of the core functionality of “Stan” that we want consistent across all of the interfaces and what is more open-ended.

My concern with posterior, which as you note was really an issue inherited from the RStan functionality, is a fragmentation of the core Stan workflow. While the interfaces all call the same algorithm code, they don’t call the same analysis/diagnostic code, instead implementing it themselves in increasingly divergent ways; cmdstanpy using external analysis/diagnostic code is particularly troubling to me. Were posterior and other such REPL-local packages just wrappers around the implementations in https://github.com/stan-dev/stan/tree/develop/src/stan/analyze, or at least faithful reimplementations of that code, then we would be able to ensure a much more consistent workflow across interfaces.
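To make the fragmentation concern concrete, here is a minimal standalone sketch of split-R̂, the kind of diagnostic that would ideally live in one shared implementation wrapped by every interface; this is an illustrative version written for this comment, not the actual code in src/stan/analyze.

```python
import numpy as np

def split_rhat(chains):
    """Split-Rhat for one scalar quantity.

    chains: array of shape (n_chains, n_draws). Each chain is split in
    half so that within-chain trends also inflate the diagnostic.
    """
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    # Split each chain in half -> 2 * n_chains sequences of length `half`.
    split = chains[:, : 2 * half].reshape(2 * n_chains, half)
    chain_means = split.mean(axis=1)
    W = split.var(axis=1, ddof=1).mean()  # within-chain variance
    B = half * chain_means.var(ddof=1)    # between-chain variance
    var_plus = (half - 1) / half * W + B / half
    return np.sqrt(var_plus / W)
```

If every interface bound to a single implementation like this (via a thin wrapper rather than a reimplementation), the number a user sees reported as “Rhat” could not drift between interfaces.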

With bayesplot there is a similar question of interface consistency – if we have one official visualization package for one interface should we have the same for the others? Perhaps a more important question is whether it makes sense to talk about any “official” Stan visualization, and consequently whether any visualization packages should be part of Stan.

Another way of stating the question is: how do we communicate to users what parts of Stan are officially highly recommended and what parts are optional? In particular, if the Stan ecosystem becomes more inclusive then it will probably have to admit many more projects, and we need to clearly communicate to users what is core and what is auxiliary.

A similar governance question is how we decide exactly what is core and what is auxiliary. For better or worse this discussion is bringing to the front these higher-level issues and lots of the organizational debt that we’ve been accruing as we try to establish sustainable open source governance.

To avoid confusion I want to reorganize this a bit. I find many of these to be vague and overlapping.

For example, from a probabilistic computation perspective there’s not a huge difference between testing stochastic algorithms, testing deterministic algorithms, and performance testing. All of these reduce to quantifying estimator error (the last normalizing that error by computational cost).
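To make this reduction explicit (my notation, not from the discussion above): for an estimator $\hat{f}$ of an expectation value,

```latex
% Validation quantifies estimator error; performance testing normalizes
% that same error by the computational cost C needed to achieve it.
\text{validation:}\quad
  \mathrm{err}(\hat{f}) = \bigl| \hat{f} - \mathbb{E}_{\pi}[f] \bigr|,
\qquad
\text{performance:}\quad
  \mathrm{err}(\hat{f}) \text{ at fixed cost } C,
  \ \text{or equivalently effective samples per unit cost.}
```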

System testing is orthogonal, as it is focused more on coverage of the autodiff library than on probabilistic computation. In particular the scope of models one would use to test autodiff performance/regressions is very different from the scope of models one would use to benchmark inference algorithms.

I don’t see how the exploration and development of algorithms goes beyond the first testing goal unless it simply implies a different scope of models of interest (for example, to be somewhat facetious, having a separate section for logistic regression on the UCI datasets for academic applications).

Ultimately I don’t think that these goals are going to be all that helpful to the conversation anyway, since it’s the actual code that matters more. For example, if the code is designed for one and only one goal but happens to be useful for other goals then great! But that doesn’t mean that the accidental goals should necessarily be part of the package objectives.

Regression testing is somewhat ill-defined. Currently model fits are being used in Stan for performance regression testing of the math library, but this is a bit of a kludge to get some testing in place. The models don’t really cover the scope of the math library well and are way too small to be sensitive to small performance changes. Then there is a single end-to-end logistic regression test that checks that exact inference outputs don’t change.

Note that none of those are directly related to the accuracy of any algorithm. They’re not testing the faithfulness of Stan, just the raw speed and whether the outputs change at all. Consequently this defines a completely different motivation and hence design.

And then how, exactly, are these models realized in the package? Do they include the model specification in math or in any particular probabilistic programming language? Do they include arbitrary Stan output without any guarantees on how that output relates to anything mathematically?

You list that as a vague goal. What are the implications for the package design? What functionality is offered to support exploration?

The question here isn’t one of collaboration; it’s one of design constraints. In particular, exactly what are the relevant and good comparisons? For general Stan use that means something very precise – for many other communities it’s a vague and ill-defined concept that’s often redefined for every new application. If posteriordb supports the more general community then it may provide some functionality to Stan, but it would not, in my opinion, be an appropriate Stan project. Again, that doesn’t mean that it’s not a worthwhile or useful project – just one that doesn’t fit entirely within the Stan ecosystem.

Without explicit definitions of how the diagnostics are computed we haven’t really defined the diagnostics. For example, “effective sample size” is abused to mean any of the “effective sample size estimators” regardless of their properties, which causes no end of confusion amongst users working with different packages. This is one of the key issues with the package fragmentation – the terms being used are too vague and too easy to interpret differently in the various packages. But as you note, yes, this is worth its own thread.
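For example, the textbook definition (standard notation, not tied to any particular package) is

```latex
% Effective sample size of N correlated draws with lag-t autocorrelation
% rho_t; every package must estimate the rho_t from finite draws and
% truncate the infinite sum somehow, and those choices differ, so the
% "ESS" reported for the same draws differs across implementations.
N_{\mathrm{eff}} \;=\; \frac{N}{1 + 2 \sum_{t=1}^{\infty} \rho_t}
```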

Keep in mind that posteriordb would not rectify this situation on its own. When testing algorithms one has to test not just the algorithm output but also its error quantification, which is often stochastic. Even knowing the true expectation values, one has to do a lot more work to verify that the error quantification is consistent (multiple runs, evaluating the distribution of errors, etc.). For many algorithms, like Markov chain Monte Carlo, this can’t be reduced to a deterministic test, and so one has to rely on tricky hypothesis testing approaches for automated determinations.
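As an illustration of the kind of automated hypothesis test this implies, here is a minimal sketch; the run_fit interface is hypothetical, standing in for one independent fit. If the reported Monte Carlo standard error is calibrated, the standardized errors across runs should be approximately standard normal, which a chi-squared test can check.

```python
import numpy as np
from scipy import stats

def error_quantification_is_calibrated(run_fit, true_value,
                                       n_runs=100, alpha=0.01):
    """Stochastic test of an algorithm's error quantification.

    run_fit: callable returning (estimate, mcse) for one independent run.
    Under calibration z = (estimate - truth) / mcse is ~ N(0, 1), so the
    sum of z^2 over runs should follow a chi-squared(n_runs) reference.
    """
    z = np.array([(est - true_value) / mcse
                  for est, mcse in (run_fit() for _ in range(n_runs))])
    stat = np.sum(z ** 2)
    # Two-sided tail probability under the chi-squared reference.
    p = 2 * min(stats.chi2.cdf(stat, df=n_runs),
                stats.chi2.sf(stat, df=n_runs))
    return p > alpha  # flags miscalibration at the chosen false-positive rate
```

Even this only tests one expectand of one model, and the test itself fails with probability alpha on a correct implementation, which is exactly why these determinations are tricky to automate.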

It seems the discussion got a bit stuck, although I think some convergence of views happened. I think the ball is now primarily with the people promoting the inclusion (@avehtari, @jonah, @mans_magnusson, maybe someone else?). It seems to me that some of the disagreement @betanalpha is voicing is mostly about wording/understanding/making things explicit and could likely be overcome to the satisfaction of everybody, but it is also possible that the posteriordb project leaders would disagree fundamentally with some of Mike’s ideas.


Thanks, Martin!

I have tried to go through the whole thread to make a summary of where the discussion stands. In some parts the discussions are bigger than posteriordb, and I have tried to single those out so they can continue outside the issue of posteriordb.

So, the main question is whether to include posteriordb under the stan-dev umbrella or not. Below are the main arguments I could deduce for and against including posteriordb.

Arguments for including posteriordb

  1. It is essential that projects that work with Stan, but are not limited to Stan, become part of the Stan project. This development will expand the user base and the pool of potential developers connected to the Stan project.
  2. Including other PPLs in addition to Stan will enable developers and users of other PPLs to also contribute to Stan.
  3. The risk of the project becoming unmaintained is relatively small, and the potential cost of a project not being actively supported is also small. Hence the overall risk is not that large.
  4. Verifying the estimation of expectation values in a high-level independent way is a great fit for the core goals of Stan.
  5. The general goals of posteriordb are aligned with the goals of the core Stan tools.

Arguments for NOT including posteriordb

  1. Since posteriordb also includes PPLs other than Stan, it is not clear whether it should be included as a Stan project. Including more vaguely related projects will broaden Stan too much. Instead, Stan should stay mainly focused on the Stan core parts. This perspective does not mean that posteriordb is a bad project; it should just not be a part of stan-dev. If posteriordb supports the more general community, it may provide some functionality to Stan, although this is not sufficient to make it an appropriate Stan project.
  2. posteriordb risks feature bloat in the Stan project. We would include too many things that increase the maintenance burden, and this risks a decay of the Stan project in the long term.
  3. posteriordb and its use cases are currently too vague. Projects included in Stan should be more precise about their goals and how they plan to achieve them.

Important things to do in posteriordb (that we all agree on)

  1. posteriordb should be language-agnostic (i.e., not an RStan project).
  2. We should minimize “outside” dependencies of posteriordb. Ideally, we should only use code within stan-dev to compute statistics of interest (such as ess_bulk, ess_tail, MCSE, etc.).

Comments and suggestions outside posteriordb for further discussion

  1. Should summarize_draws (posterior) be changed to reflect MCMC CLT more explicitly?
  2. How do we decide exactly what is core and what is auxiliary in the Stan project?

Comment/discussions with disagreement

  1. Should only expectations of parameters be included, or also more extensive statistics such as quantiles, ESS values, etc.?
  2. How exactly should testing and benchmarking be conducted?
  3. The scope of posteriordb: should posteriordb be limited to testing of expectations only, or should it also support more exploratory analysis and benchmarking?
  4. Should the project be mainly focused on Stan models and the current limitations of Stan, or should it be used more broadly, e.g., should we include models with discrete parameters?
  5. The sharpness of the posteriordb project: how precise should the project’s goals be, and how should those goals be realized in the design of posteriordb?

I think this summarizes the ongoing discussion (but please feel free to correct this if you think I misunderstood some arguments). I think the main argument against including posteriordb is argument 1, i.e., how broad should the Stan project be?
Personally, I think this is more a matter of just deciding than something that has a true answer. Hence I believe this discussion is something the Stan Governing Board should focus on, since that also has broader implications.

The second issue, where there seem to be disagreements, is how broad or specific posteriordb should be. Personally, I think agility is quite important, so I would prefer not to limit the project too much beforehand (except for the things we all agree on). Depending on user needs, we should extend and improve in that direction. That said, I think we should keep all the discussion points (1-5) in mind as the project develops.

I hope this can help the Stan Governing Board in the next steps of the decision process.

/MĂ„ns


Thanks Mans, that’s a great summary. Super helpful.

I would say that when the @SGB approved the new voting procedure we essentially gave control over questions like this to the developers. So while I agree that the SGB should continue to think about (and act on) questions like how broad of a scope the Stan project should have, this particular decision on whether to move forward with including posteriordb (or other new projects) should be made by the Stan developers.

Here are a few other thoughts specifically on the “Arguments for NOT including posteriordb”:

  • I don’t think it’s a problem for an official Stan project to include other PPLs. In fact, I would think we should want to include other PPLs, at a minimum so that we can more easily do our own comparisons between Stan’s performance and that of other PPLs on the same models. We certainly don’t want to introduce anything into Stan that could break if something breaks in a totally different PPL, but that’s not the case here. So as long as posteriordb is useful to the Stan project, supporting other PPLs doesn’t seem like a problem to me and may even be an advantage.

  • Regarding feature bloat, this is an issue for all of the repositories in stan-dev. We should make our decision based on whether we think posteriordb is a good fit and then apply the same principles of combating feature bloat that we would apply to all other Stan projects. I’m not saying we always succeed in preventing feature bloat, but that’s not at all unique to posteriordb, so I don’t think it’s a reason to exclude it from stan-dev.

  • Regarding the use cases being too vague, that’s possible, but then we could make them less vague instead of rejecting the project entirely.

So after all this discussion I’m still very much in favor of including it in stan-dev. That said, @betanalpha brought up a bunch of good points and I think it’s possible to incorporate a lot of that feedback to make posteriordb even better.

@mans_magnusson, thank you for the summary, very helpful.

I think there is a different possible path: posteriordb could be incubated by the Stan project, and every year we can see whether posteriordb merits going out on its own.

My thinking is that posteriordb would benefit right now from being a Stan project for all the reasons listed. If it is successful, which I think it will be, then it can transition out in a year or two. It would be weird for other PPLs to evaluate themselves on a Stan product.

Good work though, I am a big fan of posteriordb.

Breck


@breckbaldwin That’s an interesting thought. I think if bringing in a new project requires a vote then removing a project that has existed for a year or two would also require a vote, so there couldn’t be any guarantee now of a transition out in the future. But I guess in theory it could happen if approved.

Do you mean that developers of other PPLs will ignore the results in posteriordb if it’s branded as a Stan project? If I didn’t misunderstand, then I don’t actually think that’s true, since Stan is already used by many other PPL developers as a reference (they compare to Stan all the time) and this would just provide an easier way to do that. From what I’ve heard it does sound like some other groups are working on something similar but not identical to posteriordb, but they’re going to do that whether or not posteriordb is a Stan project. Let me know if I misunderstood you though or if you disagree with this. I could be wrong.


@jonah I do have the sense that other PPLs might think negatively about PDB being a Stan project. I could be wrong. But there are other reasons to cut it loose once incubated. It is sort of an odd sibling to the other Stan projects, and I do see long-term maintenance concerns. In the short term it sounds great and makes sense. We could set the intention to revisit the issue in two years.

Breck

Thanks Breck. I’m curious about a few things you said so I hope you don’t mind if I keep asking follow-up questions.

Ok yeah I guess that’s possible, I really don’t know. If you’ve heard that from people then you have more info than I do, so I could totally be wrong. If some other project X had something like posteriordb and it was really good I wouldn’t hesitate to use it just because it was made by X (unless X had a bad reputation for making crappy stuff). But maybe other people and other projects don’t feel that way.

Personally I think it fits really well with our goals. It isn’t even an entirely new thing for Stan projects to support other PPLs, since almost all of our R packages for post-processing (e.g., bayesplot, loo, posterior, even shinystan) are designed to also work with models fit by other PPLs (they don’t depend on other PPLs, but they also don’t require Stan). posteriordb is a bit different, but it’s not like posteriordb actually depends on functions inside of some other PPL (as far as I know).

Can you expand on that? Is that because of the support for other PPLs? @mans_magnusson would posteriordb break if some PPL it supported stopped being developed? Or is there anything in posteriordb that would break if something in one of the other supported PPLs broke? From what I understand that’s not the case. But if it is then I guess that’s a concern. If that’s not the case then why is the maintenance issue any different than for anything else in stan-dev? I might be missing something important when it comes to maintenance though, so correct me if I’m wrong.

I’m not opposed to that (it seems like a good idea) but I don’t think we can guarantee anything because it will be based on voting. But yeah I guess we could certainly set a goal to revisit in two years.

Hi all,

Sorry, I thought it was a decision by the SGB. Good to know.

Some comments. No, nothing would break in posteriordb for Stan if another PPL is discontinued. The only consequence is that we would have old PPL code in the database and some separate code for testing those models. It would be really simple to remove that kind of legacy code for another PPL if we wanted to (and it is important that the code is actually separated by PPL so we can easily discontinue support for other PPLs).

There might be some maintenance concerns in that if the syntax of another PPL changes then the old model code might break. I think this can be easily handled by storing test results along the way, so we know that 8-schools works for Stan 2.23 but not, say, Stan 4.12. Since the focus is mainly model code I think this will most likely be a smaller problem - but it can happen (say, going from TensorFlow 1 to TensorFlow 2). Although, if the database becomes widely used, then I think (and hope) the whole PPL community can help out keeping it updated.
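For example, one lightweight way to store such results (a hypothetical record; the field names are illustrative, not posteriordb’s actual schema):

```python
# Hypothetical per-version test record for one database entry;
# all field names and values here are illustrative placeholders.
compatibility_record = {
    "posterior": "eight_schools-eight_schools_centered",
    "ppl": "stan",
    "ppl_version": "2.23",
    "model_compiles": True,
    "reference_results_reproduced": True,
}
```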

I do not actually think it would be that weird to evaluate against a Stan product. As Jonah mentions, if it is simple and easy to use, and Stan has a good reputation, it simplifies your life.

/MĂ„ns


Is there a wider umbrella that can capture projects that aren’t formally Stan projects but are related to Stan? This is a bit tangential to the current discussion but I think it merits some discussion. I’ll put it here for now as it could be a path for posteriordb.

The discussion here is all relevant for inclusion as an official Stan project, but the concerns @betanalpha brings up - about some projects being excluded (like MatlabStan and Stan.jl) for whatever reason (such as oversight, etc.), feature bloat, and clear alignment with Stan - mean we could possibly have our cake and eat it too with a wider Stan-recognized ecosystem that is not officially part of the Stan project but acknowledged to be useful/relevant for Stan users.

Call it the Stan Ecosystem (or my personal favorite, the Stan Suzerain; see https://en.wikipedia.org/wiki/Suzerainty). It would have the official projects that are in https://github.com/Stan-dev and then include the wider ecosystem that people can add to without bureaucracy. Star or keep all the officially recognized parts at the top - front and center - while allowing the inclusion of other packages for users to easily browse.

The discussion that is occurring here would still occur for inclusion into an official project. But having the official and unofficial in one place could highlight some discrepancies more easily, for example MatlabStan not being included as an official project while StataStan is.


@jonah I don’t think I’m explaining myself well so let me try one more time.

My concern is not other PPLs using posteriordb to make comparisons to Stan – that would be great.

I am mildly concerned with having explicit APIs for other PPLs, instead of a more flexible plugin framework that would avoid having to maintain consistency with those PPLs, but that’s a relatively small consideration.

By far my biggest concern is defining the fundamental goal of a PPL. I think we’re mostly in agreement that Stan is all about Bayesian inference – the user supplies a model and then Stan tries to estimate posterior expectation values, and their error, as robustly as possible. The challenge is that not all PPLs have this same goal! I have been fortunate to have been around the PPL community for a long time, and in that time I’ve encountered a variety of different goals and even fluctuating goals within various projects. Of course for (academic) research projects this is somewhat expected, but it does emphasize that PPLs are not a monolithic community, and tools that work for Stan will not be sufficient for other PPLs and vice versa.

The question then is what does it mean to support other PPLs? It’s one thing to facilitate other PPLs/codes using the posteriordb functionality, but it’s another for that functionality to be expanded to support PPL goals that go beyond those of Stan. To be clear this is not a judgement of the ambition for an expansive project that wants to support all of that functionality but rather a statement that such ambition just isn’t appropriate for an official Stan project, in my opinion.

For anyone reading, please evaluate my comments in this context. A focus on functions and information about their expectation values is aligned with the basics of probabilistic, and hence Bayesian, computation. The use of Stan implementations of MCMC expectation value and error estimators for baseline comparisons ensures that the baselines are consistent with Stan’s functionality.


I hope I am not misrepresenting others, but my understanding is that this is the root of the disagreement: Mike would like a relatively tight specification of the project and its goals. If I understand @mans_magnusson and others correctly, they consider such a tight specification premature and not necessarily useful/desirable. I am also not sure the discussion brought us any closer to agreement on this point (while I feel agreement on other aspects has mostly been reached). If this is a correct assessment, I think moving to a vote might be the best course of action, possibly with voting on three options (exact formulation pending): “Include as is”, “Include if tighter specification is given”, “Do not include” - and require an option to get an overall majority to pass (with a possible second round between the top two options if this does not happen).

I would also personally like to resist a bit the call to define a “general process for including projects in Stan”. I think there is huge variation in the motivations to include a project, which warrants a case-by-case approach, and I don’t think including a new project is something we would do often enough to justify a rigid process - or to let us debug the process once in place.

I think the idea to have some middle ground of “Stan endorsed” or “Stan affiliated” projects that are not necessarily core is interesting, but I am not sure it adds anything beyond the current state - most of the projects that would fall into this category are routinely discussed on the forums, linked from the Stan website, etc., so I think they are de facto endorsed already and I am not sure this additional structure would be so useful.


For user experience, especially for new users, it is just not a great experience to rely on the forums to discover all the “Stan endorsed” or “Stan affiliated” projects. I would like something simple like what R has with its CRAN task views. For example, here is the site for the time series view: https://cran.r-project.org/web/views/TimeSeries.html. It lists a small description and a link to each relevant package, organized by different focuses. I think this is simple, effective, and better than what currently exists.


Hi!

Sorry for my delayed response (crazy last week). Yes, I think that is a good summary of the disagreements and it would probably be good to progress to a vote.

/MĂ„ns


I’ve been busy, too, but today had time to check back to this.

It seems there’s still some confusion about the goals of posteriordb:

  • I think the goals of posteriordb are the Stan goals, that is, the best possible Bayesian inference. Including support for other PPLs would not change that, which if I understood correctly was one of @betanalpha’s worries.
  • This autumn we have talked (and continue talking) with the developers of PPLbench and Inference Gym, which are the two other testing/benchmarking frameworks we are aware of. With the developers of these other packages we agreed that the goals of the three packages are different and it’s sensible to continue as three separate projects. Thus there is no need to include support for goals that are not about the best possible Bayesian inference, which if I understood correctly was another of @betanalpha’s worries.
  • We also agreed with the other package developers that there is enough overlap that it’s helpful to keep talking and sharing information. The main collaboration would be sharing models and reference results, but also agreeing on common summaries and model sets for the common goals, such as the best possible Bayesian inference. If I understood correctly, @betanalpha agrees that it would be great if other PPL devs would share this goal.
  • Supporting other PPLs is planned to be lightweight, and by sharing models and reference results with others the maintenance work is aimed to be kept minimal.

So if I’ve understood correctly there is not much disagreement after all?


Thanks for weighing in. I will call a vote in the upcoming few days (a bit busy today and want to make sure I summarise the situation right, so it will take me some time). My current thinking is that the best course would be to have 3 options - roughly “promote posteriordb without further conditions”, “promote posteriordb if it will be more tightly aligned with the Stan core”, and “do not promote posteriordb”. The voting process is (intentionally, so far) unclear on how to handle multiple options, but my preferred variant for this case would be that if any option gets more than 50% of the votes it is selected, otherwise there will be a second round between the top two options.

As I said, I will take some time to make sure I summarise stuff well and that the formulations make sense. If you have any feedback on the rough sketch above or if you want to consult on the exact wording, let me know.


I certainly don’t have any disagreements with those precise goals! I am particularly relieved to hear about the agreement on the different goals of the different projects.

That said, I personally think that much of the confusion has arisen from the fact that these precise goals have not been consistently communicated throughout the entire thread (perhaps just due to miscommunication, perhaps due to more recent meetings). Perhaps more importantly, in my opinion the current design of posteriordb is not yet completely aligned with these goals.

For example, the simplest way to achieve these goals is a schema where every entry in the database contains a model name, unstructured model metadata (say a description of the model, references, etc.), and then a list of expectands and validated expectation values (for example, this is the structure of stat_comp_benchmark). Provided that the expectation values for every entry are always validated by either analytic results or Stan, the database will always be compatible with Stan no matter how it is used (moreover, other communities could fork the database and add their own entries with their own validation without compromising the original Stan-centered database). In this case I would be very excited to support the inclusion into Stan, as I hope I have expressed in previous threads.
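To illustrate, such an entry might look like the following; all names, fields, and numbers are placeholders of my own, not actual posteriordb content or validated values.

```python
# Hypothetical database entry following the schema sketched above.
entry = {
    "name": "eight_schools_centered",
    "metadata": {
        "description": "Centered hierarchical model for the eight schools data.",
        "references": ["Rubin (1981)"],
    },
    # Each expectand pairs a function of the parameters with a validated
    # expectation value and a record of how that value was validated.
    "expectands": [
        {"expectand": "mu",   "value": 0.0, "validation": "analytic"},
        {"expectand": "mu^2", "value": 0.0, "validation": "analytic"},
        {"expectand": "tau",  "value": 0.0, "validation": "long Stan run"},
    ],
}
```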

Part of the problem with the current posteriordb schema is its use of the posterior package, which doesn’t adhere to this particular structure, focusing more on the behavior of the pushforward distribution of each expectand instead of just the expectation value. Much of this information can be broken up into multiple expectands and included in the schema above, but it would require not using posterior directly.
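For example (a standard identity, not something from posteriordb itself), a quantile check can be recast as an expectation value by using an indicator function as the expectand:

```latex
% The p-th quantile q_p of the pushforward of f is characterized by the
% expectation of an indicator expectand, so distributional summaries can
% be folded into the expectand list of the schema above.
\mathbb{E}_{\pi}\!\left[ \mathbb{I}\!\left( f(\theta) \le q_p \right) \right] \;=\; p
```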

I think that part of the issue here arises from the dual need of communicating the final results while also storing/communicating the validation for reproducibility and transparency. In my opinion the current posteriordb implementation intertwines these two steps too much, so that the user-facing schema is too influenced by how the validation is performed. One quick resolution would be to decouple these: agree on a common, fixed user-facing schema and then work on the implementation details (unifying the validation in the base C++ instead of using interface-specific code, defining schemas for sharing samples or other intermediate results, etc.) moving forward (for example, with posteriordb as a Stan project where the scope of the implementation details can be well-defined).

Thoughts?