Beta-release Bayesian Posterior Database

We are finally getting close to finalizing a beta version of the posterior database (posteriordb): a store of models, data, posteriors and posterior draws (with information on how they were computed), meant to make it easier to use known Stan models for algorithm and diagnostic development, testing and teaching.

The main goal is to have all model code, data and so forth in one format, extensively tested to work, with easy access from both Python and R. The database is also meant to be a simple way of communicating posteriors: just give a posterior name and everything can be accessed from posteriordb.

  • posterior: a JSON description binding the components below together.
  • model: the model code (currently only Stan, but hopefully other probabilistic programming frameworks in the future).
  • data: the data which, together with the model, defines the posterior.
  • gold standard: verified, (almost) independent posterior draws, computed either with dynamic HMC or analytically.
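
As a rough sketch of the kind of information the JSON description binds together (the field names here are illustrative, not the exact posteriordb schema), shown as an R list:

# Illustrative only: the actual posteriordb JSON schema may use different field names.
posterior_info <- list(
  name = "eight_schools",           # posterior name, as used in posterior("eight_schools")
  model_name = "eight_schools",     # which model code (currently Stan) to use
  data_name = "eight_schools",      # the data that, together with the model, defines the posterior
  keywords = c("stan_benchmark"),   # keywords used to select subsets of posteriors
  dimensions = 10,                  # number of parameters
  gold_standard = "eight_schools"   # reference to the verified posterior draws and how they were computed
)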

To access a posterior from R, just use

remotes::install_github("MansMeg/posteriordb", subdir = "rpackage/")
library(posteriordb)
eight_schools <- posterior("eight_schools")
# Access Stan code
stan_code(eight_schools)
# Get data
get_data(eight_schools)
# Get posterior draws (returns a draws object from the posterior R package)
gold_standard_draws(eight_schools)

or

pkpd <- posterior("pkpd")
stan_code(pkpd)
get_data(pkpd)
gold_standard_draws(pkpd) # No gold standard draws have been computed for this posterior yet, though

For more details, see the GitHub page: https://github.com/MansMeg/posteriordb
Currently, the database has the Stan benchmark models by @betanalpha (except the simulation of multivariate normal draws), Bayesian linear regression, and a Latent Dirichlet Allocation model and data. The next goal is to fill it with 200+ Stan example models and then keep adding more.

The posteriors, data and models have keywords, so it is possible to select a specified set for testing, e.g. “stan_benchmark” for the set currently used to benchmark Stan, or “multimodal” if we know that the posterior distribution is multimodal. It is also possible to access the dimension of the posterior, so tests can be run only for models with more than 100 parameters, for example.
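
A hypothetical sketch of how such a selection could look from R; posterior_names(), keywords() and dimensions() are assumed accessor names and may differ from the actual posteriordb API:

# Hypothetical sketch: posterior_names(), keywords() and dimensions() are assumed
# accessor names; dimensions() is assumed to return the total number of parameters.
library(posteriordb)

all_posteriors <- lapply(posterior_names(), posterior)

# Keep only Stan benchmark posteriors with more than 100 parameters
test_set <- Filter(
  function(p) "stan_benchmark" %in% keywords(p) && dimensions(p) > 100,
  all_posteriors
)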

We’ll add more functionality to make testing and summarizing the test results easier.

Although we are starting out with Stan, the idea is to allow benchmarks for other probabilistic programming frameworks as well in the long run.

Tagging people who may be interested

  • @betanalpha We have tried to add the benchmark models, but may have missed something when we tried to describe the models.
  • @stevebronder for speed benchmark testing?
  • @bbbales2 for adaptation testing?
  • @yuling for our discussion on the LDA model and data
  • @breckbaldwin for making Stan a reference. The next step is methods for comparing posteriors, but the rest is now done.

/MĂĄns

11 Likes

I really like the idea and I am glad you’ve created that.

I am however somewhat underwhelmed by the gold standard criteria - for non-analytic posteriors, shouldn't you also require that the model that generated them passes some level of Simulation-based calibration (SBC) checks? SBC is AFAIK the strongest check we currently have, and is especially powerful for detecting when you compute a valid posterior (no warnings, good Rhat, n_eff, …) but it is actually a different posterior than the one you believe you have computed. Sometimes you can also get a no-warning Stan run that ends up actually being slightly problematic after SBC… So just a thought… Obviously you don't have to do this, or maybe there is a good reason to avoid SBC that I am missing.

1 Like

Excellent progress and well done.

Breck

1 Like

First, we accept PRs where SBC validation has been done; this can be indicated with a keyword, and you could then use only the posteriors with the SBC keyword. But…

We also accept posteriors conditioned on real data, and then we don't have the model that generated the data.

SBC can't validate that a specific set of posterior draws is valid. SBC uses simulated data from the model it is checking, i.e., the model is the true data-generating mechanism. Real data might be different, and then there is model misspecification and the posterior might be quite different. SBC also averages over draws from the prior, and thus does not validate any specific posterior. If we were adding simulated data from a known model, we would be interested in calibration near the parameters actually used to generate the data, and thus we could use SBC with a quite narrow distribution around those parameter values; but even then it can't validate that one specific set of posterior draws.

With stochastic algorithms we always have the possibility of occasional failures.

SBC is very useful for analysing whether a certain algorithm works well for a certain model over a range of parameter values. We should probably also require running SBC with different amounts of observations (e.g. the shape of the funnel in hierarchical models depends on the amount of data, too). I would recommend using SBC with any algorithm for a completely new type of model before submitting such models to posteriordb. The gold standard can also be made with other algorithms than dynamic HMC in Stan. However, as discussed above, SBC is not a sufficient diagnostic.

We expect to have many posteriors which are known to work well with dynamic HMC. Many posteriors that are reliably sampled with dynamic HMC might still be difficult, e.g. for ADVI, or there can be big differences in dynamic HMC sampling speed given different mass matrices. I think it would be unnecessary to run SBC for all of the models related to these “easy” posteriors, but I would not reject additional PRs reporting successful SBC results and adding the SBC keyword.

We would like this to be a community effort, and we are not setting restrictions on how many checks people do for existing or new models and posteriors. @bgoodri and @Bob_Carpenter have done some work on making SBC testing easier. With keywords you also have the possibility to skip all cases where the SBC keyword is missing.

Did this help?

2 Likes

Thanks, that clarifies a lot and makes a ton of sense.

Hi–we’ve already done this for the 400 or so Stan example models in R. We did it a couple years ago, though, so maybe it needs to be updated. It wasn’t hard; I think Rayleigh Lei just wrote a script and did it, based on the rstan models that Ben Goodrich has set up. So feel free to add these to your list.

Hi all!

@martinmodrak SBC is definitely on the list, and it should probably be included further on. An additional problem to Aki's list - based on my knowledge of SBC and Stan - is that we need to simulate from the prior, and there is no general way to simulate data from the prior given only a Stan model. Hence we would need to add this model by model, specifying the priors for each model.

@andrewgelman That's great! Where can I find those models and data? I guess that will speed things up.

The API for the database looks really nice and overall this is a good start, but I am a bit concerned about the emphasis on samples and the ambiguity of “gold_standard”.

Ultimately each entry in the database defines a probability distribution, and the interest should be in specifying baseline expectation values with respect to that distribution. An ensemble of exact samples allows for Monte Carlo estimators of expectation values of sufficiently nice functions (i.e. cube integrable functions) with quantifiable, albeit probabilistic, error which can then be used to define a baseline when the error is sufficiently small. The problem with returning samples is that all of the estimator construction and validation falls on the user.
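
To make that concrete, here is a minimal sketch of the estimator construction that currently falls on the user, assuming the gold-standard draws are (almost) independent and that gold_standard_draws() returns a draws object from the posterior package, as described earlier in the thread:

# Minimal sketch: Monte Carlo estimate and standard error from gold-standard draws.
# Assumes (almost) independent draws and the accessors shown earlier in this thread.
library(posteriordb)
library(posterior)

d  <- gold_standard_draws(posterior("eight_schools"))
mu <- extract_variable(d, "mu")        # draws of one quantity of interest

est  <- mean(mu)                       # Monte Carlo estimate of E[mu]
mcse <- sd(mu) / sqrt(length(mu))      # standard error under the independence assumption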

I understand that it would be significantly more work, but what I would really like to see is functionality to give the “posterior” object a function and return a validated estimator (checking cube integrability empirically) with baseline error (estimated empirically), which the user could then use to make principled comparisons to other algorithms. Or alternatively a set of default functions whose expectation value estimators have already been run and validated.

Tangentially I would also recommend more samples – I believe that I used at least 100,000 for the stat_comp_benchmark baseline expectation values. The standard error should really be beaten down to the sub percent level.
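
As a back-of-the-envelope check of that recommendation (assuming roughly independent draws):

# Monte Carlo standard error of a mean scales as sd / sqrt(N) for N independent draws.
mcse_as_fraction_of_sd <- function(n) 1 / sqrt(n)
mcse_as_fraction_of_sd(10000)    # ~0.01, about 1% of the posterior sd
mcse_as_fraction_of_sd(100000)   # ~0.003, i.e. sub-percent as suggested above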

The other issue is with the definition of “gold_standard” – what exactly is it? Exact samples carry with them different guarantees than very long dynamic HMC chains, for example. In particular given a multimodal model like LDA I wouldn’t trust MCMC output at all.

One option would be to add tiers beyond gold. For example, gold for having exact PRNG samples and silver for running ensembles of long dynamic HMC chains with verified diagnostics. Or gold for exact samples, silver for models with simulated data verified with both HMC diagnostics and SBC, and bronze for just HMC.

Relatedly, gold_standard_info should also include the version of R, the Makevars file, the system info – anything that could influence the values created. It should also include any diagnostic checks that were run so that the user knows how they were validated (for example split Rhat for all coordinate functions below 1.1, no divergences, no E-FMI warnings, etc).
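
In R, for example, most of that environment information could be captured with something like the following sketch (the Makevars location and the exact set of items to record are assumptions):

# Sketch of recording the run environment in R; what exactly posteriordb should
# store, and where Makevars lives, are assumptions here.
makevars_path <- file.path(Sys.getenv("HOME"), ".R", "Makevars")
run_info <- list(
  session  = utils::sessionInfo(),    # R version, platform, loaded packages
  system   = Sys.info(),              # OS and machine information
  makevars = if (file.exists(makevars_path)) readLines(makevars_path) else NULL
)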

Finally a word of warning – the stat_comp_benchmark models were designed to be a very low bar and I recommend that you consider focusing on models more similar to actual applied use to fill out the database before considering its use to empirically validate methods and the like.

4 Likes

Thanks @betanalpha for the useful comments.

Can you make this suggestion more concrete? We're still in the process of adding tools for comparing new results to the “gold standard”, and suggestions with as many details as possible are welcome. The idea of storing “gold standard” draws was that the user is not limited to a pre-defined set of expectations, but we are happy to get recommendations for any default set.

Preferably we would want to have more than 100k, but the current choice is a compromise. For some posteriors we could sample even more than 100k draws and compute a set of pre-defined expectations, to save the users of the database time and memory. Those using Stan could also get more draws at any time if there is doubt that 10k is not enough. As the database is meant for many things, I'm not sure we would like to have 100k for all.

Sure, and we don't currently have a “gold standard” for LDA; it's set to NULL. We think it's useful to also have models and posteriors which we know to be difficult. Using the keywords and whether a “gold standard” exists, it is possible to run tests for only the cases you trust.

The database has information on how the “gold standard” has been obtained, but it might also be useful to have these kinds of tiers as shorthand notation for the most common choices.

I agree

We know that about stat_comp_benchmark, and that is one reason why we have been working on this.
We hope this will be a community effort and that people will submit pull requests for their favorite models and posteriors. Even without making the effort of a complete pull request, people can just recommend models with best-practices Stan code and a description of why the model/posterior is interesting.

3 Likes

Hi, looks good. Is there a way to download these models from online storage? I would be interested in adding a wrapper for ArviZ (like we have currently for some datasets)

data = az.load_arviz_data("centered_eight")
1 Like

+1 this!

1 Like

Also cool and good! If this is ready for more public consumption, it could be good to tweet about it using the Stan Twitter account.

Mans:

Yes, I definitely think it would be good for you to include these examples. We set up the simulations as a way of testing ADVI (which in its current implementation often failed; we have some thoughts on how to improve it, but that’s another story.)

I’ve put it all at http://www.stat.columbia.edu/~gelman/temp/advi_test.zip
You should also get Ben Goodrich’s advice on this, as I think he may be the one who set up these models in rstan.

I’ve only used this corpus once (for the above-mentioned ADVI study, which unfortunately we never wrote up; in future I want us to write up everything we do!), but in that one case it was very helpful, so it would be great to include it in our Stan corpus.

Regarding Martin’s comment: Here we’re not doing SBC; we’re just running NUTS for multiple long chains in order to get a baseline that can be used to compare to alternative algorithms such as ADVI, GMO, or whatever.

My concern is that most users won’t know the technical details of how to use the gold standard samples properly and hence possibly misuse them. If the posteriordb had functionality to take in a function and compute baseline expectation values then it could handle the principled estimation of baseline values and establish the convention of comparing algorithms based on how well they recover those values. For example,

> model <- posterior("model_name")
> baseline_expectation(model, my_function)

Baseline expectation value for "my_function" is 5.832 +/- 0.001

An alternative would be to have a list of precomputed functions, something like

> model <- posterior("model_name")
> precomputed_expectations(model)

function_name1
function_name2
function_name3

> baseline_expectation(model, "function_name1")

Baseline expectation value for "function_name1" is 5.832 +/- 0.001

Yeah, larger ensembles can be really demanding on the database throughput, but I worry that with smaller samples the baseline expectation values won't be accurate enough. If users are set up to run Stan to generate more samples, then couldn't they compute their own expectation values in the same way? In other words, having precomputed high-precision expectation value estimators for a default set of functions would provide a lightweight baseline that users could then extend however they might need.

Sure, I think the challenge will just be refining the user experience so that users understand this clearly enough. It's worth carefully thinking about now, but not to the point of slowing development, as these details can always be changed in the formal release.

I would personally err on the side of having too much stratification here as opposed to too little, to fight the common tendency to presume that everything within a stratum is equivalent. In other words, if a user wants to test against models that have been validated in different ways, then they have to explicitly group them together and acknowledge that.

Totally. I was just emphasizing that we should not yet consider testing against it as sufficient motivation for algorithm development and the like. Running tests to help the development of posteriordb is definitely worthwhile.

1 Like

I agree, and currently we are missing functions and recommendations for the usage, but they are on the to-do list.

Thanks, this was a clear example and we should include something like this.

Maybe we could compute some pre-defined expectations with a larger sample size, and include the smaller sample so that it is also easy to compute other expectations and make visualizations like envelope plots comparing two distributions?

We are including the full Stan command, so if the user notices that the accuracy is not sufficient they can run more. The idea of having something pre-computed was to allow faster experimenting and, in some cases, running experiments without installing Stan, e.g. when testing new algorithms implemented in Python.

Yes. What would be a good set? Marginal means and variances, but what else?

I agree with the rest of your comments

One thing we are trying to make soon is a page describing different use scenarios, which would hopefully clarify the many objectives we have been thinking about.

1 Like

Sure, especially if gold_standard_draws had a warning like “Returning a relatively small ensemble of verified draws which should not be used for precision comparisons”.

My ultimate fear is that users will try to compare things by eye using envelope plots, histograms, density estimator plots, empirical CDF plots and the like, which don't immediately communicate the uncertainty due to the limited sample size, and which lead users to take some deviations too seriously or not seriously enough.

Yeah, there should definitely be some precomputed output that doesn't require running Stan. I personally would prefer that output to be the expectation values themselves, so that, for example, someone writing up their own algorithm in Python wouldn't make heuristic comparisons based on qualitative sample visualizations instead of quantitative comparisons to the expectation values.

The preponderance of stats papers that validate new algorithms by comparing kernel density estimates from samples makes me very sad and the posteriordb gives us an opportunity to push against that convention a bit.

Maybe some interval probabilities, especially in cases where the marginal means and variances might be super noisy? Something like indicator[theta > 0] would be useful for well-centered models, but this would require some curation to find a decent threshold to test against for general models.
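
For instance, such an interval probability and its Monte Carlo error are cheap to compute from the draws; a sketch, where the parameter and threshold are placeholders that would need curation per model:

# Sketch: estimating Pr(theta[1] > 0) and its Monte Carlo standard error from draws.
# The parameter ("theta[1]") and the threshold (0) are placeholders to be curated per model.
library(posteriordb)
library(posterior)

theta <- extract_variable(gold_standard_draws(posterior("eight_schools")), "theta[1]")

p_hat <- mean(theta > 0)                              # estimate of Pr(theta[1] > 0)
mcse  <- sqrt(p_hat * (1 - p_hat) / length(theta))    # binomial MCSE, assuming independent draws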

Looking forward to it.

1 Like

This is a really great resource that we will certainly be using and contributing to. It does start raising some interesting questions about the demographic of problems that users want to tackle and our ability to convey information about the relative benefits of different algorithms in the contexts of those problems. Fun ahead!

2 Likes

The

model <- posterior("model_name")
baseline_expectation(model, my_function)

interface looks like something we should implement in the posterior package (https://github.com/jgabry/posterior), which is already used by posteriordb. Of course, not just for the baseline but for any kind of expectation. In fact, we already have some ideas in that regard, if I understand your idea correctly (https://github.com/jgabry/posterior/issues/8).

1 Like

Hi all!

Nice discussion and very good suggestions for further development.

@ahartikainen It would be great to set up a connection to get posteriors, data, and models directly from GitHub through Python. Hopefully, this can also enable quick access to a RESTful API further on. Unfortunately, connections to GitHub are not implemented for Python yet, and I'm not sure when they will be. I'll set it up as an issue, and it would be great to get some help from the Python people here. I don't think much more work is needed.

@betanalpha @avehtari @paul.buerkner:
I agree about the expectations: parameter expectations and variances could be computed using 100,000 draws and stored separately in addition to the smaller sample. If we get a “real” database up, we could extend the number of stored draws to 100,000 further on.

Regarding what expectations to use, I think, as @paul.buerkner noted, this should probably be implemented more generally in the posterior package. Right now, the samples from posteriordb are returned as a posterior draws object to facilitate quick analysis of the draws. Hence the estimators should preferably be implemented in the posterior package, so the functionality would be available more generally in the Stan R ecosystem.

I also think that SBC should be included as a slot on each model and framework (Stan in this case) to test the specific model at hand. I have put that on the to-do list, as well as including more details regarding:

  • The R/Python version, the Makevars file, and the sessionInfo()/Sys.info() output (and the equivalent for Python)
    Is there anything more that should be included for reproducibility?

Regarding diagnostics: currently we only check divergences, Rhat and ESS (bulk and tail). In addition, we should check that the E-FMI is higher than 0.3. Is there anything else that should be checked for reference posterior draws?
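
As a sketch of how the Rhat/ESS part of those checks could be scripted with the posterior package (the thresholds here are illustrative, not settled policy; divergences and E-FMI come from the sampler output and are checked separately):

# Sketch of convergence checks with the posterior package; thresholds are illustrative.
library(posteriordb)
library(posterior)

d <- gold_standard_draws(posterior("eight_schools"))

diag <- summarise_draws(d, "rhat", "ess_bulk", "ess_tail")
all(diag$rhat < 1.01, na.rm = TRUE)       # Rhat check
all(diag$ess_bulk > 400, na.rm = TRUE)    # bulk ESS check
all(diag$ess_tail > 400, na.rm = TRUE)    # tail ESS check
# Divergences and E-FMI are read from the sampler output (e.g. rstan's sampler
# parameters) rather than from the draws object, so they are not shown here.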

The next step is also to try to find a way to compare two distributions based on samples, and to find some measure of the distance between distributions. I know @breckbaldwin would ideally want to have one measure that could be used. Any thoughts or pointers there, @betanalpha? We have been discussing maximum mean discrepancy (MMD) loosely but have not yet got started with this: http://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf
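
For reference, a minimal sketch of the unbiased MMD^2 estimator from the Gretton et al. paper, with a Gaussian kernel and a median-heuristic bandwidth (just the basic estimator, not a worked-out comparison procedure; draws would likely need to be subsampled since the cost is quadratic in the number of draws):

# Minimal sketch of the unbiased MMD^2 estimator (Gretton et al., 2012) with a
# Gaussian kernel and median-heuristic bandwidth. Inputs are matrices with one
# row per draw and one column per parameter; subsample rows for large ensembles.
mmd2 <- function(x, y) {
  x <- as.matrix(x); y <- as.matrix(y)
  m <- nrow(x); n <- nrow(y)
  bw <- median(as.numeric(dist(rbind(x, y))))     # median heuristic bandwidth
  kern <- function(a, b) {                        # cross kernel matrix k(a_i, b_j)
    d2 <- as.matrix(dist(rbind(a, b)))^2
    exp(-d2 / (2 * bw^2))[seq_len(nrow(a)), nrow(a) + seq_len(nrow(b)), drop = FALSE]
  }
  kxx <- kern(x, x); kyy <- kern(y, y); kxy <- kern(x, y)
  (sum(kxx) - sum(diag(kxx))) / (m * (m - 1)) +
    (sum(kyy) - sum(diag(kyy))) / (n * (n - 1)) -
    2 * sum(kxy) / (m * n)
}

# e.g. mmd2(as_draws_matrix(draws_a), as_draws_matrix(draws_b)) with the posterior package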

@andrewgelman Thank you for the link! I may get back to you with questions on this.

ESS / iteration > 0.0001

And the E-FMI threshold should be only 0.2.

Honestly I do not think that this is a productive direction. These kinds of general tests are just too ambitious, and any finite ensemble of samples, even exact ones, doesn't carry enough information to characterize the entirety of a distribution.

Even in the asymptotic regime the power of any test will depend intimately on the assumed form of the alternative model, and only overly simplistic alternatives will allow analytic results at all. See for example the discussion in Section 3.2 of the MMD paper.

In the preasymptotic regime, MMD is limited to the span of the unit ball of the RKHS, which is a relatively small class of functions; consequently, even when everything else works out, we get bounds on only a few expectation values and not the most common ones (means, variances, etc.), which are far outside of the unit ball. To do any better one needs explicit bounds on the tail behaviors of the target distributions.

At an even deeper level, this direction has to fight against a fundamental limitation of distances/divergences between measures – how to interpret non-zero values? A non-zero value means that the two measures are different, but in general we can't translate a value into explicit bounds on how far apart expectation values can be. For example, a finite KL divergence can't even bound the difference between the means of two Gaussians!

The best you can get out of distance/divergence approaches is bounds on the difference of expectation values of a limited set of functions, and hence their utility will depend on how well that limited set covers the functions of interest in practice. Without being able to complement samples with analytic bounds on the actual measures being compared that coverage is pretty small.

Consequently, to get reasonable coverage we'd have to empirically qualify the expectation values, which brings us back to having a list of specified functions and their baseline expectation values.