Request for comments: Bayesian Posterior Database

mans_magnusson · August 15, 2019, 2:00pm

Hi everyone!

@avehtari, @paul.buerkner, Eero Linna and I have put together the first draft of a posterior database. The idea is to collect posterior distributions (i.e. data and models) and potential gold standard posterior samples in a database for easy access. Right now it only contains a few examples, but we hope that once the structure and general idea are set, it will be relatively quick to add new posteriors, models and data. The repository README should contain all necessary information.

All comments are welcome!

Hopefully, we will also soon be done with bayesbenchr, an R package that can be used to evaluate different diagnostics and inference methods for the posteriors in the posterior database.

ahartikainen · August 15, 2019, 2:29pm

Hi, sounds great!

ArviZ would be interested in implementing a wrapper for this.

We currently have some simple example posteriors so users can start to play with them.

github.com/arviz-devs/arviz/blob/master/arviz/data/datasets.py

"""Base IO code for all datasets. Heavily influenced by scikit-learn's implementation."""
from collections import namedtuple
import hashlib
import itertools
import os
import shutil
from urllib.request import urlretrieve

from .io_netcdf import from_netcdf

LocalFileMetadata = namedtuple("LocalFileMetadata", ["filename", "description"])

RemoteFileMetadata = namedtuple(
    "RemoteFileMetadata", ["filename", "url", "checksum", "description"]
)
_DATASET_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "_datasets")

LOCAL_DATASETS = {
    "centered_eight": LocalFileMetadata(
        filename=os.path.join(_DATASET_DIR, "centered_eight.nc"),

This file has been truncated.

import arviz as az
idata = az.load_arviz_data("centered_eight")

ahartikainen · August 15, 2019, 4:37pm

It would be interesting to also have the following:

posterior (default)
posterior predictive
prior
prior predictive
elementwise log-likelihoods
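
As a rough sketch of how the groups listed above might sit together in one record: the group names mirror the list, but the exact key spelling, the helper `missing_groups`, and the toy numbers are assumptions made up for illustration. None of this is taken from the posterior database or from ArviZ.

```python
# Hypothetical layout: one entry per group, each mapping variable names to draws.
# Group names follow the list above; the key spelling is an assumption.
EXPECTED_GROUPS = (
    "posterior",
    "posterior_predictive",
    "prior",
    "prior_predictive",
    "log_likelihood",
)


def missing_groups(record):
    """Return the expected groups that a record does not provide."""
    return [group for group in EXPECTED_GROUPS if group not in record]


# Toy record with only two groups filled in (placeholder values, not real draws).
record = {
    "posterior": {"mu": [0.1, -0.3, 0.2]},
    "log_likelihood": {"y": [[-1.2, -0.9], [-1.0, -1.1], [-1.3, -0.8]]},
}

print(missing_groups(record))
```

A check like this would let a wrapper treat the posterior as the default group and report which optional groups a given entry lacks.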

mans_magnusson · August 16, 2019, 6:43am

That's great!

Is there some way we could build a Python package that would fit these needs and make it usable from ArviZ? I know Eero is working on a Python package, so any suggestions there? I think we would like the Python and R APIs to be quite similar, so you could look at the R API and see if there is anything you miss or would like to change.

Your suggestions on priors, likelihoods, predictives etc. are something we have discussed and, as you mention, those would be valuable to have. The question is how to include them. Currently we can get them from the Stan code. So then the question is whether we should also include them as R and Python functions.

/Måns

hhau · August 16, 2019, 11:58am

Cool idea! I’ve seen something similar floated in the conclusion of https://arxiv.org/abs/1904.04484, and I’m wondering if you have any thoughts about:

How precise should the submitted posteriors be? What are the precision metrics? (minimum tail ESS?)
Given the substantial computational requirement for some posterior distributions, is it sensible to also store (potentially generative) approximations to the posteriors based on the “gold standard samples”? These approximations could be used to cheaply generate an arbitrary number of samples for the purpose of making figures look good. I guess one can sample from the posterior predictive distribution (PPD) an arbitrary number of times given a fixed sample from the posterior, but what if generating from the PPD is computationally expensive? (Perhaps not the most likely of scenarios)
What kind of metadata are you interested in collecting? (Sampler type / adaptation diagnostics / tuning parameters?) It would be interesting to know the computational time / cost / environment used to achieve the archived samples. This is definitely off topic (more related to ML than MCMC-based statistics), but it would be interesting to collect this information to estimate the energy usage in a manner similar to https://arxiv.org/abs/1906.02243. I’ve seen a few models that have substantial compute requirements and subsequently churn through a lot of cluster time and credits (see Section 4.2 of https://projecteuclid.org/euclid.aoas/1560758424 for example).

These things are probably beyond the scope of your current interests, but I’d be interested to know your collective thoughts on them. I really do like the idea of reusing posterior distributions in future analyses :).

ahartikainen · August 16, 2019, 1:17pm

The simplest thing would be to provide the results as a dictionary (e.g. JSON). That can then easily be transformed to InferenceData with az.from_dict.

We use xarray.Dataset for each group.
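
A minimal sketch of that route, under stated assumptions: the JSON payload, its flat `{"variable": [draws...]}` shape, the chain-by-chain ordering, and the helper names `draws_from_json` and `to_inference_data` are all hypothetical. The `az.from_dict(posterior=...)` call assumes the ArviZ 0.x signature, which expects arrays laid out as (chain, draw).

```python
import json


def draws_from_json(text, n_chains):
    """Parse {"variable": [draw, ...]} JSON and split each variable into chains.

    Assumes draws are stored chain after chain. Returns nested lists with a
    (chain, draw) layout: {"variable": [[chain 0 draws], [chain 1 draws], ...]}.
    """
    flat = json.loads(text)
    by_chain = {}
    for name, values in flat.items():
        if len(values) % n_chains != 0:
            raise ValueError(
                f"{name}: {len(values)} draws do not split into {n_chains} chains"
            )
        per_chain = len(values) // n_chains
        by_chain[name] = [
            values[i * per_chain:(i + 1) * per_chain] for i in range(n_chains)
        ]
    return by_chain


def to_inference_data(by_chain):
    """Wrap the draws as InferenceData (assumes the ArviZ 0.x from_dict signature)."""
    import arviz as az  # imported lazily so the parsing step works without ArviZ
    import numpy as np

    return az.from_dict(posterior={k: np.asarray(v) for k, v in by_chain.items()})


# Made-up payload, standing in for whatever the database would export.
payload = '{"mu": [0.1, -0.3, 0.2, 0.0], "tau": [1.1, 0.9, 1.3, 1.0]}'
draws = draws_from_json(payload, n_chains=2)
print(draws["mu"])
```

With ArviZ installed, `to_inference_data(draws)` would then give an InferenceData object whose posterior group is an xarray.Dataset.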

avehtari · August 16, 2019, 2:42pm

Practically independent draws, obtained by running long chains with no divergences and thinning them to the desired number of draws.
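
The thinning step could be sketched roughly as follows. This is only an illustration with a hypothetical helper `thin` and a stand-in chain, not the procedure actually used for the database. Thinning by itself does not guarantee independence, so one would still check autocorrelation or ESS on the result.

```python
def thin(chain, n_draws):
    """Keep n_draws evenly spaced draws from a longer chain."""
    if n_draws <= 0 or n_draws > len(chain):
        raise ValueError("n_draws must be between 1 and len(chain)")
    step = len(chain) // n_draws  # spacing between the draws that are kept
    return chain[::step][:n_draws]


long_chain = list(range(10_000))  # stand-in for 10,000 MCMC draws
kept = thin(long_chain, 1_000)
print(len(kept), kept[:3])
```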

avehtari · August 16, 2019, 2:43pm

Yes
