For Probabilistic Prediction, Full Bayes is Better than Point Estimators

I just wrote a case study that isn’t particularly Stan-related, but uses Stan.

  • Bob Carpenter. 2019. DRAFT: For Probabilistic Prediction, Full Bayes is Better than Point Estimators.
  • bayes-versus.pdf (364.3 KB)
  • Source code [GitHub]

Comments most welcome (especially if you know how to fix knitr’s table rendering in pdf).

Here’s the abstract:

A probabilistic prediction takes the form of a distribution over possible outcomes. With proper scoring rules such as log loss or squared error, it is possible to evaluate such a probabilistic prediction against a true outcome. This short note provides a simulation-based evaluation of full Bayesian inference, where we average over our estimation uncertainty, and two forms of point estimation, one that uses the posterior mode (maximum a posteriori) and one that uses the posterior mean (as is typical with variational inference). The example we consider is a simple Bayesian logistic regression with potentially correlated predictors and weakly informative priors. To make a long story short, full Bayes has lower expected log loss and squared error than either of the point estimators.
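
To make this concrete, here is a minimal R sketch (not the case-study code; the "posterior draws" below are simulated purely for illustration) of the difference between the full-Bayes prediction, which averages the event probability over posterior draws, and the plug-in prediction from the posterior mean:

```r
# Minimal sketch: full-Bayes predictive probability vs. posterior-mean plug-in.
# The posterior draws here are simulated stand-ins, not draws from a real fit.
set.seed(1234)
S <- 4000                                  # number of posterior draws
beta_draws <- cbind(rnorm(S, 0.5, 0.4),    # S x 2 matrix of coefficient draws
                    rnorm(S, -1.0, 0.6))   # (intercept, slope)
x_new <- c(1, 0.3)                         # new input: intercept plus one predictor

# Full Bayes: average the event probability over the posterior draws.
p_bayes <- mean(plogis(beta_draws %*% x_new))

# Point estimate: plug the posterior mean into the inverse logit.
p_point <- plogis(sum(colMeans(beta_draws) * x_new))

# Because the inverse logit is nonlinear, these generally differ (Jensen's
# inequality); the case study evaluates which fares better under log loss
# and squared error.
c(full_bayes = p_bayes, plug_in = p_point)
```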

There’s also a bit on evaluating proper scoring rules.
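
In case it helps to see them written out, here's a small R sketch of the two per-outcome scoring rules (the function names are mine, not from the case study); lower is better for both:

```r
# Per-outcome versions of the two proper scoring rules, for a binary outcome
# y in {0, 1} and a predicted event probability p.
log_loss <- function(y, p) -(y * log(p) + (1 - y) * log(1 - p))
sq_error <- function(y, p) (y - p)^2       # quadratic (Brier) score

log_loss(1, 0.8)   # 0.223: outcome 1 predicted with probability 0.8
sq_error(1, 0.8)   # 0.04
```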

I should’ve done this ages ago. I’ve done things like this in my repeated binary trial case study, but that was in the context of binomials and it was buried among a lot of other stuff. I committed the pdf and html to the repo, so if you want the html, it’s there.


This is great. I think you should also show that we can estimate the ELPD of held-out data just fine using loo when full Bayes is used, so there is an additional gain to be had from conditioning on all the available data. @avehtari has also added some loo functionality for VB to the upcoming RStan release.
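
For anyone who wants to try that, here's a rough sketch of what the loo check could look like, assuming an S x N log-likelihood matrix computed from posterior draws (independent simulated draws stand in for a real Stan fit here):

```r
# Sketch of estimating ELPD with PSIS-LOO from an S x N log-likelihood matrix
# whose entry [s, n] is log p(y[n] | theta[s]); simulated draws stand in for
# a real posterior here.
library(loo)

set.seed(1234)
N <- 200
S <- 4000
x <- rnorm(N)
y <- rbinom(N, 1, plogis(0.5 - x))          # simulated binary outcomes
beta_draws <- cbind(rnorm(S, 0.5, 0.1),     # stand-in "posterior" draws
                    rnorm(S, -1.0, 0.1))
eta <- beta_draws %*% rbind(1, x)           # S x N matrix of linear predictors
log_lik <- matrix(dbinom(rep(y, each = S), size = 1,
                         prob = plogis(as.vector(eta)), log = TRUE),
                  nrow = S, ncol = N)

# Relative effective sample sizes (the draws here are independent, so ~1).
r_eff <- relative_eff(exp(log_lik), chain_id = rep(1:4, each = S / 4))
loo(log_lik, r_eff = r_eff)                 # PSIS-LOO estimate of ELPD
```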


Thanks for this new case study. These case studies are extremely useful!

This reminds me of a paper by Marc Lavielle and Benjamin Ribba (https://rd.springer.com/article/10.1007%2Fs11095-016-2020-3, or see https://hal.archives-ouvertes.fr/hal-01365532/document).

In a non-Bayesian setting, they show that rather than maximizing each individual’s conditional distribution of the model parameters, it is preferable to sample from it randomly, which yields values better spread out over the marginal distribution of the individual parameters.


How do you sample over parameters in a non-Bayesian setting? (I think they use empirical Bayes, which means point estimating some parameters and then sampling over others.)

We don’t want marginal parameter distributions; we want joint ones for full Bayesian inference.

I was also focusing on probability estimation, not on Type I error rates.