Big data sets for Bayesian regression?

Matthijs · September 18, 2018, 2:14pm

@andrewgelman, @lauren, @charlesm93 and I have been talking about performing a systematic analysis of (and writing a survey paper about) computational approaches to scaling Bayesian regression (linear, logistic, possibly hierarchical GLMs) to large data sets, where the qualifier “large” means something like

large enough to take an annoyingly/impractically long amount of time to perform posterior inference in a reasonable statistical model for the data.

To ground the survey in real statistical practice, I am looking for (descriptions of) examples of real data sets that people want to fit Bayesian GLMs to but that cause computational problems because of their size.

Question: could anyone provide me with illustrative examples of data sets and models they are working with?

Of course, the data itself may be sensitive. What I am really looking for is to get a sense of the structure of the data (sample size, number of covariates) and the corresponding model (any hierarchical structure used; particularly, the number of parameters and their hierarchical structure). The motivation is that we can then run our benchmarks in regimes of datasets and models that people are actually interested in.

Thank you!!

Also, if anyone is interested in taking part in our discussions/getting involved, please let me know!

avehtari · September 18, 2018, 3:41pm

I’m interested in the discussions. @anon75146577 remembers a paper that compared inference methods on big spatial data, which could give some useful ideas.

I know some open datasets with a very large number of covariates, such that the computation time for Gaussian or logistic regression fits your description. I’ll send links via email (tomorrowish).

kaybenleroll · September 18, 2018, 4:18pm

I’m happy to get involved in this as well.

Most insurance datasets are proprietary, but we often work either with reasonably sized datasets that are out there, or with fake datasets we generate that look realistic enough to be useful.

Corey_Sparks · September 18, 2018, 4:42pm

I use Stan on the IPUMS data (https://usa.ipums.org/usa/), which has millions of observations. As long as the models aren’t too complicated, it runs as fast as I would expect. I also use it for smaller data (~100,000 observations) with run times of about a day, and for spatially/temporally autocorrelated data with tens to hundreds of thousands of observations, with run times of days.

-CS

Guido_Biele · September 18, 2018, 4:45pm

I am working with data from cohort studies, which are not extremely large but can be slow to fit because of the likelihood function we use.

A typical example would be a regression with around 30,000 individuals, each measured on multiple occasions (e.g. 3) and with multiple rating scales (e.g. 5), so that we end up with, for example, 3x5 outcomes per individual that we want to model. There are only around 5-15 predictors in the model.

An apparently innocent hierarchical model with these data can take more than a week to fit, if one uses a beta-binomial likelihood to model sum-scores (i.e. the sum of the ratings for a scale).
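To give a rough feel for why this kind of likelihood is costly, here is a minimal Python sketch of a beta-binomial log-likelihood. It is not the model described above: the (alpha, beta) shape parameterization and the function names `log_beta`, `beta_binomial_logpmf`, and `total_loglik` are assumptions made for illustration. In this naive form each outcome needs nine log-gamma evaluations, and with the example sizes above there are 30,000 x 3 x 5 outcomes per evaluation of the log density.

```python
import math


def log_beta(x: float, y: float) -> float:
    """Log of the beta function, computed via log-gamma for numerical stability."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def beta_binomial_logpmf(k: int, n: int, alpha: float, beta: float) -> float:
    """Log pmf of a beta-binomial: k successes out of n, shape parameters alpha, beta."""
    if not 0 <= k <= n:
        return -math.inf
    # log of the binomial coefficient C(n, k)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def total_loglik(scores, n: int, alpha: float, beta: float) -> float:
    """Sum of log pmfs over all observed sum-scores (one shared alpha, beta for simplicity)."""
    return sum(beta_binomial_logpmf(k, n, alpha, beta) for k in scores)


# Number of outcomes in the example sizes from the post: individuals x occasions x scales.
n_outcomes = 30_000 * 3 * 5
```

In a real hierarchical model the shape parameters would vary by individual, occasion, or scale, and the sampler would also need gradients of every term, so the cost per iteration is higher than this sketch suggests.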

My more general point is that it would be great if you could also look at models other than linear and logistic regression, because when the likelihood function is not easy to evaluate, estimating models with not-so-very-large data can be quite slow.

jan-glx · September 20, 2018, 5:40am

In typical single-cell RNA sequencing datasets, with 1000-100000 observations of 2000-10000 dependent variables and fewer independent variables, sampling-based posterior inference is currently infeasible. For example, see this attempt to model the Perturb-seq dataset (doi:10.1016/j.cell.2016.11.038) with very simple models using Stan:

SCHiRM: Single Cell Hierarchical Regression Model to detect dependencies in read count data
Jukka Intosalmi, Henrik Mannerstrom, Saara Hiltunen, Harri Lahdesmaki
doi:10.1101/335695

avehtari · September 21, 2018, 7:58am

This would be a good example, also because the authors are from my department!

Ilan_Strauss · December 5, 2018, 3:34pm

Economist here: firm level data across countries and time. Around 400,000 data points. Income statement and balance sheet data. Hierarchical models. Endogeneity issues.

The data comes from Compustat Global and Compustat North America via the WRDS interface at Wharton business school (not my school but a common access point).

andrew222651 · September 16, 2019, 10:58pm

@Matthijs any update on the project?

EllenIAH · September 16, 2019, 11:30pm

I’d be interested in hearing about the results of this. If you’re still looking for examples, I have some and can share them.

lauro · October 25, 2020, 2:12pm

Hey! I know this is an old post, but have you written up the paper yet? I’d be curious to see the results.

tinosai · July 30, 2022, 7:04am

Any updates? Haven’t lost hope yet.
