@andrewgelman, @lauren, @charlesm93 and I have been talking about performing a systematic analysis of (and writing a survey paper about) computational approaches to scaling Bayesian regression (linear, logistic, possibly hierarchical GLMs) to large data sets, where the qualifier “large” means something like

large enough to take an annoyingly/impractically long amount of time to perform posterior inference in a reasonable statistical model for the data.

In order to ground the survey in real statistical practice, I am looking for (descriptions of) examples of real data sets that people want to fit Bayesian GLMs to, but that cause computational problems due to their size.

Question: could anyone provide me with illustrative examples of data sets and models they are working with?

Of course, the data itself may be sensitive. What I am really looking for is a sense of the structure of the data (sample size, number of covariates) and the corresponding model (in particular, the number of parameters and any hierarchical structure). The motivation is that we can then run our benchmarks in regimes of datasets and models that people are actually interested in.

Thank you!!

Also, if anyone is interested in taking part in our discussions/getting involved, please let me know!

I’m interested in discussions. @anon75146577 remembers a paper that compared inference methods on big spatial data, which could give some useful ideas.

I know some open datasets with a very large number of covariates, where the computation time for Gaussian or logistic regression fits your description. I’ll send links via email (tomorrowish).

Most insurance datasets are proprietary, but we often work either on publicly available datasets of reasonable size, or on fake datasets we generate that look realistic enough to be useful.

I use Stan on the IPUMS data (https://usa.ipums.org/usa/), which has millions of observations; as long as the models aren’t too complicated, it runs as fast as I would expect. I also use it for smaller data (~100,000 observations) with run times on the order of a day, and for spatially/temporally autocorrelated datasets in the tens to hundreds of thousands of observations with run times of days.

I am working with data from cohort studies which are not extremely large, but which can be slow because of the likelihood function we use.

A typical example would be a regression with around 30,000 individuals, each measured on multiple occasions (e.g. 3) and with multiple rating scales (e.g. 5), so that we end up with, for example, 3×5 = 15 outcomes per individual to model. There are only around 5-15 predictors in the model.

An apparently innocent hierarchical model on these data can take more than a week to fit if one uses a beta-binomial likelihood for the sum-scores (i.e. the sum of the ratings for a scale).
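To give a rough sense of where the time goes, here is a minimal sketch (plain Python, purely illustrative — the actual models would be in Stan, and all parameter values below are made up) of the beta-binomial log-likelihood evaluated over a dataset of the size described above: 30,000 individuals × 3 occasions × 5 scales, with several `lgamma` calls per observation.

```python
from math import lgamma, exp

def betabinom_logpmf(k, n, a, b):
    """Log-density of a beta-binomial: k successes out of n trials,
    with a Beta(a, b) distribution on the success probability."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)      # log C(n, k)
            + lgamma(k + a) + lgamma(n - k + b) - lgamma(n + a + b)  # log B(k+a, n-k+b)
            - (lgamma(a) + lgamma(b) - lgamma(a + b)))               # - log B(a, b)

# Dataset shape from the post: 30,000 individuals x 3 occasions x 5 scales,
# i.e. 450,000 beta-binomial terms in every single log-density evaluation.
n_obs = 30_000 * 3 * 5

# Sanity check with made-up parameters: the pmf sums to 1 over its support.
total = sum(exp(betabinom_logpmf(k, n=20, a=2.0, b=3.0)) for k in range(21))
```

Since HMC needs one full log-density (and gradient) evaluation per leapfrog step, with possibly hundreds of steps per iteration and thousands of iterations, those 450,000 `lgamma`-heavy terms get re-evaluated an enormous number of times — which seems consistent with the week-long run times mentioned above.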

My more general point is that it would be great if you could also look beyond linear and logistic regressions, because when the likelihood function is not easy to evaluate, estimating models on not-so-very-large data can be quite slow.

In typical single-cell RNA sequencing datasets, with 1,000-100,000 observations of 2,000-10,000 dependent variables and fewer independent variables, sampling-based posterior inference is currently infeasible. For example, see this attempt to model the Perturb-seq dataset (doi:10.1016/j.cell.2016.11.038) with very simple models using Stan:

Jukka Intosalmi, Henrik Mannerström, Saara Hiltunen, Harri Lähdesmäki. SCHiRM: Single Cell Hierarchical Regression Model to detect dependencies in read count data. doi:10.1101/335695

Economist here: firm-level data across countries and time, around 400,000 data points. Income statement and balance sheet data. Hierarchical models. Endogeneity issues.

The data comes from Compustat Global and Compustat North America via the WRDS interface at Wharton Business School (not my school, but a common access point).