Splitting data and combining sub-posteriors for “big” data

Suppose we have so much data that fitting a Stan model to all of it in a single run is not feasible. What about (randomly) splitting the data into, say, n subsets of equal size, each small enough to be amenable to simulation? For each subset we could then obtain a sub-posterior for the parameters of interest (using a number of independent chains). How would one then go about combining these n sub-posteriors? Is this a valid approach, and has anyone used it with Stan before? Any experiences and thoughts?

A naive method would be to simply merge the parameter samples of the n sub-posteriors (assuming an equal number of samples from each). However, I suspect this is too simple to be true, and some reweighting might be required.

See here: https://arxiv.org/abs/1502.01510.
An intuitive point: with any Bayesian method, using only a subset of the data will result in posteriors that are wider than they should be.

Thanks, I have to read the paper. While skimming it I was wondering whether it actually covers what I was referring to: I wasn’t suggesting using the posterior from MCMC over data from a single sub-sample (or a varying sub-sample of fixed size within MCMC), but rather combining (in some clever way) all the sub-posteriors from the different MCMC runs done on sub-samples that partition the entire data set. My hope was that the problem that each sub-posterior on its own might not be representative would in some sense be compensated for by combining the sub-posteriors… But as I said, maybe the paper covers this; I have to read it carefully.

I agree that the paper discusses a different type of sub-sampling than you were describing. From page three of the paper:

The performance of any such subsampling method depends critically on the details of the implementation and the structure of the data itself. Here I consider the performance of two immediate implementations, one based on subsampling the data in between Hamiltonian trajectories and one based on subsampling the data within a single trajectory. Unfortunately, the performance of both methods leaves much to be desired.

I think the paper Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data answers many of your questions.

Right. If you subsample the data into 100 minibatches, each minibatch’s posterior will be roughly 10 times as wide as the combined data’s posterior, because the posterior standard deviation shrinks roughly like 1/sqrt(n) and sqrt(100) = 10 (assuming the model is well-behaved in the face of increasing data, that is).
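
Here is a quick toy check of that scaling in R, using a conjugate normal model with a flat prior and known variance so the posteriors are available in closed form; all sizes and numbers below are made up for illustration, and no actual Stan runs are involved:

```r
set.seed(1)
sigma <- 1                      # known observation sd
N     <- 10000                  # total data size
K     <- 100                    # number of minibatches
y     <- rnorm(N, mean = 2, sd = sigma)
shards <- split(y, rep(1:K, length.out = N))

# with a flat prior, the posterior for the mean given n observations is
# N(mean(y), sigma^2 / n)
full_sd  <- sigma / sqrt(N)      # full-data posterior sd
shard_sd <- sigma / sqrt(N / K)  # single-minibatch posterior sd, sqrt(K) wider

# naively pooling equal numbers of draws from each sub-posterior just gives a
# mixture of the K sub-posteriors
pooled <- unlist(lapply(shards, function(yk)
  rnorm(1000, mean(yk), sigma / sqrt(length(yk)))))

c(full_sd = full_sd, shard_sd = shard_sd, pooled_sd = sd(pooled))
# shard_sd / full_sd = sqrt(100) = 10; the naively pooled draws end up on the
# scale of a single shard's posterior (a bit wider still, since the shard means
# also scatter), i.e. an order of magnitude too wide
```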

See Aki et al.’s “Expectation Propagation as a Way of Life” arXiv paper, which focuses on exactly this issue and recommends using the cavity distribution to mitigate the problem. This induces some lightweight communication between the sub-fits, but it’s going to be worth it if there are lots of minibatches.

There is also consensus MCMC, which uses a similar idea of combining analytically via a normal approximation. The difference is that each MCMC sample (i.e. iteration) from one subset is combined “particle-wise” with MCMC samples from the other subsets, so the result is a combined set of samples rather than an analytical approximation to the full posterior. Having done both EP and consensus MCMC, I offer the following comparison:

Consensus MCMC Pros:
The output is a set of samples that (to an extent the Scott paper is a bit vague about) preserves some of the non-normality/skewness/overdispersion etc. of the actual full posterior;

No iteration necessary. In fact, computing the combined posterior is dead easy using the R package parallelMCMCcombine on the results of the Stan fits, so it is much quicker and much less coding effort (a toy sketch of the consensus rule follows below);
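
For concreteness, here is a toy sketch of the consensus weighted-averaging rule itself (Scott et al.) in base R, using closed-form Gaussian sub-posteriors in place of real Stan fits; all sizes and values below are made up, and in practice you would pass the per-subset Stan draws to parallelMCMCcombine rather than rolling your own:

```r
set.seed(2)
sigma  <- 1       # known observation sd
theta0 <- 2       # true mean used to simulate the toy data
K      <- 10      # number of data subsets
n_k    <- 100     # observations per subset
S      <- 4000    # draws per sub-posterior

y <- matrix(rnorm(K * n_k, theta0, sigma), nrow = K)

# with a flat prior each sub-posterior is N(mean(y_k), sigma^2 / n_k); here we
# draw from it directly instead of running K separate Stan fits
# (the Scott et al. paper suggests giving each subset the prior to the power 1/K)
sub_draws <- sapply(1:K, function(k) rnorm(S, mean(y[k, ]), sigma / sqrt(n_k)))

# consensus rule: combine draw s from every subset as a weighted average,
# with each subset weighted by its inverse sub-posterior variance
w        <- 1 / apply(sub_draws, 2, var)
combined <- as.vector(sub_draws %*% w) / sum(w)

c(consensus_sd = sd(combined), full_posterior_sd = sigma / sqrt(K * n_k))
# the combined draws recover the full-data posterior width in this toy case
```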

EP Pros:
Allows you to apply the full prior to each subset of the data, which is useful for regularisation;

Allows the inference on each sub-sample to condition on the other sub-samples, which may also help with regularisation when each shard of the data has very little information about some parameters. The price of this is a lot of linear algebra, having to worry about non-positive-definite matrices, and multiple iterations until convergence (a minimal sketch of the cavity update follows after this list);

Can be used within a multilevel model where hyperparameters cross sub-samples but lower-level parameters do not.
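
To make the cavity idea concrete, here is a minimal EP sketch in R on a 1D normal-mean model with known noise variance, where the moment matching is exact, so EP reproduces the full-data posterior. In a real partitioned-data setting you would run MCMC on each tilted distribution and moment-match the draws, as in the EP-as-a-way-of-life paper; everything below is a made-up toy, not that implementation:

```r
set.seed(1)
sigma  <- 1                      # known observation sd
y      <- rnorm(1000, mean = 2, sd = sigma)
K      <- 10
shards <- split(y, rep(1:K, length.out = length(y)))

# prior N(0, 10^2), stored as natural parameters (precision, precision * mean)
prior_prec <- 1 / 100
prior_pm   <- 0

# one Gaussian site approximation per shard, initialised to be flat
site_prec <- rep(0, K)
site_pm   <- rep(0, K)

for (iter in 1:5) {
  for (k in 1:K) {
    # global approximation = prior * product of all site approximations
    glob_prec <- prior_prec + sum(site_prec)
    glob_pm   <- prior_pm   + sum(site_pm)
    # cavity distribution: remove site k from the global approximation
    cav_prec <- glob_prec - site_prec[k]
    cav_pm   <- glob_pm   - site_pm[k]
    # tilted distribution: cavity * exact shard-k likelihood (conjugate => Gaussian)
    tilt_prec <- cav_prec + length(shards[[k]]) / sigma^2
    tilt_pm   <- cav_pm   + sum(shards[[k]])    / sigma^2
    # moment matching is exact here; updated site = tilted / cavity
    site_prec[k] <- tilt_prec - cav_prec
    site_pm[k]   <- tilt_pm   - cav_pm
  }
}

glob_prec <- prior_prec + sum(site_prec)
glob_pm   <- prior_pm   + sum(site_pm)
c(mean = glob_pm / glob_prec, sd = sqrt(1 / glob_prec))
# matches the exact full-data posterior because this toy model is conjugate
```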

I’ve not used any of these methods, but I’ve seen a line of work where they combine the subset posteriors by finding the barycenter w.r.t. the Wasserstein distance: http://proceedings.mlr.press/v38/srivastava15.pdf I think this is distinct from the suggestions above.
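
I’ve only skimmed that work, but for intuition: in one dimension with Gaussian sub-posteriors the Wasserstein-2 barycenter has a simple closed form (average the means, average the standard deviations), and as I understand it the paper has each subset raise its likelihood to the power K so the sub-posteriors are already on the scale of the full posterior before averaging. A toy sketch with made-up numbers:

```r
# hypothetical means and sds of K = 4 Gaussian subset-posterior approximations,
# each assumed to have been fit with its likelihood raised to the power K
sub_mean <- c(1.90, 2.10, 2.00, 2.05)
sub_sd   <- c(0.11, 0.09, 0.10, 0.10)

# for 1D Gaussians with equal weights the W2 barycenter is again Gaussian,
# with mean = average of the means and sd = average of the sds
c(bary_mean = mean(sub_mean), bary_sd = mean(sub_sd))
```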