Splitting data and combining sub-posteriors for “big” data

Suppose we have so much data that fitting a Stan model to all of it in a single run is not feasible. What about (randomly) splitting the data into, say, n subsets of equal size, each small enough to be amenable to simulation? For each subset we could then obtain a sub-posterior for the parameters of interest (using a number of independent chains). How would one then go about combining these n sub-posteriors? Is this a valid approach, and has anyone used it with Stan before? Any experiences and thoughts?

A naive method would be to simply merge the parameter samples of the n sub-posteriors (assuming an equal number of samples from each). However, I suspect this is too simple to be true, and some reweighting might be required.

See here: https://arxiv.org/abs/1502.01510.
An intuitive point: with any Bayesian method, using only a subset of the data will result in posteriors that are wider than they should be.

Thanks, I have to read the paper. While skimming it I was wondering whether it actually covers what I was referring to: I wasn’t suggesting using the posterior from MCMC over data from a single sub-sample (or a varying sub-sample of fixed size within MCMC), but rather combining (in some clever way) all the sub-posteriors from the different MCMC runs done on sub-samples that partition the entire data set. My hope was that the problem that each sub-posterior on its own might not be representative would in some sense be compensated for by combining the sub-posteriors… But as I said, maybe the paper covers this; I have to read it carefully.

I agree that the paper discusses a different type of sub-sampling than you were describing. From page three of the paper:

The performance of any such subsampling method depends critically on the details of the implementation and the structure of the data itself. Here I consider the performance of two immediate implementations, one based on subsampling the data in between Hamiltonian trajectories and one based on subsampling the data within a single trajectory. Unfortunately, the performance of both methods leaves much to be desired.

I think the paper Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data answers many of your questions.

Right. If you subsample the data into 100 minibatches, each minibatch’s posterior will be roughly 10 times as wide as the combined data’s posterior, because the posterior standard deviation shrinks roughly like 1/sqrt(n) and sqrt(100) = 10 (assuming the model is well-behaved in the face of increasing data, that is).
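
Here is a quick toy check of that scaling in R, using a conjugate normal model with a flat prior and known variance so the posteriors are available in closed form; all sizes and numbers below are made up for illustration, and no actual Stan runs are involved:

```r
set.seed(1)
sigma <- 1                      # known observation sd
N     <- 10000                  # total data size
K     <- 100                    # number of minibatches
y     <- rnorm(N, mean = 2, sd = sigma)
shards <- split(y, rep(1:K, length.out = N))

# with a flat prior, the posterior for the mean given n observations is
# N(mean(y), sigma^2 / n)
full_sd  <- sigma / sqrt(N)      # full-data posterior sd
shard_sd <- sigma / sqrt(N / K)  # single-minibatch posterior sd, sqrt(K) wider

# naively pooling equal numbers of draws from each sub-posterior just gives a
# mixture of the K sub-posteriors
pooled <- unlist(lapply(shards, function(yk)
  rnorm(1000, mean(yk), sigma / sqrt(length(yk)))))

c(full_sd = full_sd, shard_sd = shard_sd, pooled_sd = sd(pooled))
# shard_sd / full_sd = sqrt(100) = 10; the naively pooled draws end up on the
# scale of a single shard's posterior (a bit wider still, since the shard means
# also scatter), i.e. an order of magnitude too wide
```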

See Aki et al.’s “Expectation Propagation as a Way of Life” arXiv paper, which focuses on exactly this issue and recommends using the cavity distribution to mitigate the problem. This induces some lightweight communication between the sub-fits, but it’s going to be worth it if there are lots of minibatches.

There is also consensus MCMC, which uses a similar idea of combining analytically via a normal approximation. The difference is that each MCMC sample (i.e. iteration) from one subset is combined “particle-wise” with MCMC samples from the other subsets, so the result is a combined set of samples rather than an analytical approximation to the full posterior. Having done both EP and consensus MCMC, I offer the following comparison:

Consensus MCMC Pros:
The output is a set of samples that (to an extent the Scott paper is a bit vague about) preserves some of the non-normality/skewness/overdispersion etc. of the actual full posterior;

No iteration necessary. In fact, computing the combined posterior is dead easy using the R package parallelMCMCcombine on the results of the Stan fits, so it is much quicker and much less coding effort (a toy sketch of the consensus rule follows below);
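
For concreteness, here is a toy sketch of the consensus weighted-averaging rule itself (Scott et al.) in base R, using closed-form Gaussian sub-posteriors in place of real Stan fits; all sizes and values below are made up, and in practice you would pass the per-subset Stan draws to parallelMCMCcombine rather than rolling your own:

```r
set.seed(2)
sigma  <- 1       # known observation sd
theta0 <- 2       # true mean used to simulate the toy data
K      <- 10      # number of data subsets
n_k    <- 100     # observations per subset
S      <- 4000    # draws per sub-posterior

y <- matrix(rnorm(K * n_k, theta0, sigma), nrow = K)

# with a flat prior each sub-posterior is N(mean(y_k), sigma^2 / n_k); here we
# draw from it directly instead of running K separate Stan fits
# (the Scott et al. paper suggests giving each subset the prior to the power 1/K)
sub_draws <- sapply(1:K, function(k) rnorm(S, mean(y[k, ]), sigma / sqrt(n_k)))

# consensus rule: combine draw s from every subset as a weighted average,
# with each subset weighted by its inverse sub-posterior variance
w        <- 1 / apply(sub_draws, 2, var)
combined <- as.vector(sub_draws %*% w) / sum(w)

c(consensus_sd = sd(combined), full_posterior_sd = sigma / sqrt(K * n_k))
# the combined draws recover the full-data posterior width in this toy case
```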

EP Pros:
Allows you to apply the full prior to each subset of the data, which is useful for regularisation;

Allows the inference on each sub-sample to condition on the other sub-samples, which may also help with regularisation when each shard of the data has very little information about some parameters. The price of this is a lot of linear algebra, having to worry about non-positive-definite matrices, and multiple iterations until convergence (a minimal sketch of the cavity update follows after this list);

Can be used within a multilevel model where hyperparameters cross sub-samples but lower-level parameters do not.
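
To make the cavity idea concrete, here is a minimal EP sketch in R on a 1D normal-mean model with known noise variance, where the moment matching is exact, so EP reproduces the full-data posterior. In a real partitioned-data setting you would run MCMC on each tilted distribution and moment-match the draws, as in the EP-as-a-way-of-life paper; everything below is a made-up toy, not that implementation:

```r
set.seed(1)
sigma  <- 1                      # known observation sd
y      <- rnorm(1000, mean = 2, sd = sigma)
K      <- 10
shards <- split(y, rep(1:K, length.out = length(y)))

# prior N(0, 10^2), stored as natural parameters (precision, precision * mean)
prior_prec <- 1 / 100
prior_pm   <- 0

# one Gaussian site approximation per shard, initialised to be flat
site_prec <- rep(0, K)
site_pm   <- rep(0, K)

for (iter in 1:5) {
  for (k in 1:K) {
    # global approximation = prior * product of all site approximations
    glob_prec <- prior_prec + sum(site_prec)
    glob_pm   <- prior_pm   + sum(site_pm)
    # cavity distribution: remove site k from the global approximation
    cav_prec <- glob_prec - site_prec[k]
    cav_pm   <- glob_pm   - site_pm[k]
    # tilted distribution: cavity * exact shard-k likelihood (conjugate => Gaussian)
    tilt_prec <- cav_prec + length(shards[[k]]) / sigma^2
    tilt_pm   <- cav_pm   + sum(shards[[k]])    / sigma^2
    # moment matching is exact here; updated site = tilted / cavity
    site_prec[k] <- tilt_prec - cav_prec
    site_pm[k]   <- tilt_pm   - cav_pm
  }
}

glob_prec <- prior_prec + sum(site_prec)
glob_pm   <- prior_pm   + sum(site_pm)
c(mean = glob_pm / glob_prec, sd = sqrt(1 / glob_prec))
# matches the exact full-data posterior because this toy model is conjugate
```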

I’ve not used any of these methods, but I’ve seen a line of work where they combine the subset posteriors by finding the barycenter w.r.t. the Wasserstein distance: http://proceedings.mlr.press/v38/srivastava15.pdf I think this is distinct from the suggestions above.
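
I’ve only skimmed that work, but for intuition: in one dimension with Gaussian sub-posteriors the Wasserstein-2 barycenter has a simple closed form (average the means, average the standard deviations), and as I understand it the paper has each subset raise its likelihood to the power K so the sub-posteriors are already on the scale of the full posterior before averaging. A toy sketch with made-up numbers:

```r
# hypothetical means and sds of K = 4 Gaussian subset-posterior approximations,
# each assumed to have been fit with its likelihood raised to the power K
sub_mean <- c(1.90, 2.10, 2.00, 2.05)
sub_sd   <- c(0.11, 0.09, 0.10, 0.10)

# for 1D Gaussians with equal weights the W2 barycenter is again Gaussian,
# with mean = average of the means and sd = average of the sds
c(bary_mean = mean(sub_mean), bary_sd = mean(sub_sd))
```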