Splitting data and combining sub-posteriors for “big” data


#1

Suppose we have so much data that running a Stan model in a single chain is not feasible. What about the approach of (randomly) splitting the data into, say, n subsets of equal size that are amenable to simulation? For each subset we can then obtain a sub-posterior for the parameters of interest (using a number of independent chains). How would one then combine these n sub-posteriors? Is this a valid approach, and has anyone used it with Stan before? Any experiences and thoughts?

A naive method would be to simply merge the parameter samples of the n sub-posteriors (assuming an equal number of samples in each). However, this seems too simple to be true, and some reweighting might be required?
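For what it's worth, here is a toy sketch of why plain merging is too simple, using a conjugate normal model where every posterior is known in closed form (the setup is mine, purely for illustration): pooling the draws yields a distribution roughly sqrt(K) times too wide, where K is the number of subsets.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: y ~ Normal(mu, 1) with a flat prior on mu,
# so the full-data posterior is Normal(mean(y), 1/N).
N, K = 10_000, 10                 # total data size, number of subsets
y = rng.normal(0.3, 1.0, size=N)
subsets = np.array_split(rng.permutation(y), K)

# Exact draws from each sub-posterior: Normal(mean(y_k), 1/n_k).
draws = [rng.normal(s.mean(), 1 / np.sqrt(s.size), size=4_000)
         for s in subsets]

pooled = np.concatenate(draws)    # the "naive merge": just pool the draws
print("true posterior sd :", 1 / np.sqrt(N))
print("naive-merge sd    :", pooled.std())
# The pooled sd is ~sqrt(K) times too large (each sub-posterior is that
# much wider), inflated further by the spread of the subset means.
```

So the sub-posteriors do need to be recombined in a smarter way than simple concatenation; the papers mentioned below discuss principled options.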


#2

See here: https://arxiv.org/abs/1502.01510.
An intuitive idea that came to mind: with any Bayesian method, using only a subset of the data will result in a posterior that is wider than it should be.


#3

Thanks, I have to read the paper. While skimming it I was wondering whether it is actually what I was referring to, since I wasn't suggesting to use the posterior from MCMC over the data from a single sub-sample (or a varying sub-sample of fixed size within MCMC), but rather to combine, in some clever way, all the sub-posteriors from the different MCMC runs done on sub-samples that partition the entire data set. My hope was that the problem that each sub-posterior on its own might not be representative would in some sense be compensated for by combining the sub-posteriors. But as I said, maybe the paper covers this; I have to read it carefully.


#4

I agree that the paper discusses a different type of sub-sampling than you were describing. From page three of the paper:

The performance of any such subsampling method depends critically on the details of the implementation and the structure of the data itself. Here I consider the performance of two immediate implementations, one based on subsampling the data in between Hamiltonian trajectories and one based on subsampling the data within a single trajectory. Unfortunately, the performance of both methods leaves much to be desired.


#5

I think the paper "Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data" answers many of your questions.


#6

Right. If you subsample the data into 100 minibatches, each minibatch's posterior will be about 10 times as wide as the combined data's posterior: posterior standard deviations typically shrink like 1/sqrt(n), so using 1/100 of the data inflates them by sqrt(100) = 10 (assuming the model is well behaved in the face of increasing data, that is).

See Aki et al.'s "Expectation Propagation as a Way of Life" arXiv paper, which focuses on exactly this issue and recommends using the cavity distribution to mitigate the problem. This induces some lightweight communication between the workers, but it's going to be worth it if there are lots of minibatches.
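For intuition, here is a minimal sketch (mine, not the paper's algorithm) of the simplest non-iterative version of this idea: approximate each sub-posterior with a multivariate normal and multiply the approximations together. It assumes each subset was fit with the fractional prior p(theta)^(1/K), so that the product of the K sub-posteriors is proportional to the full posterior; the EP scheme in the paper is iterative and uses cavity distributions instead, so treat this only as a baseline.

```python
import numpy as np

def combine_gaussian_subposteriors(sub_draws):
    """Combine K sets of sub-posterior draws via Gaussian approximations.

    Each element of sub_draws is an (n_draws, n_params) array with
    n_params >= 2.  A product of normals N(m_k, S_k) is normal with
        precision P = sum_k P_k   and   mean = P^{-1} sum_k P_k m_k,
    where P_k = S_k^{-1}.
    """
    precisions, weighted_means = [], []
    for d in sub_draws:
        m = d.mean(axis=0)
        P = np.linalg.inv(np.cov(d, rowvar=False))  # sample precision
        precisions.append(P)
        weighted_means.append(P @ m)
    precision = np.sum(precisions, axis=0)
    cov = np.linalg.inv(precision)
    mean = cov @ np.sum(weighted_means, axis=0)
    return mean, cov  # moments of the combined Gaussian approximation
```

For near-Gaussian posteriors this precision-weighted combination is far better calibrated than pooling the raw draws (in a conjugate normal model like the toy example in post #1, it is exact up to Monte Carlo error); for strongly non-Gaussian or multimodal posteriors it can fail badly, which is part of what the iterative EP framework is meant to address.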