Splitting data and combining sub-posteriors for “big” data


#1

Suppose we have so much data that running a Stan model in one chain is not feasible. What about (randomly) splitting the data into, say, n subsets of equal size, each small enough to be amenable to simulation? For each subset we can then obtain a sub-posterior for the parameters of interest (using a number of independent chains). How would one then combine these n sub-posteriors? Is this a valid approach, and has anyone used it with Stan before? Any experiences and thoughts?

A naive method would be to simply merge the parameter samples of the n sub-posteriors (assuming an equal number of samples from each). However, this seems too simple to be true (and some reweighting might be required?).
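To make the worry about naive merging concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration: the model (a normal mean with known unit variance and a flat prior), the variable names, and the use of exact draws from each sub-posterior as a stand-in for running Stan on each subset. It compares pooling all draws against a precision-weighted average of draws across subsets (the idea behind consensus Monte Carlo, which is not something proposed in this thread). With a proper prior, each subset would also need the prior raised to the power 1/n so that it is not counted n times.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumed for illustration): y_i ~ Normal(mu, 1) with a flat prior on mu,
# so the posterior for mu given data y is Normal(mean(y), 1 / len(y)).
N, n_subsets, n_draws = 10_000, 10, 4_000
y = rng.normal(loc=2.0, scale=1.0, size=N)

# Randomly partition the data into equally sized subsets.
subsets = np.array_split(rng.permutation(y), n_subsets)

# Stand-in for fitting each subset with Stan: draw from each exact sub-posterior.
sub_draws = np.stack([
    rng.normal(loc=s.mean(), scale=1.0 / np.sqrt(len(s)), size=n_draws)
    for s in subsets
])  # shape: (n_subsets, n_draws)

# Naive approach: pool all draws into one sample.
naive = sub_draws.ravel()

# Precision-weighted average of the k-th draw from every subset.
# Weights are the inverse sample variances of each subset's draws.
weights = 1.0 / sub_draws.var(axis=1, ddof=1)
consensus = (weights[:, None] * sub_draws).sum(axis=0) / weights.sum()

# Exact full-data posterior standard deviation for this toy model.
full_sd = 1.0 / np.sqrt(N)

print("full-data posterior sd:", full_sd)
print("naive pooled sd:       ", naive.std(ddof=1))
print("weighted-average sd:   ", consensus.std(ddof=1))
```

In this Gaussian toy case the pooled draws are several times too wide, while the weighted average recovers the full-data posterior width. That exactness is specific to Gaussian sub-posteriors; for other models it is only an approximation.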


#2

See here: https://arxiv.org/abs/1502.01510.
An intuitive point that comes to mind: with any Bayesian method, using only a subset of the data will result in posteriors that are wider than they should be.


#3

Thanks, I will have to read the paper. While skimming it, I wondered whether it actually covers what I was referring to. I wasn’t suggesting using the posterior from MCMC run on a single sub-sample of the data (or on a varying sub-sample of fixed size within MCMC), but rather combining (in some clever way) all the sub-posteriors from separate MCMC runs on sub-samples that partition the entire data set. My hope was that the problem of each sub-posterior not being representative on its own would in some sense be compensated for by combining the sub-posteriors… But as I said, maybe the paper covers this; I have to read it carefully…
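To spell out why combining might work in principle, here is a sketch of the underlying identity. It assumes the observations in different subsets are conditionally independent given the parameters θ; the notation (y_s for the data in subset s, p(θ) for the prior) is mine and not taken from the paper.

$$
p(\theta \mid y) \;\propto\; p(\theta) \prod_{s=1}^{n} p(y_s \mid \theta) \;=\; \prod_{s=1}^{n} \left[ p(\theta)^{1/n} \, p(y_s \mid \theta) \right]
$$

So the full posterior is proportional to the product of the n sub-posteriors, provided each subset is fitted with the prior raised to the power 1/n. The difficulty is that MCMC returns draws rather than densities, so this product has to be approximated somehow from the draws.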


#4

I agree that the paper discusses a different type of sub-sampling than you were describing. From page three of the paper:

"The performance of any such subsampling method depends critically on the details of the implementation and the structure of the data itself. Here I consider the performance of two immediate implementations, one based on subsampling the data in between Hamiltonian trajectories and one based on subsampling the data within a single trajectory. Unfortunately, the performance of both methods leaves much to be desired."


#5

I think the paper “Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data” answers many of your questions.