Idea for out of chain parallel MCMC

This may be a naive question, but I am wondering about the validity of the following idea:

  1. Take k random subsets of the data (for example k=10) and run a sampling chain on each subset (say 1000 warmup iterations and 100 sampling iterations).
  2. For each set of 100 parameter samples, do a Bayesian update with the rest of the dataset (the k-1 other subsets), which gives a weight for every sample. This should not be too computationally expensive, since we only evaluate the likelihood on a discrete set of parameter values.
  3. Pooled together, those k*100 weighted samples should be a good approximation of the posterior distribution.
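
To make the three steps concrete, here is a minimal sketch in Python. Everything model-specific is an assumption for illustration: the data are i.i.d. normal with known sigma, the prior on the mean is N(0, prior_sd^2), and the per-subset "chain" is replaced by exact draws from the conjugate subset posterior, so no real MCMC is run. The function names (`subset_posterior_draws`, `log_lik`, `reweighted_pool`) are made up. The weights are self-normalized importance weights: the subset posterior is the proposal, and the likelihood of the remaining k-1 subsets is the weight. The effective sample size per set is returned as a rough check on how much the weights degenerate.

```python
import numpy as np

def subset_posterior_draws(y_sub, n_draws, rng, sigma=1.0, prior_sd=10.0):
    """Stand-in for a per-subset MCMC chain: exact draws from the conjugate
    normal posterior of the mean mu (known sigma, N(0, prior_sd^2) prior)."""
    prec = 1.0 / prior_sd**2 + len(y_sub) / sigma**2
    mean = (y_sub.sum() / sigma**2) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec), size=n_draws)

def log_lik(mu, y, sigma=1.0):
    """Log-likelihood of y for each value in mu, up to a constant in mu."""
    return -0.5 * ((y[None, :] - mu[:, None]) ** 2).sum(axis=1) / sigma**2

def reweighted_pool(y, k=10, n_draws=100, seed=0):
    """Steps 1-3: sample on each subset, weight by the held-out likelihood,
    and pool the k weighted sets with equal mass 1/k each."""
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(len(y)), k)
    draws, weights, ess = [], [], []
    for idx in subsets:
        rest = np.ones(len(y), dtype=bool)
        rest[idx] = False                      # the k-1 other subsets
        mu = subset_posterior_draws(y[idx], n_draws, rng)
        lw = log_lik(mu, y[rest])
        w = np.exp(lw - lw.max())              # subtract max for stability
        w /= w.sum()                           # self-normalize within the set
        draws.append(mu)
        weights.append(w / k)
        ess.append(1.0 / np.sum(w**2))         # effective sample size of the set
    return np.concatenate(draws), np.concatenate(weights), np.array(ess)

if __name__ == "__main__":
    y = np.random.default_rng(1).normal(1.0, 1.0, size=200)
    draws, w, ess = reweighted_pool(y)
    print("weighted posterior mean:", np.sum(w * draws))
    print("effective sample size per subset:", np.round(ess, 1))
```

In this one-dimensional toy case the reweighting is well behaved, but the effective sample size per set is typically well below 100, and it would be expected to drop quickly as the subset posterior gets wider relative to the full posterior or as the number of parameters grows.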

The advantage here is the fully parallel computing procedure, so we could take advantage of HPC clusters.
Is there something I am missing?

I see that the 100 samples are not very good for estimating the marginal likelihood. Alternatively, the updating could be done sequentially for every subset, thus always using the 900 left-out samples.


Well, with any method, step one is laying it out and figuring out exactly what you’re getting right and what you’re getting wrong. If you’re cutting data into pieces, that’d be the place to start. Maybe things can be recombined later, maybe not.

When it comes to statistical approximations, it’s easy to get caught up in the idea that whatever small assumption you’ve had to make won’t be that big a deal for the problem you’re working on, and that you’ll still get at the true posteriors you’re after. Practically, though, it’s hard enough to figure out a useful model and get good sampling on it even with an exact algorithm. It’s fun to play with this stuff, but it’s hard to trust it.

Here’s a thing from Betancourt about the 8-schools model in Stan (simple model, small data, fancy algorithm -> still really hard to get it right):

Here’s a thing by Bob on ensemble methods:

Best of luck!

edit: changed desc. of the Bob link

Or you could read Michael Betancourt’s paper,

and then follow that up with Andrew Gelman, Aki Vehtari, et al. on EP (which uses the cavity distribution to mitigate some of the problems with the kind of naive subsampling you and others suggest):