Subsampling in parallel and MCMC

Bob_Carpenter · April 25, 2019, 3:41pm

I am working on lots of simulations and here’s a recent plot I came up with.

Subsampling data and fitting each subsample separately leaves you with draws that are far away from the data resulting from the posterior that combines all the data. (First 15 plots) Posteriors for 15 simulated data sets of size N = 50 for a logistic regression model with intercept (alpha) and single predictor (beta) with weakly informative priors. (Final plot) Posterior resulting from combining all 15 simulated data sets for a data set of size N = 750.

The Model

y_n \sim \mbox{bernoulli}(\mbox{logit}^{-1}(\alpha + \beta \cdot x_n)
\alpha, \beta \sim \mbox{normal}(0, 2),

with parameters

\alpha = 1 and \beta = -1.

and data simulated as

x_n \sim \mbox{normal}(0, 1).

In-depth discussion

This was motivated by theoretical discussions here:

Michael Betancourt. The Fundamental Incompatibility of Hamiltonian Monte Carlo and Data Subsampling.

and the computational issue illustrated in Figure 1 here:

Aki Vehtari et al. Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data.

yizhang · April 25, 2019, 4:05pm

Instead of subsampling the data, does it make sense to use the combined data but sample different region of the parameter space in parallel? This would require gather the samples from each process though.

hhau · April 25, 2019, 4:21pm

Some recent papers have simulation studies / nice pictures that are similar to the ones in the OP: (these are just the ones that I’ve read semi-closely, they cite a numbers of others that look at examples of this type).

https://projecteuclid.org/euclid.ba/1502265628

hhau · April 25, 2019, 5:09pm

This is the idea behind umbrella sampling / parallel tempering. The tricky part is that if one weights the target distribution to learn about a different part of parameter space, then one needs to accurately compute the normalising constant for each of the weighted targets in order to combine the samples from the various targets (challenging in general).

yizhang · April 25, 2019, 5:30pm

I vaguely remember there’s some auxiliary distribution to be applied to resolve the multi-modality issue. The challenge mostly is on dealing with modes with different structures. But for more regulated modes the method seems work. Note that the idea is also in line with multiple-try mcmc, which is more brute force but also shows performance improvement.

Bob_Carpenter · April 25, 2019, 6:55pm

The idea behind that EP paper I link is similar. It uses what’s known as a “cavity distribution” to approximate the contribution from the other batches. It’s not completely parallel, but low communication. I don’t know if it’s ever been implemented or even prototyped on real problem. The paper’s one of these massive jobs where a dozen people shoveled in everything they were thinking about EP.

yizhang · April 25, 2019, 7:01pm

yeah, I always get confused when the term “subsampling” is used, to some people it means sub data, to some it means sub space.

Bob_Carpenter · April 25, 2019, 7:03pm

There are dozens of these papers out there.

I think the “deep invertible transformations” thing is doing something similar to the EP paper I cite, only in a general way based on deep neural nets rather than by exploiting the form of the EP approximation. The “deep invertible” thing is hot at the intersection of ML and MCMC. Matt Hoffman just gave a talk at Columbia on applying it to conditioning Hamiltonian Monte Carlo (they had a paper a while back on this topic, but they told me that didn’t really work).

The meta-analysis paper is similar in that it’s using a deep neural net approximation of posteriors. There’s so much going on at Aalto these days! Aki was also involved in the EP paper and is also working with Sam Kaski. Ideally in meta-analysis, we’d build a big joint model of all the data, not of the summaries; but in practice, we need to take what we can get.

[edit: Oh, I missed the last one first time around. That’s just using GPs instead of neural nets for the approximation. A single layer neural net approaches a GP in the limit of infinitely many intermediate nodes. There’s been some recent work on stacking GPs (using NNGP approximations) to compute covariance a la deep neural nets, too. For instance, for more work from Google, see this paper by Lee et al..

Bob_Carpenter · April 25, 2019, 7:03pm

Statisticians like uncertainty so much, all their key technical terms are ambiguous :-) Is there a better word that subsampling for subsetting the data?

yizhang · April 25, 2019, 7:14pm

No idea, too many variants of the methodology to filter the nomenclature.

Note that subsetting data fits stan’s map_rect framework, but subsampling space does not(assuming a large set of combined data).

hhau · April 25, 2019, 8:06pm

Surely this is a conservative lower bound :).

yuling · April 25, 2019, 9:04pm

Could you illustrate the specific rescaling algorithms you use here?

Bob_Carpenter · April 29, 2019, 7:29pm

What do you mean by rescaling? I wasn’t doing any kind of rescaling in the plot—it’s just 15 posteriors and then the posterior from combining all 15 data sets. Here’s the code:

library(rstan)
library(ggplot2)

printf <- function(msg, ...) cat(sprintf(msg, ...)); cat('\n')

inv_logit <- function(u) 1 / (1 + exp(-u))

program <- "
data {
  int<lower = 0> N;
  vector[N] x;
  int y[N];
}
parameters {
  real alpha;
  real beta;
}
model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  y ~ bernoulli_logit(alpha + x * beta);
}
"

model <- stan_model(model_code = program)

G <- 15
N <- 50
x <- matrix(rnorm(N * G), N, G)
alpha <- 1
beta <- -1
y <- matrix(rbinom(G * N, 1, inv_logit(alpha + beta * as.vector(x))), N, G)

df <- data.frame()
for (g in 1:G) {
  fit <- sampling(model, data = list(N = N, x = x[, g], y = y[ ,g]), refresh = 0)
  ss <- extract(fit)
  df <- rbind(df, data.frame(alpha = ss$alpha, beta = ss$beta,
                             run =  rep(paste("run", g), 4000)))
}
big_fit <-
  sampling(model,
           data = list(N = G * N, x = as.vector(x), y = as.vector(y)),
	   refresh = 0)
ss <- extract(big_fit)
df <- rbind(df, data.frame(alpha = ss$alpha, beta = ss$beta,
                             run =  rep("combined", 4000)))

plot <-
  ggplot(df, aes(x = alpha, y = beta)) +
  facet_wrap(run ~ .) +
  geom_hline(yintercept = beta, size = 0.25, color = "red") +
  geom_vline(xintercept = alpha, size = 0.25, color = "red") +
  geom_point(size = 0.1, alpha = 0.1) +
  scale_x_continuous(lim = c(-0.5, 2.5), breaks = c(0, 1, 2)) +
  scale_y_continuous(lim = c(-4, 2), breaks = c(-4, -1, 2))

plot

Topic		Replies	Views
Splitting data and combining sub-posteriors for “big” data General	7	1901	July 7, 2020
Aggregation posterior from small data sets General	3	508	July 30, 2022
Idea for out of chain parallel MCMC General	2	764	July 13, 2017
Bayesian inference after multiple imputation Modeling	7	1200	March 22, 2024
Posterior predictive one dataset per sample or block of samples? General	2	469	December 14, 2018

Subsampling in parallel and MCMC

Related topics