Subsampling in parallel and MCMC

I am working on lots of simulations and here’s a recent plot I came up with.

Subsampling data and fitting each subsample separately leaves you with draws that are far away from the data resulting from the posterior that combines all the data. (First 15 plots) Posteriors for 15 simulated data sets of size N = 50 for a logistic regression model with intercept (alpha) and single predictor (beta) with weakly informative priors. (Final plot) Posterior resulting from combining all 15 simulated data sets for a data set of size N = 750.

The Model

  • y_n \sim \mbox{bernoulli}(\mbox{logit}^{-1}(\alpha + \beta \cdot x_n)

  • \alpha, \beta \sim \mbox{normal}(0, 2),

with parameters

  • \alpha = 1 and \beta = -1.

and data simulated as

  • x_n \sim \mbox{normal}(0, 1).

In-depth discussion

This was motivated by theoretical discussions here:

and the computational issue illustrated in Figure 1 here:


Instead of subsampling the data, does it make sense to use the combined data but sample different region of the parameter space in parallel? This would require gather the samples from each process though.

1 Like

Some recent papers have simulation studies / nice pictures that are similar to the ones in the OP: (these are just the ones that I’ve read semi-closely, they cite a numbers of others that look at examples of this type).

This is the idea behind umbrella sampling / parallel tempering. The tricky part is that if one weights the target distribution to learn about a different part of parameter space, then one needs to accurately compute the normalising constant for each of the weighted targets in order to combine the samples from the various targets (challenging in general).

I vaguely remember there’s some auxiliary distribution to be applied to resolve the multi-modality issue. The challenge mostly is on dealing with modes with different structures. But for more regulated modes the method seems work. Note that the idea is also in line with multiple-try mcmc, which is more brute force but also shows performance improvement.

The idea behind that EP paper I link is similar. It uses what’s known as a “cavity distribution” to approximate the contribution from the other batches. It’s not completely parallel, but low communication. I don’t know if it’s ever been implemented or even prototyped on real problem. The paper’s one of these massive jobs where a dozen people shoveled in everything they were thinking about EP.

yeah, I always get confused when the term “subsampling” is used, to some people it means sub data, to some it means sub space.

There are dozens of these papers out there.

I think the “deep invertible transformations” thing is doing something similar to the EP paper I cite, only in a general way based on deep neural nets rather than by exploiting the form of the EP approximation. The “deep invertible” thing is hot at the intersection of ML and MCMC. Matt Hoffman just gave a talk at Columbia on applying it to conditioning Hamiltonian Monte Carlo (they had a paper a while back on this topic, but they told me that didn’t really work).

The meta-analysis paper is similar in that it’s using a deep neural net approximation of posteriors. There’s so much going on at Aalto these days! Aki was also involved in the EP paper and is also working with Sam Kaski. Ideally in meta-analysis, we’d build a big joint model of all the data, not of the summaries; but in practice, we need to take what we can get.

[edit: Oh, I missed the last one first time around. That’s just using GPs instead of neural nets for the approximation. A single layer neural net approaches a GP in the limit of infinitely many intermediate nodes. There’s been some recent work on stacking GPs (using NNGP approximations) to compute covariance a la deep neural nets, too. For instance, for more work from Google, see this paper by Lee et al..

1 Like

Statisticians like uncertainty so much, all their key technical terms are ambiguous :-) Is there a better word that subsampling for subsetting the data?

No idea, too many variants of the methodology to filter the nomenclature.

Note that subsetting data fits stan’s map_rect framework, but subsampling space does not(assuming a large set of combined data).

Surely this is a conservative lower bound :).

1 Like

Could you illustrate the specific rescaling algorithms you use here?

What do you mean by rescaling? I wasn’t doing any kind of rescaling in the plot—it’s just 15 posteriors and then the posterior from combining all 15 data sets. Here’s the code:


printf <- function(msg, ...) cat(sprintf(msg, ...)); cat('\n')

inv_logit <- function(u) 1 / (1 + exp(-u))

program <- "
data {
  int<lower = 0> N;
  vector[N] x;
  int y[N];
parameters {
  real alpha;
  real beta;
model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  y ~ bernoulli_logit(alpha + x * beta);

model <- stan_model(model_code = program)

G <- 15
N <- 50
x <- matrix(rnorm(N * G), N, G)
alpha <- 1
beta <- -1
y <- matrix(rbinom(G * N, 1, inv_logit(alpha + beta * as.vector(x))), N, G)

df <- data.frame()
for (g in 1:G) {
  fit <- sampling(model, data = list(N = N, x = x[, g], y = y[ ,g]), refresh = 0)
  ss <- extract(fit)
  df <- rbind(df, data.frame(alpha = ss$alpha, beta = ss$beta,
                             run =  rep(paste("run", g), 4000)))
big_fit <-
           data = list(N = G * N, x = as.vector(x), y = as.vector(y)),
	   refresh = 0)
ss <- extract(big_fit)
df <- rbind(df, data.frame(alpha = ss$alpha, beta = ss$beta,
                             run =  rep("combined", 4000)))

plot <-
  ggplot(df, aes(x = alpha, y = beta)) +
  facet_wrap(run ~ .) +
  geom_hline(yintercept = beta, size = 0.25, color = "red") +
  geom_vline(xintercept = alpha, size = 0.25, color = "red") +
  geom_point(size = 0.1, alpha = 0.1) +
  scale_x_continuous(lim = c(-0.5, 2.5), breaks = c(0, 1, 2)) +
  scale_y_continuous(lim = c(-4, 2), breaks = c(-4, -1, 2))