I am working on lots of simulations and here’s a recent plot I came up with.
Subsampling the data and fitting each subsample separately leaves you with draws that are far from those produced by the posterior that combines all the data. (First 15 plots) Posteriors for 15 simulated data sets of size N = 50 for a logistic regression model with intercept (alpha) and single predictor (beta) with weakly informative priors. (Final plot) Posterior resulting from combining all 15 simulated data sets into a single data set of size N = 750.
The Model
- y_n \sim \mbox{bernoulli}(\mbox{logit}^{-1}(\alpha + \beta \cdot x_n))
- \alpha, \beta \sim \mbox{normal}(0, 2),
with the true parameter values used for simulation set to
- \alpha = 1 and \beta = -1.
and data simulated as
- x_n \sim \mbox{normal}(0, 1).
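The setup above can be sketched in a few lines of Python. This is only a rough illustration, not the code behind the plot: it assumes normal(0, 2) means a standard deviation of 2, and it swaps in a simple grid approximation of the posterior in place of whatever sampler produced the original draws. The function names `simulate` and `grid_posterior_mean`, the seed, and the grid range are all made up for the example.

```python
import numpy as np

def simulate(n, alpha=1.0, beta=-1.0, rng=None):
    """Draw x_n ~ normal(0, 1) and y_n ~ bernoulli(inv_logit(alpha + beta * x_n))."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(0.0, 1.0, size=n)
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
    y = rng.binomial(1, p)
    return x, y

def grid_posterior_mean(x, y, grid):
    """Posterior mean of (alpha, beta) on a grid, with normal(0, 2) priors (sd = 2 assumed)."""
    a, b = np.meshgrid(grid, grid, indexing="ij")
    eta = a[..., None] + b[..., None] * x  # linear predictor, shape (G, G, N)
    # Bernoulli log likelihood with logit link: y * eta - log(1 + exp(eta))
    log_lik = (y * eta - np.logaddexp(0.0, eta)).sum(axis=-1)
    log_prior = -0.5 * (a**2 + b**2) / 2.0**2
    log_post = log_lik + log_prior
    w = np.exp(log_post - log_post.max())  # subtract max for numerical stability
    w /= w.sum()
    return float((w * a).sum()), float((w * b).sum())

rng = np.random.default_rng(1234)  # arbitrary seed
subsets = [simulate(50, rng=rng) for _ in range(15)]  # 15 data sets, N = 50 each
x_all = np.concatenate([x for x, _ in subsets])       # combined data, N = 750
y_all = np.concatenate([y for _, y in subsets])

grid = np.linspace(-4.0, 4.0, 81)  # assumed wide enough to cover the posterior mass
subset_means = [grid_posterior_mean(x, y, grid) for x, y in subsets]
full_mean = grid_posterior_mean(x_all, y_all, grid)

for k, (a_hat, b_hat) in enumerate(subset_means, start=1):
    print(f"subset {k:2d}: alpha = {a_hat:5.2f}, beta = {b_hat:5.2f}")
print(f"combined : alpha = {full_mean[0]:5.2f}, beta = {full_mean[1]:5.2f}")
```

Comparing the 15 subset estimates against the combined one gives a numerical feel for how much the N = 50 posteriors wander relative to the N = 750 posterior.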
In-depth discussion
This was motivated by theoretical discussions here:
- Michael Betancourt. The Fundamental Incompatibility of Hamiltonian Monte Carlo and Data Subsampling.
and the computational issue illustrated in Figure 1 here: