Fitdistr and resample as a strategy for overlarge data sets

I am running a mixed model with stan_glmer. Everything seems to be working quite well, except that my original data set of 1.2 million data points appears to be far too large. Looking around, I see some papers (e.g. http://www.stat.columbia.edu/~gelman/research/unpublished/comp7.pdf) suggesting "sampling" the data, as in randomly partitioning the data into subsets, fitting them separately, and then re-combining the results to get a single posterior inference.

I am considering a simpler strategy. The data consist of 100 groups (nested in various ways) with around 12,000 data points each, and each group's data are well fit by a gamma distribution. So I would rather just:

  1. Estimate the distribution of each of the 100 groups
  2. Make 1000 random draws from the distribution for each group with rgamma
  3. Run the "resampled" data set through Stan, which seems to be computationally feasible

The relevant code (for GROUP_SIZE = 1000) is:

library(MASS)  # provides fitdistr()

est_dist <- fitdistr(data$y, "gamma")  # ML estimates of shape and rate
y_resamp <- rgamma(GROUP_SIZE, shape = est_dist$estimate["shape"], rate = est_dist$estimate["rate"])
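
For concreteness, here is a sketch of how steps 1–3 might look applied across all 100 groups. The column names (y, group), the formula, and the Gamma(link = "log") family in the stan_glmer call are assumptions about your data layout and model, not something taken from the post above:

library(MASS)       # fitdistr()
library(rstanarm)   # stan_glmer()

GROUP_SIZE <- 1000

# For each group: fit a gamma by maximum likelihood, then draw GROUP_SIZE replicates
resampled <- do.call(rbind, lapply(split(data, data$group), function(d) {
  est <- fitdistr(d$y, "gamma")
  data.frame(group = d$group[1],
             y = rgamma(GROUP_SIZE,
                        shape = est$estimate["shape"],
                        rate  = est$estimate["rate"]))
}))

# Fit the multilevel model to the much smaller resampled data set
fit <- stan_glmer(y ~ 1 + (1 | group), data = resampled, family = Gamma(link = "log"))

One caveat with this kind of resampling: it only preserves the marginal gamma fit within each group, so any within-group covariate structure is lost.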

I would be thrilled if anyone can comment on:

- whether this is a feasible strategy
- whether such practices appear in the literature
- whether I should run any additional posterior checks to make sure this resampling process does not distort the data
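
On the last point, one cheap check (a sketch that reuses the data and fit objects from the code above) is to compare each group's empirical distribution against its fitted gamma before resampling, and then run the usual posterior predictive checks on the Stan fit. Note that the KS p-value is only a rough guide here, because the gamma parameters were estimated from the same data:

# How well does the fitted gamma match one group's empirical distribution?
d   <- subset(data, group == unique(data$group)[1])
est <- fitdistr(d$y, "gamma")
ks.test(d$y, "pgamma", shape = est$estimate["shape"], rate = est$estimate["rate"])
qqplot(qgamma(ppoints(length(d$y)),
              shape = est$estimate["shape"],
              rate  = est$estimate["rate"]),
       d$y, xlab = "fitted gamma quantiles", ylab = "empirical quantiles")

# Standard posterior predictive check on the model fitted to the resampled data
pp_check(fit)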

I want to add: if you have a huge amount of data, what benefits does the Bayesian approach give us?

Distributions rather than point estimates? Flexible use of alternative distributions? Credible intervals? Using posterior draws in subsequent processing?

Sure, all true, but do we need a "prior distribution"? Doesn't the data speak for itself?

Two things:

  1. There is always a prior distribution when using statistical inference; it is just implicit in some methods (for example, maximum likelihood behaves like Bayesian inference with a flat prior). It is good to be very aware of the assumptions of every method, whether they are explicitly specified or not.

  2. My model is multilevel, and at some of the higher levels there are only a few groups, so at that point the assumptions can be quite important.

Check this paper, https://arxiv.org/abs/1412.4869, for one possible approach and for references to other potentially useful methods.

Aki

Stan only requires the posterior to be proper. The situation @ericbarnhill is talking about is one where we almost always want to use a hierarchical model (low counts among exchangeable groups).

For a fixed model, as your data size goes to infinity, the posterior converges to a delta function (there are some assumptions in there about parameters not also growing and compatibility with the prior to ensure concentration of measure).
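
A toy illustration of that concentration (a sketch, not from the thread, using a conjugate normal model rather than the gamma case above): the posterior standard deviation shrinks toward zero as n grows, regardless of the prior scale.

# Posterior sd for a normal mean with known sigma and a N(0, tau0^2) prior
posterior_sd <- function(n, sigma = 1, tau0 = 1) sqrt(1 / (1 / tau0^2 + n / sigma^2))
sapply(c(1e2, 1e4, 1e6), posterior_sd)
# ~0.0995, 0.0100, 0.0010 -- with 1.2 million points the posterior is nearly a point mass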