Fitdistr and resample as a strategy for overlarge data sets

I am running a mixed model with stan_glmer. Everything seems to be working quite well, except that my original data set of 1.2 million data points appears to be far too large. Looking around, I see some papers (e.g. http://www.stat.columbia.edu/~gelman/research/unpublished/comp7.pdf) suggesting "sampling" the data, as in randomly partitioning the data into subsets, fitting them separately, and then re-combining the results to get a single posterior inference.

I am considering a simpler strategy. The data consist of 100 groups (nested in various ways) with around 12,000 data points each, and each group's data are well fit by a gamma distribution. So I would rather just:

  1. Estimate the distribution of each of the 100 groups
  2. Make 1000 random draws from the distribution for each group with rgamma
  3. Run the "resampled" data set through Stan, which seems to be computationally feasible

The relevant code (for GROUP_SIZE = 1000) is:

library(MASS)  # provides fitdistr()

est_dist <- fitdistr(data$y, "gamma")  # ML estimates of shape and rate
y_resamp <- rgamma(GROUP_SIZE, shape = est_dist$estimate["shape"], rate = est_dist$estimate["rate"])
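
For concreteness, here is a sketch of how steps 1–3 might look applied across all 100 groups. The column names (y, group), the formula, and the Gamma(link = "log") family in the stan_glmer call are assumptions about your data layout and model, not something taken from the post above:

library(MASS)       # fitdistr()
library(rstanarm)   # stan_glmer()

GROUP_SIZE <- 1000

# For each group: fit a gamma by maximum likelihood, then draw GROUP_SIZE replicates
resampled <- do.call(rbind, lapply(split(data, data$group), function(d) {
  est <- fitdistr(d$y, "gamma")
  data.frame(group = d$group[1],
             y = rgamma(GROUP_SIZE,
                        shape = est$estimate["shape"],
                        rate  = est$estimate["rate"]))
}))

# Fit the multilevel model to the much smaller resampled data set
fit <- stan_glmer(y ~ 1 + (1 | group), data = resampled, family = Gamma(link = "log"))

One caveat with this kind of resampling: it only preserves the marginal gamma fit within each group, so any within-group covariate structure is lost.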

I would be thrilled if anyone can comment on:

- whether this is a feasible strategy
- whether such practices appear in the literature
- whether I should run any additional posterior checks to make sure this resampling process does not distort the data
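
On the last point, one cheap check (a sketch that reuses the data and fit objects from the code above) is to compare each group's empirical distribution against its fitted gamma before resampling, and then run the usual posterior predictive checks on the Stan fit. Note that the KS p-value is only a rough guide here, because the gamma parameters were estimated from the same data:

# How well does the fitted gamma match one group's empirical distribution?
d   <- subset(data, group == unique(data$group)[1])
est <- fitdistr(d$y, "gamma")
ks.test(d$y, "pgamma", shape = est$estimate["shape"], rate = est$estimate["rate"])
qqplot(qgamma(ppoints(length(d$y)),
              shape = est$estimate["shape"],
              rate  = est$estimate["rate"]),
       d$y, xlab = "fitted gamma quantiles", ylab = "empirical quantiles")

# Standard posterior predictive check on the model fitted to the resampled data
pp_check(fit)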

I want to add: if you have a huge amount of data, what benefits does the Bayesian approach give us?

Distributions rather than point estimates? Flexible use of alternative distributions? Credible intervals? Using posterior draws in subsequent processing?

Sure, all true, but do we need a "prior distribution"? Doesn't the data speak for itself?

Two things:

  1. There is always a prior distribution when using statistical inference; it is just implicit in some methods (for example, maximum likelihood behaves like Bayesian inference with a flat prior). It is good to be very aware of the assumptions of every method, whether they are explicitly specified or not.

  2. My model is multilevel, and at some of the higher levels there are only a few groups, so at that point the assumptions can be quite important.

Check this paper, https://arxiv.org/abs/1412.4869, for one possible approach and for references to other potentially useful methods.

Aki

Stan only requires the posterior to be proper. The situation @ericbarnhill is talking about is one where we almost always want to use a hierarchical model (low counts among exchangeable groups).

For a fixed model, as your data size goes to infinity, the posterior converges to a delta function (there are some assumptions in there about parameters not also growing and compatibility with the prior to ensure concentration of measure).
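
A toy illustration of that concentration (a sketch, not from the thread, using a conjugate normal model rather than the gamma case above): the posterior standard deviation shrinks toward zero as n grows, regardless of the prior scale.

# Posterior sd for a normal mean with known sigma and a N(0, tau0^2) prior
posterior_sd <- function(n, sigma = 1, tau0 = 1) sqrt(1 / (1 / tau0^2 + n / sigma^2))
sapply(c(1e2, 1e4, 1e6), posterior_sd)
# ~0.0995, 0.0100, 0.0010 -- with 1.2 million points the posterior is nearly a point mass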