I am running a mixed model with stan_glmer. Everything seems to be working quite well, except that my original data set of 1.2 million data points appears to be far too large. Looking around, I see papers (e.g. http://www.stat.columbia.edu/~gelman/research/unpublished/comp7.pdf) suggesting "sampling" of the data, as in randomly partitioning it into subsets, running inference on each subset, and then re-combining the results to get a single posterior inference.
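For reference, my understanding of that partition-and-recombine idea, in its simplest "average the sub-posterior draws" form (consensus-style averaging, not the EP scheme that paper actually develops), is something like the sketch below; the formula and column names are stand-ins for my real model:

library(rstanarm)

K <- 10
shards <- split(data, sample(rep(1:K, length.out = nrow(data))))

sub_draws <- lapply(shards, function(d) {
  # placeholder formula/family; assumes every group shows up in every shard
  # so the draw matrices have the same columns
  fit_k <- stan_glmer(y ~ 1 + (1 | group), data = d,
                      family = Gamma(link = "log"))
  as.matrix(fit_k)  # posterior draws for this shard
})

# Naive recombination: average the s-th draw across shards
# (proper consensus Monte Carlo weights by inverse sub-posterior covariance,
#  and the prior should really be raised to the 1/K power in each shard)
combined_draws <- Reduce(`+`, sub_draws) / K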
I am considering a simpler strategy. The data consists of 100 groups (nested in various ways) with around 12,000 data points each. Each group fits a gamma distribution well. Therefore I would rather just:
- Estimate the gamma parameters for each of the 100 groups
- Make 1,000 random draws from each fitted distribution with rgamma
- Run the resulting "resampled" data set through Stan, which seems to be computationally feasible
The relevant code (for GROUP_SIZE = 1000) is:
library(MASS)  # fitdistr() comes from MASS
est_dist <- fitdistr(data$y, "gamma")  # ML estimates of shape and rate
y_resamp <- rgamma(GROUP_SIZE, shape = est_dist$estimate["shape"], rate = est_dist$estimate["rate"])
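To make the question concrete, the full pipeline I have in mind looks roughly like this (the grouping column, formula, and family are stand-ins for the real ones):

library(MASS)      # fitdistr()
library(rstanarm)  # stan_glmer()

GROUP_SIZE <- 1000

# Fit a gamma to each group, then draw GROUP_SIZE synthetic observations from it
resampled <- do.call(rbind, lapply(split(data, data$group), function(d) {
  est <- fitdistr(d$y, "gamma")
  data.frame(group = d$group[1],
             y     = rgamma(GROUP_SIZE,
                            shape = est$estimate["shape"],
                            rate  = est$estimate["rate"]))
}))

# 100 groups x 1,000 draws = 100,000 rows instead of 1.2 million
fit <- stan_glmer(y ~ 1 + (1 | group), data = resampled,
                  family = Gamma(link = "log"))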
I would be thrilled if anyone could comment on:
- whether this is a feasible strategy,
- whether such practices appear in the literature, and
- whether I should run any additional posterior checks to make sure this resampling process doesn't distort the data.
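On the last point, the kind of check I had in mind is something along these lines (a sketch, assuming the objects fit, resampled, and data from the pipeline above):

library(rstanarm)

# Usual posterior predictive check against the resampled data the model saw
pp_check(fit, plotfun = "dens_overlay")

# A cruder direct check against the ORIGINAL observations: compare quantiles
# of the real data, the resampled data, and the posterior predictive draws
y_rep <- posterior_predict(fit, draws = 50)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
round(rbind(original  = quantile(data$y, probs),
            resampled = quantile(resampled$y, probs),
            predicted = quantile(as.vector(y_rep), probs)), 2)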