Hello,
I tried some of the suggestions and finally found a way forward in this problem. I looked at how Yajuan Si, Rob Trangucci, Jonah Sol Gabry, and Andrew Gelman proceed in their recent paper http://www.stat.columbia.edu/~gelman/research/unpublished/modelweighting.pdf and as you can see (pg21) they proceed in 2 steps:
- They first estimate cell-specific means, and record the size of the cell, the variance, etc.
- They then fit the model using aggregate estimates as imputs.
This is definitely compatible with my approach, since it drastically reduces the dimensionality of my data (from 1e6observations, to between 6000 and 3000 cells).
However, to introduce the cell size and cell variance in the second step, they use the prior_covariance
argument to define random effect, using a function rstanarm::mrp_structured
which I believe is not available in my version of rstanarm, Version 2.15.3, packaged: 2017-04-29 06:18:44 UTC
, which I believe is the latest one.
How could I define a similar covariance matrix? I have tried to look at the rstanarm manual, but could not understand how the prior_covariance
argument is defined.
prior_covariance =
rstanarm::mrp_structured(
cell_size = dat_rstanarm$n,
cell_sd = dat_rstanarm$sd_cell,
group_level_scale = 1,
group_level_df = 1
)
Below I include what my code currently looks like:
Observations: 2,907
Variables: 8
$ PROV <fctr> Álava, Álava, Álava, Álava, Álava, Álava, Álava, Álava, Álava, Álava, Ála...
$ EDAD <fctr> 25-29, 25-29, 25-29, 25-29, 25-29, 25-29, 25-29, 25-29, 30-34, 30-34, 30-...
$ gender_woman <int> 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,...
$ edu_cis <fctr> 1.primary_or_less, 2.secondary_early, 3.secondary_upper, 4.upper, 1.prima...
$ sd_cell <dbl> 0.0000000, 0.4495387, 0.5030770, 0.3661515, 0.3779645, 0.4961389, 0.426020...
$ n <int> 3, 87, 82, 252, 7, 40, 56, 273, 103, 108, 306, 69, 109, 351, 5, 166, 142, ...
$ Y <dbl> 0.00000000, 0.27586207, 0.50000000, 0.15873016, 0.14285714, 0.40000000, 0....
$ cell_id <chr> "Álava25-2901.primary_or_less", "Álava25-2902.secondary_early", "Álava25-2...
And here is what my first sketch of code looks like without the prior_covariance
argument.
st_1115 <- stan_glmer(formula = Y~ 1 +
(1|EDAD) +
(1|PROV)+
(1|gender_woman)+
(1|edu_cis),
data = df,iter = 1000, chains =2, cores = 2,
prior_aux = cauchy(0,5),
prior_intercept = normal(0,100,autoscale = FALSE)
)
Thank you in advance!
lmg