To average or not to average ‘add_epred_draws’ output over draws before computing the parameter of interest by a grouping variable?
I have an aggregated data set Test_data with the following structure:
- Year: 1995–2000
- Regions (geographical region): 1–10
- gender: Female, Male
- age_group: 18-29, 30-49, 50+
- n: number of cases
- N: denominator
- logN: log(N), a log transformation of the denominator N
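In case it helps to make this concrete, here is a simulated stand-in with the same structure (the values below are made up, not my real data):

library(dplyr)

set.seed(123)
Test_data <- expand.grid(
  Year      = 1995:2000,
  Regions   = factor(1:10),
  gender    = c("Female", "Male"),
  age_group = c("18-29", "30-49", "50+")
) %>%
  mutate(
    N    = sample(1000:50000, n()),                # denominator
    n    = rnbinom(n(), mu = N / 2000, size = 5),  # number of cases
    logN = log(N)                                  # log of the denominator
  )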
Given these data, I fit a negative binomial regression as follows:
library(rstanarm)

# Negative binomial regression with a random intercept per region
Fit <- stan_glmer(
  n ~ logN + Year + gender + age_group + (1 | Regions),
  family = neg_binomial_2(link = "log"),
  data = Test_data,
  iter = 4000,
  QR = TRUE,
  cores = 4,
  adapt_delta = 0.99
)
I am interested in the incidence rate, for example by year, gender, age_group, and region. From the observed data it is given by IR = (n / N) * 100000.
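So from the observed data alone this is just:

Test_data %>%
  mutate(IR = 100000 * n / N)   # observed incidence rate per 100,000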
I have a new data set called ‘Newdat’ with the same structure but without n; it only has the denominator N, named n_pop (i.e. this is what is called poststratification data in the MRP / “Mister P” literature).
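Schematically, Newdat relates to Test_data like this (a hypothetical sketch; my real poststratification table comes from a separate source):

Newdat <- Test_data %>%
  select(-n) %>%        # no observed case counts in the new data
  rename(n_pop = N)     # the denominator is called n_pop here
# logN is kept, since the model needs it as a predictor when predicting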
Using add_epred_draws() from the tidybayes package, I generate draws from the expectation of the posterior predictive distribution given the model and the new data.
What I’m worried about is how I summarise the add_epred_draws result to get ‘marginal’ estimates, say by year and gender. Do I first need to take the mean over the draws and then summarise by group, or do I summarise the whole set of draws, as below? The former seems like an artificial reduction of the variability in the estimate, because we average out the between-draw variability, while the latter leads to very wide uncertainty intervals. Or am I completely missing something?
library(tidybayes)

group <- c("Year", "gender")

res <- Newdat %>%
  add_epred_draws(Fit, ndraws = 1000) %>%
  group_by(across(all_of(group))) %>%
  summarise(Rate  = 100000 * mean(.epred / n_pop),
            lower = 100000 * quantile(.epred / n_pop, 0.025),
            upper = 100000 * quantile(.epred / n_pop, 0.975))
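For completeness, the first option (averaging over the draws before grouping) would look something like this; a sketch, assuming the .row column that add_epred_draws adds indexes the rows of Newdat:

res_avg <- Newdat %>%
  add_epred_draws(Fit, ndraws = 1000) %>%
  ungroup() %>%
  # one group per poststratification cell: average out the draws first
  group_by(.row, across(all_of(group)), n_pop) %>%
  summarise(epred_mean = mean(.epred), .groups = "drop") %>%
  # then summarise the per-cell means by year and gender; the interval now
  # reflects spread across cells rather than posterior uncertainty
  group_by(across(all_of(group))) %>%
  summarise(Rate  = 100000 * mean(epred_mean / n_pop),
            lower = 100000 * quantile(epred_mean / n_pop, 0.025),
            upper = 100000 * quantile(epred_mean / n_pop, 0.975))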