Hi everyone. My question is about generating the probabilities that are to be reweighted in the context of an MRP analysis. I’m not sure whether my confusion is about the MRP process or about the technical aspects of Stan, but I would appreciate any insights.
Context
For context, I have survey responses from students at a few schools. As is common with surveys, only a small percentage of those asked replied. But because we have administrative data, we have enough demographic information on both the respondents and the population of interest to poststratify the responses back to our intended population. The survey consisted of a number of questions with “choose all that apply to you” options. We treat each question option as a separate 0/1 outcome, to which we fit a multilevel model and then poststratify.
The Stan model code is below. We use a QR reparameterization, collapse the 0/1 outcomes into binomial clicks/views within each 1st-level demographic cell, and center/scale the RHS variables (x and z) to speed up computation.
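For reference, the collapsing step looks roughly like this in R (a sketch: the data frame resp and its columns chose, female, age_cat, and race are hypothetical stand-ins; only school matches a name used in the model):

library(dplyr)

# one row per respondent; collapse the 0/1 answers for one question option
# into binomial clicks/views within each school-by-demographic cell
cells <- resp %>%
  group_by(school, female, age_cat, race) %>%
  summarise(views  = n(),          # cell members who saw the question
            clicks = sum(chose),   # cell members who checked the box
            .groups = "drop")

The dummy columns of x (and the school covariates z) are then built from these cells and centered/scaled before the QR step.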
After fitting the models, we use R to recover the model parameters, multiply by a design matrix, and inv-logit transform to generate predicted probabilities of selecting the answer within each demographic cell (gender, 3 categories of age, and 4 race/ethnicity categories). The head of the design matrix looks like this, where int is aligned with the model’s alpha parameter:
     int female age_m age_h black hispc orace
[1,]   1      0     0     0     1     0     0
[2,]   1      0     0     0     0     1     0
[3,]   1      0     0     0     0     0     1
[4,]   1      0     0     0     0     0     0
[5,]   1      0     1     0     1     0     0
...
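In R, the prediction step then looks roughly like this (a sketch: fit is the stanfit object, M is the full design matrix above, and cell_school is a hypothetical integer vector mapping each row of M to its school):

post <- rstan::extract(fit)            # posterior draws as a named list

# slope part: (draws x cells), using the back-transformed beta;
# drop the int column because alpha supplies the intercept
eta <- post$beta %*% t(M[, -1])

# add each cell's school intercept, draw by draw
eta <- eta + post$alpha[, cell_school]

pi_draws <- plogis(eta)                # inv-logit: (draws x cells) probabilities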
We then reweight by the cell population counts, per the standard MRP formula:
\theta_{pop} = \frac{\sum_j \pi_j N_j}{\sum_j N_j}
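i.e., in R, with N_pop a vector of cell population counts aligned with the columns of pi_draws (name hypothetical):

theta_pop <- as.vector(pi_draws %*% N_pop) / sum(N_pop)  # one estimate per draw
quantile(theta_pop, c(0.025, 0.5, 0.975))                # median and 95% interval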
Question
Based on the way we’ve parameterized the model,
// 2nd level (school-level)
alpha ~ normal(z * gamma + mu_alpha, sigma_alpha);
// 1st level (student-level cells)
clicks ~ binomial_logit(views, Qx * theta + alpha[school]);
should we include the 2nd-level parameters (gamma) and the 2nd-level covariates in the design matrix when computing the cell probabilities? Or should we include only alpha and beta, the 1st-level parameters, as I’ve shown above?
I’ve done it both ways, and the posterior credible intervals are much wider when the 2nd-level parameters and covariates are included, which is contrary to my understanding of the benefits of 2nd-level predictors in an MRP analysis.
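For concreteness, the two versions of the cell-level linear predictor look roughly like this (same hypothetical objects as above, plus Z_cell, a cells-by-L matrix holding each cell’s school covariates z):

# version 1: 1st-level parameters (alpha, beta) only
eta1 <- post$beta %*% t(M[, -1]) + post$alpha[, cell_school]

# version 2: also including the 2nd-level parameters (gamma) via z
eta2 <- eta1 + post$gamma %*% t(Z_cell)

pi1 <- plogis(eta1)
pi2 <- plogis(eta2)    # this version yields the much wider intervals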
Again, thanks for any help you can offer.
Stan code
data {
  int<lower=1> N;   // rows in 1st-level matrix
  int<lower=1> K;   // cols in 1st-level matrix
  int<lower=1> J;   // rows in 2nd-level matrix
  int<lower=1> L;   // cols in 2nd-level matrix
  int views[N];     // number of cell members who saw question
  int clicks[N];    // number of cell members who clicked box
  int school[N];    // school indicators (at 1st level)
  matrix[N,K] x;    // 1st-level matrix (cells)
  matrix[J,L] z;    // 2nd-level matrix (school-level covariates)
}
transformed data {
  // Using the QR decomposition for the 1st level (x),
  // but not the 2nd level (z) because there are more columns than rows.
  // Set up matrices for Q, R, and R^-1
  matrix[N,K] Qx;
  matrix[K,K] Rx;
  matrix[K,K] Rxi;
  // thin and scale the QR decomposition
  Qx = qr_thin_Q(x) * sqrt(N - 1);
  Rx = qr_thin_R(x) / sqrt(N - 1);
  Rxi = inverse(Rx);
}
parameters {
  real mu_alpha;              // 2nd-level
  real<lower=0> sigma_alpha;  // 2nd-level
  vector[J] alpha;            // J intercepts for schools
  vector[L] gamma;            // L second-level parameters
  vector[K] theta;            // K first-level parameters
}
model {
  // Priors:
  mu_alpha ~ normal(0, 1);
  gamma ~ normal(0, 1);
  theta ~ normal(0, 1);
  sigma_alpha ~ cauchy(0, 2.5);
  // Likelihood, 2nd level: school intercepts distributed normally as a
  // function of the school-level covariates
  alpha ~ normal(z * gamma + mu_alpha, sigma_alpha);
  // Likelihood, 1st level: binomial (clicks/views) with logit link
  clicks ~ binomial_logit(views, Qx * theta + alpha[school]);
}
generated quantities {
  // recover beta on the original scale
  vector[K] beta;
  beta = Rxi * theta;   // beta = R^-1 * theta
}
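For anyone skimming the QR bits: with x = Qx * Rx, the likelihood term Qx * theta equals x * beta when theta = Rx * beta, so the generated quantities line beta = Rxi * theta returns beta on the original scale of x. The extracted draws therefore need no further rotation in R (sketch; fit as above):

post <- rstan::extract(fit)
dim(post$beta)    # iterations x K, already back-transformed inside Stan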