MRP: use 2nd-level parameters when generating probabilities?

Hi everyone. My question is about generating the probabilities that are to be reweighted in the context of an MRP analysis. I’m not sure whether my confusion is about the MRP process or about the technical aspects of Stan, but I would appreciate any insights.

Context

For context, I have survey responses from students at a few schools. As is common with surveys, only a small percentage of those asked replied. But because we have administrative data, we have enough demographic information on both the respondents and the population of interest to poststratify the responses back to our intended population. The survey consisted of a number of questions with “choose all that apply to you” options. We treat each question option as a separate 0/1 outcome, fit a multilevel model to each, and poststratify.

The Stan model code is below. We use a QR reparameterization, collapse the 0/1 outcomes into binomial clicks/views within each 1st-level demographic cell, and center/scale the RHS variables (x and z) to speed up computation.

After fitting the models, we use R to recover the model parameters, multiply them by a design matrix, and apply the inverse-logit transform to generate predicted probabilities of selecting the answer within each demographic cell (gender, 3 age categories, and 4 race/ethnicity categories). The head of the design matrix looks like this, where int is aligned with the model’s alpha parameter:

      int female age_m age_h black hispc orace
 [1,]   1      0     0     0     1     0     0
 [2,]   1      0     0     0     0     1     0
 [3,]   1      0     0     0     0     0     1
 [4,]   1      0     0     0     0     0     0
 [5,]   1      0     1     0     1     0     0
 ...

We then reweight using the cell population counts, per the usual MRP formula:

\theta_{\text{pop}} = \frac{\sum_j \pi_j N_j}{\sum_j N_j}
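
Concretely, the R step looks something like this for a single school (a simplified sketch; draws, X_design, and N_cell are placeholder names rather than the actual objects in our scripts):

# simplified sketch of the R post-processing for one school
# draws:    posterior draws, columns aligned with X_design (alpha, then the betas)
# X_design: the poststratification design matrix shown above
# N_cell:   population count for each demographic cell
eta       <- draws %*% t(X_design)               # linear predictor, draws x cells
p_cell    <- plogis(eta)                         # inverse-logit -> cell probabilities
theta_pop <- (p_cell %*% N_cell) / sum(N_cell)   # poststratified estimate, one per draw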

Question

Based on the way we’ve parameterized the model,

// 2nd level (school-level)
alpha ~ normal(z * gamma + mu_alpha, sigma_alpha);
// 1st level (student-level cells)
clicks ~ binomial_logit(views, Qx * theta + alpha[school]);

should we include the 2nd-level parameters (gamma) and the 2nd-level covariates in the design matrix when computing the cell probabilities? Or should we include only alpha and beta, the 1st-level parameters, as I’ve shown above?
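
That is, for a cell in school j, is the right quantity

\pi_{cell} = \text{logit}^{-1}(x_{cell}\beta + \alpha_j),

which uses only the 1st-level terms and the fitted school intercept, or should the linear predictor also carry the 2nd-level regression terms z_j\gamma + \mu_\alpha from the design matrix?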

I’ve done it both ways, and the posterior credible intervals are much wider when I include the 2nd-level parameters and covariates, which is contrary to my understanding of the benefits of 2nd-level predictors in an MRP analysis.

Again, thanks for any help you can offer.

Stan code

data {
  int<lower=1> N; // rows in 1st-level matrix
  int<lower=1> K; // cols in 1st-level matrix
  int<lower=1> J; // rows in 2nd-level matrix
  int<lower=1> L; // cols in 2nd-level matrix
  int views[N];   // number of cell members who saw question
  int clicks[N];  // number of cell members who clicked box
  int school[N];  // school indicators (at 1st level)
  matrix[N,K] x;  // 1st-level matrix (cells)
  matrix[J,L] z;  // 2nd-level matrix (school-level covariates)
}
transformed data {
  // Using the QR decomposition for 1st-level (x), 
  // but not 2nd-level (z) b/c there are more columns than rows

  // set up matrices for {Q}, {R}, and {R}^-1
  matrix[N,K] Qx;
  matrix[K,K] Rx;
  matrix[K,K] Rxi;

  // thin and scale the QR decompositions
  Qx = qr_thin_Q(x) * sqrt(N - 1);
  Rx = qr_thin_R(x) / sqrt(N - 1);
  Rxi = inverse(Rx);
}
parameters {
  real mu_alpha;             // 2nd-level 
  real<lower=0> sigma_alpha; // 2nd-level
  vector[J] alpha;	         // J intercepts for schools
  vector[L] gamma;           // L second-level parameters
  vector[K] theta;           // K first-level parameters
}
model {
  // Priors:
  mu_alpha ~ normal(0,1);
  gamma ~ normal(0,1);
  theta ~ normal(0,1);
  sigma_alpha ~ cauchy(0,2.5);
  
  // Likelihood, 2nd level: school intercepts distributed normally as a function of school-level covariates
  alpha ~ normal(z * gamma + mu_alpha, sigma_alpha);

  // Likelihood 1st level: binomial (clicks/views, prob) with logit link; 
  clicks ~ binomial_logit(views, Qx * theta + alpha[school]);
}
generated quantities {
  // recover beta
  vector[K] beta;      // init vectorized parameters
  beta = Rxi * theta;  // \beta = {R}^-1\theta
}


Sorry, it looks like your question fell through the cracks a bit. Did you manage to resolve it? Maybe @lauren is not too busy and can help?

No worries. Thanks for checking back.

After thinking and reading some more, I believe that if I use the Stan model I included above, in particular the following setup:

// 2nd level (school-level)
alpha ~ normal(z * gamma + mu_alpha, sigma_alpha);
// 1st level (student-level cells)
clicks ~ binomial_logit(views, Qx * theta + alpha[school]);

then I should not include second-level covariates when generating cell probabilities from the posterior parameters: only the first-level cell indicators and the school-specific alpha intercept should be used.

That said, I’ve since changed my Stan model to the following:

data {
  int<lower=1> N; // rows in data matrix
  int<lower=1> K; // cols in data matrix
  int<lower=1> J; // number of schools
  int views[N];   // number of cell members who saw the question
  int clicks[N];  // number of cell members who clicked box
  int school[N];  // school indicators (at 1st level)
  matrix[N,K] x;  // data matrix (1st-level cells and 2nd-level covariates)
}
parameters {
  real alpha_mu;
  real<lower=0> alpha_sigma;
  vector[J] alpha_raw;
  real beta_mu;
  real<lower=0> beta_sigma; 
  vector[K] beta_raw;
}
transformed parameters {
  vector[J] alpha;
  vector[K] beta;
  beta = beta_raw * beta_sigma + beta_mu;
  alpha = alpha_raw * alpha_sigma + alpha_mu;
}
model {
  // non-centered parameterization: standard normal priors on the raw scales
  alpha_raw ~ std_normal();
  beta_raw ~ std_normal();
  // note: alpha_mu, alpha_sigma, beta_mu, beta_sigma are left with Stan's implicit flat priors
  clicks ~ binomial_logit(views, x * beta + alpha[school]);
}

In addition to using the non-centered parameterization, this new code uses the 2nd-level covariates directly to help model the outcome, rather than in the indirect way I modeled them before (where they only helped determine the school intercepts).

// all covariates are in x now: 1st-level indicators & 2nd-level school values
clicks ~ binomial_logit(views, x * beta + alpha[school]);

The model converges better with this code (the divergences are almost entirely gone). My understanding is that with this new setup I should include the 2nd-level covariates in my design matrix when generating predicted probabilities.
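
In that case the poststratification matrix carries the school-level columns as well, something like this in R (a sketch with placeholder names; cell_X, z_school, cell_school, draws_beta, and draws_alpha are illustrative, not our actual objects):

# sketch: expanded design matrix with 2nd-level covariates, then cell probabilities
ps_X   <- cbind(cell_X,                    # 1st-level cell indicators, as before
                z_school[cell_school, ])   # each cell's school-level covariate values
eta    <- draws_beta %*% t(ps_X) +         # draws x cells from the coefficients
          draws_alpha[, cell_school]       # plus the matching school intercepts
p_cell <- plogis(eta)                      # predicted probability for each cell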


Hey, welcome to the community! Sorry I didn’t see this when you first posted.

Do you mind sharing some extra details about the survey design (e.g., is it a stratified cluster sample?) and the population of interest (e.g., all schools in the world, all schools in the district, the schools included in your sample, one school in your sample)? I think your question is about MRP specifically, so if you’re comfortable posting some more details I’m happy to help! :)


Thanks, @lauren, and no worries. I appreciate your help.

For more background: the survey was sent to students at 5 colleges. The target population was students who had left college (not enrolled for a semester or two) despite having made substantial progress towards a degree, had a GPA above 2.0, and had no holds on their accounts that would make them ineligible for re-enrollment. The survey was sent to all students who met these criteria, roughly 30k students, which is to say the full population of such students at these institutions. Only about 2k responded, and about 1,700 fully completed the survey.

Our MRP goals right now are pretty modest. Since we have student-level administrative data (which includes basic demographic data) for all the students to whom we sent the survey, we’d like to poststratify the ~1,700 complete responses back to the initial ~30k students we asked. We hope this will give us a more representative picture of students at the 5 colleges. The schools in our sample educate a fair number of students in the state, but any statements about external validity beyond the five schools, whether to the state as a whole or the nation at large, are beyond our current MRP goals. We also plan to show results at the school level, but that’s more to show what the method can do (which might be useful in a researcher-practitioner partnership) than because it will be of substantive interest to the general reader.

If it helps, our second-level characteristics are contextual effects per Bafumi and Gelman, that is, the means of our first-level predictors at each school (the proportion of women at the school who responded to the question, for example), along with the school’s 150% graduation rate and the average weekly wage for the county in which the school is located.
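
For what it’s worth, the school means come straight from the respondent-level data, roughly like this in R (student_df and its columns are illustrative names, not our actual objects):

# sketch: contextual effects as school-level means of the 1st-level indicators
z_means <- aggregate(cbind(female, age_m, age_h, black, hispc, orace) ~ school,
                     data = student_df, FUN = mean)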


Thanks for the extra information on the design. The first thing I’d say is that what you summarize in your second comment about when you would and wouldn’t include those covariates in the design matrix sounds right. Lax and Phillips (2009) take a similar approach to your first model (http://www.columbia.edu/~jrl2124/Lax_Phillips_Gay_Policy_Responsiveness_2009.pdf, see the appendix). The reason they do it that way is that religion isn’t known at the population level. If you do know your second-level predictors, then I can’t see the harm in including them in the design matrix.

More generally, your MRP approach seems like a good application! One thing I would suggest is taking a closer look at your student-level administrative data: you can include things in your poststratification/design matrix beyond demographics as long as you know them in the target population. You might want to consider binning GPA and adding that to your design matrix; I don’t know for sure, but I would guess that GPA affects survey response and is presumably related to most things you’d ask a student population. :)
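
Something as simple as this would work (an illustrative R snippet; the gpa column and the cut points are made up, so pick whatever bins make sense for your data):

# illustrative: bin GPA so it can be crossed with the demographic cells
admin_df$gpa_bin <- cut(admin_df$gpa,
                        breaks = c(2.0, 2.5, 3.0, 3.5, 4.0),
                        include.lowest = TRUE)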

Let me know if you have any other questions! :)


@lauren, that’s very helpful, thank you. I’ve seen the Lax and Phillips article before, but it’s been a while, so I’ll give it another, closer read. As for your suggestion, I’m not sure we have consistent enough data across schools to add more than demographics to the first level, but I’ll double-check. Thanks again.
