Sum to zero priors with unequally populated categories in MRP

I’m trying to model CCES data (vote outcomes given various demographic variables) using MR (not yet up to the “P” part!) in Stan. Everything works fine–convergence looks good, sampler is fast, parameters make sense, etc.–until I actually introduce the MR bit.

One thing I am confused about in particular is the use of sum-to-zero constraints (enforced via a tight prior around 0 on the sum) on the parameter vector corresponding to a group. I can see why I’d want this, since otherwise I can shift all the group coefficients one way and my intercept the other (adding a constant to the intercept and subtracting it from every group coefficient leaves the likelihood unchanged). But I’m concerned that if the counts in the categories of a group are very unequal, such a constraint lets the categories with fewer counts have more influence than they should. And if the sizes are not approximately equal, wouldn’t the constraint also fail to solve the non-identifiability issue?

One thing I observe in the model below is that the sampler is much faster and the chains mix better when the sum-to-zero prior is removed, but in that case I also get divergences. With the constraint, sampling is slow, mixing is poor, and autocorrelations are large out to many lags, but the divergences go away. In both cases, adding the group has a significant effect on the intercept, which makes me think I still have non-identifiability issues.

Addendum: I tried using a “weighted” sum-to-zero prior by constructing a vector of counts for each category in the grouping, dividing that vector by N, and then putting the prior on the weighted sum:
dot_product(beta_g, weights_g) ~ normal(0, 1e-3);
This is faster and has better Rhats and autocorrelations, but more divergences.
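
Concretely, something like this (just a sketch, tallying the counts in transformed data; counts_Race and weights_Race are illustrative names, and the dot_product line stands in for the sum(beta_Race) prior in the full model below):

transformed data {
  // tally how many observations fall in each Race category
  vector[N_Race] counts_Race = rep_vector(0, N_Race);
  for (n in 1:N) {
    counts_Race[Race[n]] += 1;
  }
  vector[N_Race] weights_Race = counts_Race / N; // divide by the total count
}
model {
  // weighted soft sum-to-zero prior on the group coefficients
  dot_product(beta_Race, weights_Race) ~ normal(0, 1e-3);
}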

The full model code is below. Any thoughts about how to diagnose/think about these issues are most welcome!

data {
  int<lower=2> N_State;
  int<lower=1> N;
  int<lower=1> State[N];
  int<lower=2> N_Education;
  int<lower=1> Education[N];
  int<lower=2> N_Race;
  int<lower=1> Race[N];
  int<lower=2> N_CD;
  int<lower=1> CD[N];
  int<lower=2> N_Sex;
  int<lower=1> Sex[N];
  int K_CD;
  matrix[N_CD,K_CD] X_CD;
  int<lower=1> T[N]; // binomial trial counts
  int<lower=0> S[N]; // binomial success counts
}
transformed data {
  vector[K_CD] mean_X_CD;
  matrix[N_CD,K_CD] centered_X_CD;
  for (k in 1:K_CD)   {
    mean_X_CD[k] = mean(X_CD[,k]);
    centered_X_CD[,k] = X_CD[,k] - mean_X_CD[k];
  }
  matrix[N_CD,K_CD] Q_CD_ast;
  matrix[K_CD,K_CD] R_CD_ast;
  matrix[K_CD,K_CD] R_CD_ast_inverse;
  Q_CD_ast = qr_thin_Q(centered_X_CD) * sqrt(N_CD - 1); // thin QR reparameterization of the centered predictors
  R_CD_ast = qr_thin_R(centered_X_CD) / sqrt(N_CD - 1);
  R_CD_ast_inverse = inverse(R_CD_ast);
}
parameters {
  vector[K_CD] thetaX_CD;
  real alpha;
  real eps_Education;
  real<lower=0> sigma_Race;
  vector<multiplier=sigma_Race>[N_Race] beta_Race; // non-centered via the multiplier
  real eps_Sex;
}
transformed parameters {
  vector[K_CD] betaX_CD;
  betaX_CD = R_CD_ast_inverse * thetaX_CD;
}
model {
  alpha ~ normal(0,1.0);
  thetaX_CD ~ normal(0,1.0);
  eps_Education ~ normal(0, 1.0);
  beta_Race ~ normal(0, sigma_Race);
  sigma_Race ~ normal(0, 2.0);
  sum(beta_Race) ~ normal(0, 1.0e-4); // soft sum-to-zero constraint on the group coefficients
  eps_Sex ~ normal(0, 1.0);
  S ~ binomial_logit(T,
                     alpha
                     + Q_CD_ast[CD] * thetaX_CD
                     + to_vector({eps_Education, -eps_Education}[Education])
                     + beta_Race[Race]
                     + to_vector({eps_Sex, -eps_Sex}[Sex]));
}
generated quantities {
  vector<lower=0, upper=1>[N] yPred
    = inv_logit(alpha
                + centered_X_CD[CD] * betaX_CD
                + to_vector({eps_Education, -eps_Education}[Education])
                + beta_Race[Race]
                + to_vector({eps_Sex, -eps_Sex}[Sex]));
  real<lower=0> SPred[N];
  for (n in 1:N)   {
    SPred[n] = yPred[n] * T[n];
  }
  vector [N] log_lik;
  for (n in 1:N)   {
    log_lik[n] = binomial_logit_lpmf(S[n] | T[n],
                                     alpha
                                     + centered_X_CD[CD][n] * betaX_CD
                                     + {eps_Education, -eps_Education}[Education[n]]
                                     + beta_Race[Race[n]]
                                     + {eps_Sex, -eps_Sex}[Sex[n]]);
  }
}

Is there any particular reason you want to use a sum-to-zero constraint? Varying intercepts should be identified with a weakly informative prior unless they’re in some latent space.

Also with MRP, it’s a lot easier to do this with rstanarm/brms, especially as there are existing tutorials for how to re-weight predictions with these packages and because these packages have very efficient implementations of varying intercepts/slopes. Doing it the hard way is only necessary if you want the learning experience or you’re planning on implementing some new model that these packages can’t do.


Thanks for the reply!

I added the sum-to-zero constraint because without it I thought I had a non-identifiability issue, and because it was suggested in the Stan docs about MRP.
Am I misunderstanding the suggestion in the docs? And, if a sum-to-zero prior is important, does it matter if the number of counts in each category is very unequal?

I’m writing this Stan code (and the JSON) from Haskell, and trying to avoid a runtime dependence on R. So I could use brms or rstanarm to figure things out, and then port it back to the Haskell model-writing bit.

But ultimately I’d like to understand well enough to do it directly. And learn how to diagnose these things.

Update: I did more experimentation. Weighting the sum-to-zero constraint seems to help some, but I still get divergences. Without it, I don’t get divergences, but the chains don’t mix well, I have significant autocorrelations out to very large lags, and my Rhats are all > 1.1. With that constraint, and depending on its strength (how small the standard deviation of the prior is), all those problems go away but I get divergences. Raising adapt_delta helps somewhat, but I still get some divergences.

More updates: The parameterization of the group coefficients was always non-centered (or scaled, at least). But switching to doing it the “old” way, using transformed parameters, improved things quite a bit. I am still getting 1 to 5 divergences and I still don’t understand them, but that’s better than 75 or 900. Also, equally inexplicably, things were worse when I put a normal(0, sigma_group) prior on the transformed variable and much, much better when I switched to a normal(0, 1) prior on the “raw” variable, as sketched below.
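
By the “old” way I mean roughly the following (just a sketch; beta_Race_raw is an illustrative name, and this replaces the multiplier-based declaration above):

parameters {
  real<lower=0> sigma_Race;
  vector[N_Race] beta_Race_raw; // unit-scale "raw" group coefficients
}
transformed parameters {
  vector[N_Race] beta_Race = sigma_Race * beta_Race_raw; // scale up to the actual coefficients
}
model {
  beta_Race_raw ~ normal(0, 1); // prior on the raw variable, not on beta_Race
  sigma_Race ~ normal(0, 2.0);
}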

So, the best version now has the sum-to-zero constraint, weighted by the number of counts in each category divided by the total count. But that constraint is pretty weak: normal(0, 0.1). Using old-style non-centering helps, but only if the prior is put on the raw variable; otherwise things are even worse. And raising adapt_delta also helps. The relevant model lines are sketched below.
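
For reference, the model-block lines of that version look roughly like this (a sketch combining the two earlier sketches, with the same illustrative weights_Race and beta_Race_raw names):

model {
  beta_Race_raw ~ normal(0, 1); // prior on the raw variable
  sigma_Race ~ normal(0, 2.0);
  // weak, count-weighted soft sum-to-zero constraint
  dot_product(beta_Race, weights_Race) ~ normal(0, 0.1);
}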