I’m trying to fit a multilevel regression model. The problem I’ve encountered can be put, I believe, in a very simple framework. Consider the model
y_{ij} \mid \theta_j, \sigma_\epsilon \sim \text{Normal}(\theta_j, \sigma_\epsilon)
\theta_j \mid \mu, \sigma_\theta \sim \text{Normal}(\mu, \sigma_\theta)
with j=1,2,...,J groups and i=1,2,...,n_j individuals within groups. What I’ve figured out is that the posterior variance of \mu differs substantively across different parameterizations of the model.
For example, I get a different posterior variance for \mu when using the noncentered parameterization for \theta_j as
\theta_j = \mu + \sigma_\theta * \theta_j^*
with \theta_j^* \sim \text{Normal}(0,1) 
or when, in addition, I put a hard constraint on \theta_j^*, centering the \theta_j^* at each iteration of the MCMC chains so that they have mean zero, i.e.,
\theta_j = \mu + \sigma_\theta * \tilde{\theta_j}^*
where \tilde{\theta_j}^* = \theta_j^* - \frac{1}{J}\sum_{j=1}^J \theta_j^* and \theta_j^* \sim \text{Normal}(0,1).
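To make the constraint concrete, here is a minimal sketch (in plain Python rather than Stan, with made-up values for \mu, \sigma_\theta, and J) of what the centering does at a single iteration: after subtracting the sample mean, the resulting \theta_j average to exactly \mu.

```python
import random
import statistics

random.seed(0)
J = 50
mu, sigma_theta = 3.0, 0.8  # hypothetical draws of the hyperparameters

# raw standard-normal draws (one iteration's theta_star)
theta_star = [random.gauss(0.0, 1.0) for _ in range(J)]

# hard mean constraint: subtract the sample mean
m = statistics.fmean(theta_star)
theta_star_tilde = [t - m for t in theta_star]

# constrained group means
theta = [mu + sigma_theta * t for t in theta_star_tilde]

# with the constraint, mean(theta) equals mu up to float error
print(abs(statistics.fmean(theta) - mu) < 1e-12)  # prints True
```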
As I was not sure what to make of this, I also ran the three different parameterizations on some simulated data and compared them. The Stan code is given below:
Model 1 (centered parameterization):
data {
  int N;
  int J;
  int jj[N];
  real y[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
  real theta[J];
  real<lower=0> sigma_e;
}
transformed parameters {
  real xb[N];
  for (i in 1:N)
    xb[i] = theta[jj[i]];
}
model {
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  theta ~ normal(mu, sigma);
  sigma_e ~ cauchy(0, 3);
  y ~ normal(xb, sigma_e);
}
generated quantities {
  real theta_mean;
  theta_mean = mean(theta);
}
Model 2 (noncentered parameterization):
data {
  int N;
  int J;
  int jj[N];
  real y[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
  real theta_star[J];
  real<lower=0> sigma_e;
}
transformed parameters {
  real xb[N];
  real theta[J];
  for (j in 1:J)
    theta[j] = mu + sigma * theta_star[j];
  for (i in 1:N)
    xb[i] = theta[jj[i]];
}
model {
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  theta_star ~ normal(0, 1);
  sigma_e ~ cauchy(0, 3);
  y ~ normal(xb, sigma_e);
}
generated quantities {
  real theta_mean;
  theta_mean = mean(theta);
}
Model 3 (noncentered parameterization with hard constraint on mean):
data {
  int N;
  int J;
  int jj[N];
  real y[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
  real theta_star[J];
  real<lower=0> sigma_e;
}
transformed parameters {
  real xb[N];
  real theta[J];
  {
    real tmp_mean;
    tmp_mean = mean(theta_star);
    for (j in 1:J)
      theta[j] = mu + sigma * (theta_star[j] - tmp_mean);
  }
  for (i in 1:N)
    xb[i] = theta[jj[i]];
}
model {
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  theta_star ~ normal(0, 1);
  sigma_e ~ cauchy(0, 3);
  y ~ normal(xb, sigma_e);
}
generated quantities {
  real theta_mean;
  theta_mean = mean(theta);
}
When looking at the posterior mean and standard deviation of \mu, one finds that the standard deviation is much smaller in the model with hard constraints, while the posterior mean is almost identical:
par mean sd 25% 75% n_eff Rhat model
1: mu 2.968 0.12032 2.888 3.046 4000 0.9993 1
2: mu 2.964 0.12198 2.883 3.043 102 1.0230 2
3: mu 2.966 0.01052 2.958 2.973 1824 1.0014 3
4: mu 2.966 0.01061 2.958 2.973 2288 1.0015 4
5: theta_mean 2.966 0.01064 2.958 2.973 4000 0.9994 1
6: theta_mean 2.966 0.01041 2.959 2.972 4000 1.0004 2
7: theta_mean 2.966 0.01052 2.958 2.973 1824 1.0014 3
8: theta_mean 2.966 0.01061 2.958 2.973 2288 1.0015 4
On the other hand, the posterior distribution of theta_mean, i.e., of
\bar\theta = \frac{1}{J} \sum_{j=1}^J \theta_j,
which was sampled in the generated quantities block, is very similar to the posterior of \mu in Model 3, the model with the hard constraint.
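That similarity appears to be exact rather than coincidental: writing out \bar\theta under the Model 3 parameterization,
\bar\theta = \frac{1}{J}\sum_{j=1}^J \theta_j = \frac{1}{J}\sum_{j=1}^J \left(\mu + \sigma_\theta \tilde{\theta_j}^*\right) = \mu + \sigma_\theta \cdot \frac{1}{J}\sum_{j=1}^J \tilde{\theta_j}^* = \mu,
since the centered \tilde{\theta_j}^* sum to zero by construction. So in Model 3 the quantity \mu is identical to \bar\theta at every iteration, which would explain why its posterior sd matches that of theta_mean in all models.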
I’ve also checked the distribution of \sigma_\epsilon, \sigma_\theta, and randomly sampled \theta_j, which don’t show any noticeable differences across the models. (Yet, if I additionally put a hard constraint on the variance of \theta_j^*, the estimated posterior variance of \sigma_\theta is much smaller compared to models in which \theta_j^* is given a standard normal prior without hard constraints.)
For example, for 5 randomly sampled \theta_j's, I get
par mean sd 25% 75% n_eff Rhat model
1: theta[10] 4.326 0.07367 4.277 4.374 4000 0.9998 1
2: theta[10] 4.327 0.07379 4.279 4.376 4000 0.9999 2
3: theta[10] 4.326 0.07181 4.279 4.376 4000 0.9995 3
4: theta[1] 2.811 0.07167 2.762 2.859 4000 1.0003 1
5: theta[1] 2.812 0.07092 2.764 2.859 4000 0.9993 2
6: theta[1] 2.811 0.07090 2.764 2.858 4000 0.9998 3
7: theta[22] 2.543 0.07591 2.493 2.595 4000 1.0006 1
8: theta[22] 2.544 0.07480 2.494 2.593 4000 1.0002 2
9: theta[22] 2.546 0.07564 2.494 2.598 4000 0.9999 3
10: theta[38] 3.668 0.07306 3.618 3.717 4000 0.9992 1
11: theta[38] 3.666 0.07577 3.615 3.717 4000 0.9997 2
12: theta[38] 3.667 0.07532 3.616 3.717 4000 0.9997 3
13: theta[6] 2.395 0.06903 2.348 2.442 4000 0.9996 1
14: theta[6] 2.395 0.06818 2.349 2.440 4000 0.9998 2
15: theta[6] 2.396 0.07011 2.349 2.443 4000 0.9999 3
Thus, it’s really just the posterior of the hyperparameters that is affected.
I am quite confused as, depending on which model I use, the substantive conclusion regarding \mu might change.
Does anyone have an idea why these models differ so much in the posterior of \mu? Is it because the “soft” standard normal prior only “weakly” identifies the model? Or is it that we should interpret \mu differently in models with hard constraints?
Thank you in advance.
The full R code of the simulations is attached: example.R (5.6 KB)
Further Results
I’ve also looked into the unnormalized log posterior (lp__) and the log likelihood. The log posterior differs, understandably, between the centered and noncentered parameterizations, but not between the noncentered models with and without the mean constraint. The log likelihood, computed by adding the following to the generated quantities block,
generated quantities {
  real ll;
  {
    real ll_i[N];
    for (i in 1:N)
      ll_i[i] = normal_lpdf(y[i] | xb[i], sigma_e);
    ll = sum(ll_i);
  }
}
shows no noticeable differences across the models. The plots are below: