Posterior variance of hyperparameters in multilevel model with hard constraints

I’m trying to fit a multilevel regression model. The problem I’ve encountered can be put, I believe, in a very simple framework. Consider the model

y_{ij}|\theta_j, \sigma_\epsilon \sim \text{Normal}(\theta_j, \sigma_\epsilon)
\theta_j |\mu,\sigma_\theta \sim \text{Normal}(\mu,\sigma_\theta)

with j=1,2,...,J groups and i=1,2,...,n_j individuals within groups. What I’ve found is that the posterior variance of \mu differs substantially across different parameterizations of the model.

For example, I get a different posterior variance for \mu

1. when using the non-centered parameterization for \theta_j as
\theta_j = \mu + \sigma_\theta * \theta_j^*
with \theta_j^* \sim \text{Normal}(0,1)

2. or when, in addition, I put a hard constraint on \theta_j^*, centering it at each iteration of the MCMC chains to have a mean of zero, i.e.,
\theta_j = \mu + \sigma_\theta * \tilde{\theta}_j^*
where \tilde{\theta}_j^* = \theta_j^* - \frac{1}{J}\sum_{k=1}^J \theta_k^* and \theta_j^* \sim \text{Normal}(0,1).
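To see mechanically what the hard constraint does, here is a minimal numpy sketch of a single draw (the values of J, \mu, and \sigma_\theta are hypothetical). Subtracting the mean of \theta_j^* forces mean(\theta) = \mu exactly at every draw, so under the constraint \mu coincides with the sample mean of the \theta_j rather than floating freely around it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values for one MCMC draw (J groups).
J = 50
mu, sigma_theta = 3.0, 0.8
theta_star = rng.normal(0.0, 1.0, size=J)

# Plain non-centered parameterization (Model 2 below).
theta_nc = mu + sigma_theta * theta_star

# Non-centered with the hard mean constraint (Model 3 below).
theta_hc = mu + sigma_theta * (theta_star - theta_star.mean())

# Under the hard constraint, mean(theta) equals mu exactly at every draw,
# so the posteriors of mu and theta_mean must coincide.
print(np.isclose(theta_hc.mean(), mu))  # True
print(np.isclose(theta_nc.mean(), mu))  # False in general
```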

As I was not sure, I also ran the three different parameterizations on some simulated data and compared them. The Stan code is given below:

Model 1 (centered parameterization):

data {
  int N;
  int J;
  int jj[N];
  real y[N];
}

parameters {
  real mu;
  real<lower=0> sigma;
  real theta[J];
  real<lower=0> sigma_e;
}

transformed parameters {
  real xb[N];

  for (i in 1:N)
    xb[i] = theta[jj[i]];
}

model {
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  theta ~ normal(mu, sigma);
  sigma_e ~ cauchy(0, 3);
  y ~ normal(xb, sigma_e);
}

generated quantities {
  real theta_mean;
  theta_mean = mean(theta);
}


Model 2 (non-centered parameterization):

data {
  int N;
  int J;
  int jj[N];
  real y[N];
}

parameters {
  real mu;
  real<lower=0> sigma;
  real theta_star[J];
  real<lower=0> sigma_e;
}

transformed parameters {
  real xb[N];
  real theta[J];

  for (j in 1:J)
    theta[j] = mu + sigma * theta_star[j];

  for (i in 1:N)
    xb[i] = theta[jj[i]];
}

model {
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  theta_star ~ normal(0, 1);
  sigma_e ~ cauchy(0, 3);
  y ~ normal(xb, sigma_e);
}

generated quantities {
  real theta_mean;
  theta_mean = mean(theta);
}


Model 3 (non-centered parameterization with hard constraint on mean):

data {
  int N;
  int J;
  int jj[N];
  real y[N];
}

parameters {
  real mu;
  real<lower=0> sigma;
  real theta_star[J];
  real<lower=0> sigma_e;
}

transformed parameters {
  real xb[N];
  real theta[J];

  {
    real tmp_mean;
    tmp_mean = mean(theta_star);
    for (j in 1:J)
      theta[j] = mu + sigma * (theta_star[j] - tmp_mean);
  }

  for (i in 1:N)
    xb[i] = theta[jj[i]];
}

model {
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  theta_star ~ normal(0, 1);
  sigma_e ~ cauchy(0, 3);
  y ~ normal(xb, sigma_e);
}

generated quantities {
  real theta_mean;
  theta_mean = mean(theta);
}


When looking at the posterior mean and standard deviation of \mu, one finds that the standard deviation is much smaller in the model with hard constraints, while the posterior mean is almost identical:

          par  mean      sd   25%   75% n_eff   Rhat model
1:         mu 2.968 0.12032 2.888 3.046  4000 0.9993     1
2:         mu 2.964 0.12198 2.883 3.043   102 1.0230     2
3:         mu 2.966 0.01052 2.958 2.973  1824 1.0014     3
4:         mu 2.966 0.01061 2.958 2.973  2288 1.0015     4
5: theta_mean 2.966 0.01064 2.958 2.973  4000 0.9994     1
6: theta_mean 2.966 0.01041 2.959 2.972  4000 1.0004     2
7: theta_mean 2.966 0.01052 2.958 2.973  1824 1.0014     3
8: theta_mean 2.966 0.01061 2.958 2.973  2288 1.0015     4


On the other hand, the posterior distribution of theta_mean, i.e.,
\bar\theta = \frac{1}{J} \sum_{j=1}^J \theta_j,
which was sampled in the generated quantities block, is very similar to the posterior of \mu in Model 3, which uses the hard constraint.

I’ve also checked the distribution of \sigma_\epsilon, \sigma_\theta, and randomly sampled \theta_j, which don’t show any noticeable differences across the models. (Yet, if I additionally put a hard constraint on the variance of \theta_j^*, the estimated posterior variance of \sigma_\theta is much smaller compared to models in which \theta_j^* is given a standard normal prior without hard constraints.)

For example, for 5 randomly sampled \theta_j's, I get

          par  mean      sd   25%   75% n_eff   Rhat model
1: theta[10] 4.326 0.07367 4.277 4.374  4000 0.9998     1
2: theta[10] 4.327 0.07379 4.279 4.376  4000 0.9999     2
3: theta[10] 4.326 0.07181 4.279 4.376  4000 0.9995     3
4:  theta[1] 2.811 0.07167 2.762 2.859  4000 1.0003     1
5:  theta[1] 2.812 0.07092 2.764 2.859  4000 0.9993     2
6:  theta[1] 2.811 0.07090 2.764 2.858  4000 0.9998     3
7: theta[22] 2.543 0.07591 2.493 2.595  4000 1.0006     1
8: theta[22] 2.544 0.07480 2.494 2.593  4000 1.0002     2
9: theta[22] 2.546 0.07564 2.494 2.598  4000 0.9999     3
10: theta[38] 3.668 0.07306 3.618 3.717  4000 0.9992     1
11: theta[38] 3.666 0.07577 3.615 3.717  4000 0.9997     2
12: theta[38] 3.667 0.07532 3.616 3.717  4000 0.9997     3
13:  theta[6] 2.395 0.06903 2.348 2.442  4000 0.9996     1
14:  theta[6] 2.395 0.06818 2.349 2.440  4000 0.9998     2
15:  theta[6] 2.396 0.07011 2.349 2.443  4000 0.9999     3


Thus, it’s really just the posterior of the hyperparameters that is affected.

I am quite confused as, depending on which model I use, the substantive conclusion regarding \mu might change.

Does anyone have an idea why these models differ so much in the posterior of the hyperparameters? Is it because the “soft” standard normal prior only “weakly” identifies the model? Or is it the case that we should interpret \mu differently in models with hard constraints?

The full R code of the simulations is attached.
example.R (5.6 KB)

Further Results

I’ve also looked into the unnormalized log-posterior (lp__) and the log-likelihood. The log-posterior differs, understandably, between the centered and non-centered parameterizations, but not between the non-centered models with and without the mean constraint. The log-likelihood, computed by adding the following to the generated quantities block

generated quantities {
  real ll;

  {
    real ll_i[N];
    for (i in 1:N)
      ll_i[i] = normal_lpdf(y[i] | xb[i], sigma_e);
    ll = sum(ll_i);
  }
}


shows no noticeable differences across the models. The plots are below:

Where does the mean constraint come from? As far as I can tell, that’s just a different model and so you’d expect the posterior to be different.

Do the regular centered vs. non-centered parameterizations agree? Are you free of divergences and treedepth warnings on both of those parameterizations?

Thank you so much @bbbales2 for the reply!

Where does the mean constraint come from? As far as I can tell, that’s just a different model and so you’d expect the posterior to be different.

Yes, that makes sense. I think I was confused by the fact that everything else (the posterior distributions of \theta_j, \sigma_\epsilon, the likelihood, etc.) looked the same. The restriction that every draw of the \theta_j has to have a mean of zero is not justified in the model I am running and thus imposes an artificial constraint on the model.

Do the regular centered vs. non-centered parameterizations agree? Are you free of divergences and treedepth warnings on both of those parameterizations?

Yes. There are no divergent transitions or transitions that exceed the maximum treedepth, and the models agree on the posterior distributions of the parameters I’ve checked.

Another question I have is about the theta_mean variable created in the generated quantities block. Suppose I am interested in the mean of the \theta_j's. The model postulates that \mu is the mean of the distribution from which the \theta_j come, so \mu would be the parameter of interest. On the other hand, we can also calculate \bar\theta = \frac{1}{J}\sum_{j=1}^J \theta_j and generate posterior draws for \bar\theta. When you compare the results, quoted again below, where Model 1 is the centered parameterization and Model 2 the non-centered parameterization (both with no constraints), we see that \bar\theta and \mu have different distributions.

          par  mean      sd   25%   75% n_eff   Rhat model
1:         mu 2.968 0.12032 2.888 3.046  4000 0.9993     1
2:         mu 2.964 0.12198 2.883 3.043   102 1.0230     2
5: theta_mean 2.966 0.01064 2.958 2.973  4000 0.9994     1
6: theta_mean 2.966 0.01041 2.959 2.972  4000 1.0004     2


Is it correct to interpret this difference analogously to that of a “sample mean” and a “population mean”? That is, the posterior distribution of \mu summarizes the uncertainty we have about the mean of the distribution from which the \theta_j's are drawn. The posterior distribution of \bar\theta, on the other hand, summarizes the uncertainty about the mean of the specific groups j=1,...,J we have in our data?
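One way to see this numerically: in a conjugate sketch with a flat prior on \mu, the conditional posterior is \mu | \theta, \sigma_\theta \sim \text{Normal}(\bar\theta, \sigma_\theta/\sqrt{J}), so the posterior sd of \mu adds a \sigma_\theta^2/J term on top of the uncertainty in \bar\theta. A minimal simulation, where J, \sigma_\theta, and the sd of \bar\theta are hypothetical values chosen to roughly match the magnitudes in the tables:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values: with a flat prior on mu, conditionally
# mu | theta, sigma_theta ~ Normal(theta_bar, sigma_theta / sqrt(J)).
J = 50
sigma_theta = 0.8
n_draws = 40_000

# theta_bar is pinned down precisely by the data (sd ~ 0.01, as in the tables)...
theta_bar_draws = rng.normal(2.966, 0.0105, size=n_draws)

# ...while mu carries the extra between-group uncertainty sigma_theta / sqrt(J).
mu_draws = rng.normal(theta_bar_draws, sigma_theta / np.sqrt(J))

print(theta_bar_draws.std())  # ~0.01: uncertainty about the J groups at hand
print(mu_draws.std())         # ~0.11: adds sigma_theta^2 / J for the population mean
```

Under these assumed values the sd of \mu is roughly ten times that of \bar\theta, matching the pattern in the tables: the two posteriors answer the sample-mean and population-mean questions, respectively.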

A similar problem of interpretation arises when, for example, running a multilevel regression model of the form

\pi_{it} = \text{logit}^{-1}(\alpha_i + \beta_i x_{it})
y_{it} \sim \text{Bernoulli}(\pi_{it})

where i=1,2,...,I stands for individuals and t=1,2,...,T for time, and where \alpha_i \sim N(\mu_\alpha, \sigma_\alpha) and \beta_i \sim N(\mu_\beta, \sigma_\beta). Let t^* be a placeholder for the measured time points, and suppose we want to plot the “average” time trend across individuals.

One way to do so is to calculate \text{logit}^{-1}(\mu_\alpha + \mu_\beta t^*), where accompanying uncertainty estimates of the trends are calculated using joint draws from (\mu_\alpha, \mu_\beta). Another way would be to calculate \frac{1}{I}\sum_{i} \text{logit}^{-1}(\alpha_i + \beta_i t^*) for each posterior draw from (\alpha_1, \beta_1, \alpha_2, \beta_2, ..., \alpha_I, \beta_I) and summarize our uncertainty according to that distribution.

As in the simple ANOVA example, the posterior distributions of these two “averages” will differ. I am a little confused about how to interpret the difference between these results and which approach we should base our inference on when the interest lies in the average trend across individuals. Returning to the analogy between sample and population means, can we say that the first approach is appropriate when we assume the individuals come from a broader population and we want to make inference about that population, whereas the second approach is adequate when we treat the data at hand as the population?
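Note that in the logistic case the two averages differ even with exact knowledge of all parameters, because \text{logit}^{-1} is nonlinear (Jensen's inequality): the curve of the “average individual” is not the average of the individual curves. A small numpy sketch with hypothetical population values for \mu_\alpha, \sigma_\alpha, \mu_\beta, \sigma_\beta:

```python
import numpy as np

rng = np.random.default_rng(2)

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical population values for the logistic trend example.
I = 100_000
mu_a, sigma_a = -1.0, 1.0
mu_b, sigma_b = 1.0, 0.3
alpha = rng.normal(mu_a, sigma_a, size=I)
beta = rng.normal(mu_b, sigma_b, size=I)

t_star = 2.0

# Curve of the "average individual": plug in the hypermeans.
avg_individual = inv_logit(mu_a + mu_b * t_star)

# Average of the individual curves: differs because inv_logit is nonlinear.
avg_of_curves = inv_logit(alpha + beta * t_star).mean()

# The plug-in curve overstates the population-averaged probability here,
# since inv_logit is concave above 0.5.
print(avg_individual, avg_of_curves)
```

So the choice between the two summaries is not only about sample-mean vs. population-mean uncertainty; they also estimate genuinely different quantities on the probability scale.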