I was curious about how the posterior would differ between enforcing the sum-to-zero constraint and not. So I’m going to do a little simulation, because it’s easier than doing real math.
Simple hierarchical model
Let’s consider a simple, canonical random-effects model for binary data organized into K groups.
\mu \sim \textrm{normal}(0, 1)
\\
\sigma \sim \textrm{lognormal}(0, 1)
\\
\alpha_k \sim \textrm{normal}(\mu, \sigma)
\\
y_{k, n} \sim \textrm{Bernoulli}(\textrm{logit}^{-1}(\alpha_k))
This is essentially the same model as I used in my case study, Hierarchical Partial Pooling for Repeated Binary Trials. If you’re not familiar with these models, you might also want to check out Betancourt and Girolami’s paper, Hamiltonian Monte Carlo for Hierarchical Models, which discusses what all the fuss is about.
An alternative formulation of the same model keeps the same priors on \mu and \sigma, but moves the global mean parameter \mu from acting as the prior location parameter to an additive term in the sampling distribution.
\alpha_k \sim \textrm{normal}(0, \sigma)
\\
y_{k, n} \sim \textrm{Bernoulli}(\textrm{logit}^{-1}(\mu + \alpha_k))
It’s an alternative formulation of the same model in the sense that the posteriors on \mu and \sigma will be identical, and the posterior on \mu + \alpha_k here will match the posterior on \alpha_k above.
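As a quick sanity check on that equivalence, here’s a prior-predictive simulation in Python (my own sketch, not part of the models above): the logit-scale effect is \alpha_k in the first formulation and \mu + \alpha_k in the second, and the two should agree in distribution.

```python
import numpy as np

# Prior-predictive check (illustrative sketch): draw (mu, sigma) from their
# priors, then compare the logit-scale effect under the two formulations.
rng = np.random.default_rng(0)
S = 1_000_000

mu = rng.normal(0, 1, S)
sigma = np.exp(rng.normal(0, 1, S))  # lognormal(0, 1)

effect1 = rng.normal(mu, sigma)            # formulation 1: alpha_k ~ normal(mu, sigma)
effect2 = mu + sigma * rng.normal(size=S)  # formulation 2: mu + alpha_k, alpha_k ~ normal(0, sigma)

# the two Monte Carlo samples should match in mean and sd
print(np.mean(effect1), np.mean(effect2))
print(np.std(effect1), np.std(effect2))
```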
Stan code
There are a lot of ways to code this model in Stan. Here’s the traditional parameterization that codes the first model pretty much as written.
data {
  int<lower=0> K;
  int<lower=0> N;
  array[K, N] int<lower=0, upper=1> y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector<offset=mu, multiplier=sigma>[K] alpha;
}
model {
  alpha ~ normal(mu, sigma);
  mu ~ normal(0, 1);
  sigma ~ lognormal(0, 1);
  for (k in 1:K)
    y[k] ~ bernoulli_logit(alpha[k]);
}
The variable alpha is declared with an offset of mu and a multiplier of sigma, which provides a non-centered parameterization under the hood.
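To unpack what that buys you (my gloss on the non-centered parameterization, not Stan’s internals verbatim): the sampler effectively moves a standardized vector and shifts and scales it back, so the geometry it sees stays roughly unit scale even when sigma is small.

```python
import numpy as np

# Sketch of the non-centered parameterization: the sampler moves a
# standardized alpha_std, which is mapped back by alpha = mu + sigma * alpha_std,
# mirroring Stan's <offset=mu, multiplier=sigma> declaration.
rng = np.random.default_rng(1)
K, mu, sigma = 20, 1.3, 2.3     # illustrative values matching the later simulation

alpha_std = rng.normal(size=K)  # unit-scale draws the sampler sees
alpha = mu + sigma * alpha_std  # implied centered-scale coefficients
```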
In contrast, here’s a slightly different, but closely related model that uses only K parameters instead of K + 1 by setting
\alpha_K = -\textrm{sum}(\alpha_{1:K-1})
and adjusting the prior scale accordingly; see the Stan User’s Guide, section 1.7 on parameterizing centered vectors for more details on the adjustment.
data {
  int<lower=0> K;
  int<lower=0> N;
  array[K, N] int<lower=0, upper=1> y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[K - 1] alpha_pre;
}
transformed parameters {
  vector[K] alpha = mu + append_row(alpha_pre, -sum(alpha_pre));
}
model {
  alpha ~ normal(0, inv(sqrt(1 - inv(K))) * sigma);
  mu ~ normal(0, 1);
  sigma ~ lognormal(0, 1);
  for (k in 1:K)
    y[k] ~ bernoulli_logit(alpha[k]);
}
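Here’s where the inv(sqrt(1 - inv(K))) factor comes from, as I understand the User’s Guide adjustment (the sketch below is mine): each component of a zero-sum length-K normal vector has marginal standard deviation sigma * sqrt(1 - 1/K), so the prior scale is multiplied by the reciprocal to restore a marginal scale of sigma.

```python
import numpy as np

# Check the marginal sd of a sum-to-zero normal vector (illustrative sketch):
# centering K iid normal(0, sigma) draws leaves each component with
# marginal sd sigma * sqrt(1 - 1/K).
rng = np.random.default_rng(2)
K, sigma, S = 20, 2.3, 100_000

draws = rng.normal(0, sigma, size=(S, K))
centered = draws - draws.mean(axis=1, keepdims=True)  # each row sums to zero

print(centered[:, 0].std())        # empirical marginal sd
print(sigma * np.sqrt(1 - 1 / K))  # theoretical value
```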
In both cases, I also snuck in a generated quantities block that reports the posterior for the sample mean and standard deviation of the coefficients \alpha:
generated quantities {
  real mean_alpha = mean(alpha);
  real<lower=0> sd_alpha = sd(alpha);
}
Simulated data and fit
Now I’m going to simulate data for K = 20 and N = 200 and fit using cmdstanr with the following R script.
library('cmdstanr')

# inverse logit
ilogit <- function(u) 1 / (1 + exp(-u))

set.seed(1234)

# simulate binary data for K groups of N observations each
K <- 20
mu <- 1.3
sigma <- 2.3
alpha <- rnorm(K, mu, sigma)
N <- 200
y <- matrix(NA, K, N)
for (k in 1:K)
  y[k, ] <- rbinom(N, 1, ilogit(alpha[k]))
data <- list(K = K, N = N, y = y)

# fit the traditional parameterization
mod <- cmdstan_model('hier.stan')
fit <- mod$sample(data = data, chains = 2, parallel_chains = 2)

# fit the sum-to-zero parameterization
modc <- cmdstan_model('hier-ctr.stan')
fitc <- modc$sample(data = data, chains = 2, parallel_chains = 2)
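For readers without R handy, here’s a rough Python mirror of the data simulation (NumPy’s RNG stream differs from R’s set.seed, so the realized alpha values won’t match the ones behind the reported fits).

```python
import numpy as np

# Python analogue of the R data simulation (different RNG stream than R,
# so the numbers will differ from the post's).
rng = np.random.default_rng(1234)
K, N, mu, sigma = 20, 200, 1.3, 2.3

alpha = rng.normal(mu, sigma, K)              # group-level effects
p = 1 / (1 + np.exp(-alpha))                  # ilogit of each effect
y = rng.binomial(1, p[:, None], size=(K, N))  # K x N binary outcomes

# the realized sample mean/sd of alpha need not equal mu/sigma
print(alpha.mean(), alpha.std(ddof=1))
```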
And here are the results (for seed 1234).
> prt(fit)
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
mu 0.53 0.56 0.45 0.41 -0.27 1.24 1.02 130 146
sigma 2.20 2.16 0.38 0.37 1.68 2.92 1.02 152 345
alpha[1] -1.61 -1.61 0.19 0.20 -1.92 -1.30 1.00 2786 1454
alpha[11] 0.29 0.29 0.14 0.14 0.07 0.51 1.00 2684 1168
alpha[20] 5.62 5.49 1.04 1.02 4.16 7.52 1.00 750 1321
mean_alpha 0.69 0.69 0.08 0.08 0.58 0.83 1.00 1141 1266
sd_alpha 2.17 2.15 0.16 0.15 1.93 2.46 1.00 990 1332
> prt(fitc)
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
mu 0.69 0.68 0.08 0.08 0.56 0.82 1.01 575 859
sigma 2.22 2.17 0.41 0.37 1.66 2.98 1.00 1585 1202
alpha[1] -1.62 -1.62 0.19 0.19 -1.95 -1.32 1.00 3043 1547
alpha[11] 0.28 0.28 0.14 0.14 0.05 0.51 1.00 2892 1550
alpha[20] 5.64 5.50 1.12 1.07 4.11 7.69 1.00 755 1064
mean_alpha 0.69 0.68 0.08 0.08 0.56 0.82 1.01 575 859
sd_alpha 2.17 2.15 0.17 0.16 1.93 2.48 1.00 818 1048
All of the parameters other than \mu have indistinguishable posteriors. But in the second model, the posterior mean of \mu is larger and its central 90% interval (q5 to q95) is much narrower.
Did we just get lucky? Let’s try a different seed (123456).
> prt(fit)
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
mu 1.90 1.90 0.49 0.48 1.06 2.66 1.00 250 412
sigma 2.31 2.25 0.46 0.39 1.69 3.14 1.01 303 594
alpha[1] 3.90 3.87 0.49 0.47 3.16 4.74 1.00 2530 1208
alpha[11] -0.94 -0.94 0.16 0.15 -1.20 -0.68 1.00 2193 1619
alpha[20] 2.60 2.59 0.29 0.30 2.16 3.08 1.00 2625 1474
mean_alpha 2.38 2.37 0.13 0.12 2.18 2.60 1.00 1477 1356
sd_alpha 2.24 2.21 0.20 0.17 1.97 2.62 1.00 1226 1259
> prt(fitc)
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
mu 2.38 2.37 0.13 0.13 2.19 2.62 1.01 377 584
sigma 3.22 3.15 0.57 0.53 2.41 4.22 1.00 1047 1246
alpha[1] 3.86 3.81 0.48 0.47 3.16 4.66 1.00 1495 1416
alpha[11] -0.95 -0.95 0.16 0.16 -1.22 -0.69 1.00 3066 1528
alpha[20] 2.60 2.58 0.29 0.27 2.15 3.12 1.00 1971 818
mean_alpha 2.38 2.37 0.13 0.13 2.19 2.62 1.01 377 584
sd_alpha 2.28 2.24 0.23 0.20 2.00 2.72 1.01 348 419
So we can conclude that @betanalpha is right—changing the model to use a sum-to-zero constraint does change the posterior. And arguably the non-identifiable model is the one we want, because as @betanalpha said, this is the one where we can naturally make posterior predictions for new groups. Making predictions for new groups is important. There might not be new states in a political model, but there are new testing sites for Covid modeling, and Andrew and I used posterior predictions for new groups in our Covid test sensitivity and specificity paper.
Help with intuitions?
I’d like to develop some intuition around why these results are so different. Of course, they have completely different correlation structure, but then so does the traditional centered vs. non-centered parameterization (though in that case you can easily show the transformed variables have the same distributions). What I’m really surprised about is that the posteriors for the coefficients \alpha_k don’t change even though the posteriors on \mu and \sigma are so different.
Given that the posteriors for \alpha were indistinguishable between the two models, it’s not surprising that the sample means and standard deviations were identical. But I was surprised that \mu and \sigma didn’t closely track the sample mean and standard deviation of \alpha. With both seeds, the posterior mean of \mu matches the sample mean of \alpha_{1:K} in the second model, but the standard deviations are way off in both cases.
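One mechanical piece of the puzzle (my observation, not from the thread): in the second Stan program, the appended part of alpha sums to zero, so mean(alpha) equals mu exactly in every draw, which is why the mu and mean_alpha rows are identical for fitc. A tiny illustration with made-up numbers:

```python
import numpy as np

# alpha = mu + append_row(alpha_pre, -sum(alpha_pre)) implies mean(alpha) == mu,
# since the appended vector sums to zero. Values are made up to show the identity.
mu = 0.69
alpha_pre = np.array([1.2, -0.4, 0.7])
alpha = mu + np.append(alpha_pre, -alpha_pre.sum())

print(np.isclose(alpha.mean(), mu))
```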