Posterior modality for the balanced one-way-random effects model

Hi,
I am trying to understand the posterior modality by replicating Liu and Hodges (2003). The model is as simple as:
y_{ij}\sim N(\theta_i, \sigma), \quad \theta_i \sim N(\mu, \tau), \quad i=1, \dots, N, \; j=1, \dots, J
With a conjugate prior (inverse gamma) on both \tau and \sigma:
\sigma\sim IG(a, b), \quad \tau\sim IG(\alpha, \beta).

I generate some data where I deliberately enforce a large between-group variation. According to the explicit math expressions, I would expect a bimodal posterior. However, the Stan posterior is unimodal, and all diagnostics seem fine.

balanced_one_way.R (1.6 KB)
random_effect.stan (342 Bytes)

On the one hand, I am confused about why they do not match; it is possible that only one mode is sampled. On the other hand, it is also not too surprising, since we seldom hear “you should worry about posterior bimodality in a hierarchical model when the between-group variation is large”, or indeed any complaint about bimodality in eight schools.

edit:
I figured out that it is practically hard to tell whether a posterior sample is “bimodal” or “unimodal” just by looking at the histogram. Chances are that it is a bimodal distribution but one of the modes is too tiny to even be visible. A bimodal distribution does not necessarily imply a large between-mode barrier.

It also reminds me to think about parameterization.

Again I generate data from the following code:

set.seed(100)
N = 10       # number of groups
J = 5        # number of units per group, denoted by "n" in Liu and Hodges
tau = 1      # group-level sd
mu = 0       # group mean
sigma = 0.1  # individual-level sd
theta = rnorm(N, mu, sd = tau)  # latent group means
theta = theta + rt(N, 1) * 10   # add some extra between-group variation
y = matrix(NA, N, J)            # observed units
for (i in 1:N)
  y[i, ] = rnorm(J, theta[i], sd = sigma)
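To see concretely how extreme the between-group variation is here, a Python analogue of the generator above (different RNG stream, so the numbers will not match the R run; only the structure is the same) and the resulting between/within sums of squares:

```python
import numpy as np

# Python analogue of the R data generator above (hypothetical seed;
# not the same stream as set.seed(100) in R).
rng = np.random.default_rng(100)
N, J = 10, 5
mu, tau, sigma = 0.0, 1.0, 0.1

theta = rng.normal(mu, tau, N)
theta += rng.standard_t(df=1, size=N) * 10   # heavy-tailed extra between-group spread
y = rng.normal(theta[:, None], sigma, size=(N, J))

group_mean = y.mean(axis=1)
S_b = J * ((group_mean - y.mean()) ** 2).sum()   # between-group sum of squares
S_w = ((y - group_mean[:, None]) ** 2).sum()     # within-group sum of squares
print(S_b / S_w)   # huge: sigma is tiny relative to the theta spread
```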

With the priors \sigma ~ inv-gamma(0.01, 0.01) and \tau ~ inv-gamma(0.0001, 0.0001):

a=b=0.01
alpha=0.0001
beta=0.0001

Now we know it should be a bimodal distribution according to that paper. Intuitively, the bimodality comes from the data-prior conflict under a large between-group variation.

group_mean = rowMeans(y)
global_mean = mean(y)
S_b = sum((group_mean - global_mean)^2) * J   # between-group sum of squares
S_w = 0                                       # within-group sum of squares
for (i in 1:N)
  for (j in 1:J)
    S_w = S_w + (y[i, j] - group_mean[i])^2
F = S_b / (N - 1) / S_w * N * (J - 1)         # classical F statistic
d = N * J + 2 * alpha + 2
f = N + 2 * a + 2
# discriminant that determines whether the cubic g below has two critical points
delta = 4 * (2 * d + f)^2 - 12 * (d + f) * (2 * (f * beta + J * d * b) / S_b + f * N * (J - 1) / (N - 1) / F + d)
delta > 0

# critical points of g, and g evaluated at each of them
phi_1 = (2 * (2 * d + f) - sqrt(delta)) / (d + f) / 6
phi_2 = (2 * (2 * d + f) + sqrt(delta)) / (d + f) / 6
g_1 = (d + f) * S_b * phi_1^3 - S_b * (2 * d + f) * phi_1^2 + (2 * beta * f + f * S_w + 2 * J * d * b + d * S_b) * phi_1 - f * (2 * beta + S_w)
g_2 = (d + f) * S_b * phi_2^3 - S_b * (2 * d + f) * phi_2^2 + (2 * beta * f + f * S_w + 2 * J * d * b + d * S_b) * phi_2 - f * (2 * beta + S_w)
(g_1 * g_2 < 0)
 
## Theorem 1: if (delta > 0) and (g_1 * g_2 < 0), the joint posterior has two modes.
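Since the R check packs the whole criterion into a few dense lines, here is a numerically equivalent sketch in Python. The `S_b` and `S_w` values are made up (standing in for the ones computed from the data, with `S_b >> S_w` mimicking large between-group variation); the algebra is the same transcription of Theorem 1 as above:

```python
import numpy as np

# Hypothetical summary statistics; in the post these come from the data.
N, J = 10, 5
a = b = 0.01
alpha = beta = 1e-4
S_b, S_w = 5000.0, 0.45

d = N * J + 2 * alpha + 2
f = N + 2 * a + 2

# The cubic g(phi) whose sign pattern at its critical points decides modality.
c1 = 2 * beta * f + f * S_w + 2 * J * d * b + d * S_b
g = np.poly1d([(d + f) * S_b,
               -S_b * (2 * d + f),
               c1,
               -f * (2 * beta + S_w)])

# Critical points of g: closed form, as in the R code
# (this delta is algebraically identical to the R one after
# substituting F = S_b / (N - 1) / S_w * N * (J - 1))...
delta = 4 * (2 * d + f) ** 2 - 12 * (d + f) * c1 / S_b
phi1 = (2 * (2 * d + f) - np.sqrt(delta)) / (6 * (d + f))
phi2 = (2 * (2 * d + f) + np.sqrt(delta)) / (6 * (d + f))
# ...and numerically, as the roots of g'.
crit = np.sort(g.deriv().roots)

# Theorem 1 criterion: two modes iff delta > 0 and g changes sign
# between its critical points.
bimodal = delta > 0 and g(phi1) * g(phi2) < 0
print(delta, g(phi1) * g(phi2), bimodal)
```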

First we sample from the non-centered parameterization:

data {
  real a;
  real b;
  real alpha;
  real beta;
  int N;
  int J;
  matrix[N, J] y;
}
parameters {
  vector[N] theta_trans;
  real mu;
  real<lower=0> tau;
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] theta = mu + tau * theta_trans;
}
model {
  for (i in 1:N)
    for (j in 1:J)
      y[i, j] ~ normal(theta[i], sigma);
  theta_trans ~ normal(0, 1);
  sigma ~ inv_gamma(a, b);
  tau ~ inv_gamma(alpha, beta);
}

It is slow, with concerning diagnostics:

Warning messages:
1: There were 5454 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
2: Examine the pairs() plot to diagnose sampling problems

OK, then we use the centered parameterization:

data{
  real a;
  real b;
  real alpha;
  real beta;
  int N;
  int J;
  matrix[N, J] y;
}
parameters{
  vector[N] theta;
  real mu;
  real<lower=0> tau;
  real<lower=0> sigma;
}
model{
  for(i in 1:N)
    for(j in 1:J)
       y[i,j]~normal(theta[i], sigma);
  theta~normal(mu, tau);
  sigma~inv_gamma(a, b);
  tau~inv_gamma(alpha, beta);
}

It is quick, and there are no warnings after 10000 iterations and 4 chains. So this is an example where the naive centered parameterization is much better than the non-centered one, with dimension N = 10.

But in both cases the sampled distribution is unimodal, or at least looks unimodal in a scatter plot. I don’t know whether it is possible that the naive centered parameterization totally missed one of the two modes, which at least theoretically exist, and whether such a potential failure would go uncaptured by the diagnostics.
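On the “one mode may be too tiny to see” point from the first edit, a quick toy illustration (made-up mixture, not the actual posterior): a distribution that puts 0.1% of its mass on a second mode near zero has a histogram that looks indistinguishable from unimodal.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy stand-in for a tau posterior: a lognormal mixture with a dominant
# mode near 10 and a second mode near 0.05 carrying only 0.1% of the mass.
w, n = 0.001, 100_000
small = rng.random(n) < w
draws = np.where(small,
                 rng.lognormal(np.log(0.05), 0.2, n),
                 rng.lognormal(np.log(10.0), 0.2, n))

counts, edges = np.histogram(draws, bins=50)
# The bin containing the tiny mode holds ~0.1% of the draws: genuinely a
# second mode of the density, but invisible on a linear-scale histogram.
print(counts[:3])
```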

second edit:
I realize all the math formulas in this paper are about the “profile” posterior, which can be very different from the marginal density. So the theoretical results do not imply a practical concern. In fact, we can almost always construct posterior bimodality of the joint density by having a large between-group variation. But that also inflates \tau and eliminates the funnel in a helpful way. Then, yes, there are going to be two modes, but one of them (corresponding to a tiny \tau) is ignorable because (1) it has a tiny density, since the data require a large between-group variation, so it does not affect the final prediction we care about; and (2) it is still connected to the right mode, so it does not even hurt the computation.

In general, I think multimodal and metastable are different things. The example above is bimodal but not metastable. On the other hand, a high-dimensional funnel can be unimodal, yet MCMC can get stuck in the mouth or the neck, making those two parts metastable.

I would imagine there could be an analogous paper to this one, on metastability or bounds on HMC mixing time in the same setting. What is needed there is to compute the curvature, rather than the mode density.
