It also reminds me to think about parametrization.

Again I generate data from the following code:

```
set.seed(100)
N=10 # number of groups
J=5 #number of units in each group, denoted by ``n'' in the Liu and Hodges.
tau=1 # group-level sd
mu=0 # group mean
sigma=0.1 # individual level sd
theta=rnorm(N,mu,sd=tau) # latent group variables
theta=theta+rt(N,1)*10 # add some extra between-group-variation
y=matrix(NA, N, J) # observed units
for(i in 1:N)
y[i,]=rnorm(J,theta[i], sd=sigma)
```

With prior, \sigma~inv-gamma(0.01, 0.01) \tau~inv~gamma(0.0001, 0.0001)

```
a=b=0.01
alpha=0.0001
beta=0.0001
```

Now we know it should a bimodal distribution according to that paper. Intuitively the bimodality comes from the data-prior conlifct with a large between group variation.

```
group_mean=rowMeans(y)
global_mean=mean(y)
S_b=sum( (group_mean-global_mean)^2)*J
S_w=0
for(i in 1:N)
for (j in 1:J)
S_w=S_w+(y[i,j]-group_mean[i])^2
F=S_b/(N-1)/S_w*N*(J-1)
d=N*J+2*alpha+2
f=N+2*a+2
delta=4*(2*d+f)^2-12*(d+f)*(2*(f*beta+J*d*b)/S_b+f*N*(J-1)/(N-1)/F+d)
delta>0
phi_1=(2*(2*d+f)-sqrt(delta))/(d+f)/6
phi_2=(2*(2*d+f)+sqrt(delta))/(d+f)/6
g_1=(d+f)*S_b*phi_1^3-S_b*(2*d+f)*phi_1^2+(2*beta*f+f*S_w+2*J*d*b+d*S_b)*phi_1-f*(2*beta+S_w)
g_2=(d+f)*S_b*phi_2^3-S_b*(2*d+f)*phi_2^2+(2*beta*f+f*S_w+2*J*d*b+d*S_b)*phi_2-f*(2*beta+S_w)
(g_1*g_2<0)
## Theorem 1: If (delta>0) and (g_1*g_2<0), The joint has two modes.
```

First we sample from the non-centered parameterization:

```
data{
real a;
real b;
real alpha;
real beta;
int N;
int J;
matrix[N,J] y;
}
parameters{
vector[N] theta_trans;
real mu;
real<lower=0> tau;
real<lower=0> sigma;
}
transformed parameters{
vector[N] theta;
theta=mu+ tau*theta_trans;
}
model{
for(i in 1:N)
for(j in 1:J)
y[i,j]~normal(theta[i], sigma);
theta_trans~normal(0,1);
sigma~inv_gamma(a, b);
tau~inv_gamma(alpha, beta);
}
```

It is slow and concerning, with the diagnostics:

```
Warning messages:
1: There were 5454 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
2: Examine the pairs() plot to diagnose sampling problems
```

OK then we use centered parameterization:

```
data{
real a;
real b;
real alpha;
real beta;
int N;
int J;
matrix[N,J] y;
}
parameters{
vector[N] theta;
real mu;
real<lower=0> tau;
real<lower=0> sigma;
}
model{
for(i in 1:N)
for(j in 1:J)
y[i,j]~normal(theta[i], sigma);
theta~normal(mu, tau);
sigma~inv_gamma(a, b);
tau~inv_gamma(alpha, beta);
}
```

It is quick, and there is no warning, after 10000 iters and 4 chians. So it is an example where the naive centered parameterization is much better than the non-centered one with dimension =10.

But in both cases the sampling distribution is unimodal, or at least looks like unimodal in scatter plot. I donāt know if it is possible that the naive centered parameterization totally missed one of the two modes, which at least theoretically exist-- and such potential failure is not captured by diagnostics?

second edit:

I realize all the math formula in this paper is about āprofileā posterior, which can be very different from the marginal density. So the theory results do not imply a practical concern. In fact, we can almost always construct posterior bimodality of the joint density by having a large between-group variation. But that indeed inflate \tau and eliminate the funnel in a helpful way. Then yes there are going to be two modes, but one of them (corresponding to a tiny tau) is ignorable because (1) it has a tiny density since the data requires a large between-group variation, so it does not affect the final prediction we care about. and (2) it is still connected to the right mode so it does not even hurt the computation.

In general, I think multimodal and metastable are different. The example above is bimodal but not metastable. On the other hand, a high dimension funnel can be unimodal, but the MCMC can get stuck in the mouth or the neck, making those two parts metastable.

I would imagine there can be an analog paper of this one, talking about the metastability or the bound of HMC mixing time in the same setting. What is needed is to compute the curvature, rather than the mode density.