My understanding is that when fitting hierarchical models, Stan by default initializes all of the parameters to the interval [-2,2] on the unconstrained scale. So for example, if we write down a model like:
parameters{
real mu;
real<lower=0> sigma;
vector[500] v;
}
model{
sigma ~ normal(0, 10);
mu ~ normal(0, 10);
v ~ normal(mu, sigma);
}
then in particular each element of v will be initialized from a uniform distribution on [-2,2].
When initialized in this way, however, the probability mass for mu conditioned on v becomes heavily concentrated near zero. And so we might expect independent HMC chains to all pinch through a region around mu = 0 , even though the posterior distribution for mu in this example is ultimately normal(0,10). Indeed, that seems to be exactly what happens:
library("cmdstanr")
initsCheck_mod <- cmdstan_model("/Users/jacobsocolar/Dropbox/Work/Code/colombiaBeta/stan_sandbox/inits_check.stan")
initsCheck_samples <- initsCheck_mod$sample(
chains = 50,
iter_sampling = 200,
thin = 4,
iter_warmup = 1,
save_warmup = 1,
step_size = .002,
max_treedepth = 3)
d <- initsCheck_samples$draws()
mu <- as.data.frame(d[,,2])
mu$x <- 4*(1:50)
plot(mu[,1] ~ mu$x, xlim = c(0,200), ylim = c(-2,2))
for(i in 2:50){
points(mu$x, mu[,i])
}
Now admittedly this is a silly example, because it uses very poor tuning of the sampler, and there are obviously no pathologies that could pose an issue for sampling if we run this thing for long enough to pass diagnostics.
However, in applications involving complex real-world hierarchical models, it seems problematic(?) if, when we think we have provided overdispersed inits for mu, effectively we have not achieved that. As a Stan user, itās easy to circumvent the problem (if itās even a problem) by specifying overdispersed inits for mu and sigma, and then simulating inits for v outside of Stan based on the initial values of mu and sigma.
My three-part question (below) really boils down to whether it is worth worrying about any of this:
- Does this community agree that effectively Iām not getting overdispersed inits on mu?
- Are overdispersed initial values important in Stan, or is their usefulness superseded by well-tuned NUTS and HMC-specific diagnostics?
- Even with garden-variety MCMC, for a multi-modal model four chains seems like an awfully small number to yield a near-certain chance of initializing at least one chain in the basin of attraction of a different mode, especially if the posterior is high-dimensional. Are overdispersed inits really all that effective for diagnosing problems in the first place (as compared to initializing multiple chains in the same region of parameter space)?