Undesirable(?) behavior in initial values of hierarchical model

My understanding is that when fitting hierarchical models, Stan by default initializes all of the parameters to the interval [-2,2] on the unconstrained scale. So for example, if we write down a model like:

parameters{
  real mu;
  real<lower=0> sigma;
  vector[500] v;
}
model{
  sigma ~ normal(0, 10);
  mu ~ normal(0, 10);
  v ~ normal(mu, sigma);
}

then in particular each element of v will be initialized from a uniform distribution on [-2,2].

When initialized in this way, however, the probability mass for mu conditioned on v becomes heavily concentrated near zero. And so we might expect independent HMC chains to all pinch through a region around mu = 0 , even though the posterior distribution for mu in this example is ultimately normal(0,10). Indeed, that seems to be exactly what happens:

library("cmdstanr")

initsCheck_mod <- cmdstan_model("/Users/jacobsocolar/Dropbox/Work/Code/colombiaBeta/stan_sandbox/inits_check.stan")
initsCheck_samples <- initsCheck_mod$sample(
  chains = 50,
  iter_sampling = 200,
  thin = 4,
  iter_warmup = 1,
  save_warmup = 1,
  step_size = .002,
  max_treedepth = 3)

d <- initsCheck_samples$draws()

mu <- as.data.frame(d[,,2])
mu$x <- 4*(1:50)
plot(mu[,1] ~ mu$x, xlim = c(0,200), ylim = c(-2,2))
for(i in 2:50){
  points(mu$x, mu[,i])
}

Now admittedly this is a silly example, because it uses very poor tuning of the sampler, and there are obviously no pathologies that could pose an issue for sampling if we run this thing for long enough to pass diagnostics.

However, in applications involving complex real-world hierarchical models, it seems problematic(?) if, when we think we have provided overdispersed inits for mu, effectively we have not achieved that. As a Stan user, itā€™s easy to circumvent the problem (if itā€™s even a problem) by specifying overdispersed inits for mu and sigma, and then simulating inits for v outside of Stan based on the initial values of mu and sigma.

My three-part question (below) really boils down to whether it is worth worrying about any of this:

  1. Does this community agree that effectively Iā€™m not getting overdispersed inits on mu?
  2. Are overdispersed initial values important in Stan, or is their usefulness superseded by well-tuned NUTS and HMC-specific diagnostics?
  3. Even with garden-variety MCMC, for a multi-modal model four chains seems like an awfully small number to yield a near-certain chance of initializing at least one chain in the basin of attraction of a different mode, especially if the posterior is high-dimensional. Are overdispersed inits really all that effective for diagnosing problems in the first place (as compared to initializing multiple chains in the same region of parameter space)?
3 Likes

Could this be related to why the non-centered parameterization came about? If you do:

parameters{
  real mu;
  real<lower=0> sigma;
  vector[500] z;
}
transformed parameters{
	vector[500] v = mu+sigma*z ;
}
model{
  sigma ~ normal(0, 10);
  mu ~ normal(0, 10);
  z ~ std_normal();
}

Thereā€™s no longer the pinch in the plot:

2 Likes

Thatā€™s a very cool benefit of the non-centered parameterization! When we non-center, we are no longer initializing the elements of v on [-2,2] but rather we are initializing mu and sigma and simulating v down from the distribution implied by our inits for mu and sigma. I had suggested doing this outside of Stan and then passing the inits explicitly, but I had overlooked the native Stan implementation of the same thing (achieved by non-centering).

Iā€™d still be curious about thoughts regarding the general importance of choosing overdispersed inits. After all, sometimes the centered parameterization is better-behaved than the non-centered.

2 Likes

Iā€™m out of my depth on this one; @martinmodrak who might be a good dev to tag to chime in on an MCMC initialisation topic like this?

I think think this is a definitive yes.

My guess here (not really an expert) is that:

  • In a high-dimensional space (i.e. with many parameters) even fine-tuned inits are very unlikely to actually start in a region that contributes non-negligible mass to the posterior. I.e. in a well-behaved model ā€œgoodā€ inits are only marginally better than ā€œbadā€ inits.
  • Even if you luck out and start in a good region of the parameter space, the chain is not yet adapted and the first iteration can easily move you to a terrible place
  • Inits are important in models that are not well-behaved in that you need to init in a place that avoids pathologies (e.g. for something like a ~ normal(1e7, 0.0001) you may need to start with a close to 1e5 to avoid numerical issues with the way we compute the normal density).
  • Good inits can make your adaptation faster, especially if you have a lot of parameters where the default inits would be in the very tails. The typical example are measurement error models, where you would have stuff like x ~ normal(x_true, sigma) where x_true is a large vector of parameters and x_true can span very wide ranges.

Thatā€™s definitely an issue. However note that multimodality gets (usually) detected not only if you init in two different modes, but also if the other mode is discovered during warmup and the chain then adapts to the mode or when a chain attempts to traverse between the modes during sampling (in which case it often creates a divergence). So your odds are often no that terrible. I donā€™t doubt you could manufacture models with two very distant modes where the traversal is impossible, there are no divergences and one of the modes is very hard/impossible to hit by inits, but I would guess such models are quite rare in practice. Most multimodal models Iā€™ve dealt with had the multimodality arise due to a simple symmetry or just one or two parameters.

For any more depth, I would defer to @betanalpha

2 Likes

There are a few independent concepts being discussed here.

Firstly the Stan program being presented here implements hierarchal model that manifests a funnel degeneracy. The strong, nonlinear interactions inherent to the funnel are too much for just about every MCMC sampler (and limit the accuracy of non sampling-based computational methods). As @mike-lawrence notes this can be remedied with a non-centered parameterization of the same model. I am nearly completing of a hierarchal modeling case study that goes into gory details; it should be publicly available sometime next month.

The question presented in the first post, however, is one of equilibration. In particular will the Markov chain be able to find and then explore the typical set of the target distribution when itā€™s initialized outside of that typical set? Interesting most Markov chain Monte Carlo algorithms find the typical set (Iā€™m assume here a connected typical set so thereā€™s only one to find) very quickly ā€“ Hamiltonian Monte Carlo in particular is extraordinarily good at getting to the typical set even when initialized far away. The much harder problem is exploring the entirety of the typical set one the Markov chain gets there. In the case of the funnel most Markov chain Monte Carlo algorithms, including constant-metric implementations of Hamiltonian Monte Carlo, find the typical set of the funnel without any problems even when initialized very far away. Once they get there they run into obstructions that limit effective exploration within the typical set.

Then thereā€™s the separate question of diffuse initializations. The main utility of diffuse initializations is to give each Markov chain the best chance at seeing different parts of the target distribution and hence the best chance at encountering pathologies. If all of the Markov chains equilibrate to the same behavior then there are no indications of problems within the scope of the initializations. That, however, is only a limited result. Firstly the Markov chains could have missed something within that scope ā€“ the more Markov chains run the less likely this will be ā€“ and secondly the Markov chains may not see anything outside of that scope if their exploration is obstructed.

In other words if the Markov transition is effective then Markov chains can expand from a narrow initialization range to explore the entire typical set of the target distribution. If the Markov chains donā€™t expand then either the typical set is contained within that range or the exploration was obstructed; thereā€™s no way to differentiate between those two possibilities empirically. On the other hand if the Markov chains all contract into the same neighborhood much smaller than the initialization range then weā€™re less likely to have missed something.

4 Likes

I need to remember this more often; especially for those models (admittedly from years ago before I could expect myself to know better) where I ā€œsolvedā€ my sampling problems by setting init=0 šŸ˜¬

Playing around with initializations is dangerous but sometimes necessarily.

For example when the evaluation of the target density is fragile at more extreme model configurations (ODEs become stiff, power series converge too slow, etc) Markov chains that stray too far can become unstable and unusable. Initialization within the stable regions can avoid numerical problems but that will compromise the statistical validity of the Markov chains unless the unstable regions have negligible probability. Itā€™s up to the user to verify that.

Another example is multimodal likelihood functions ā€“ priors can suppress extraneous modes but they canā€™t remove them entirely. Initializations near an irrelevant mode will tend to get the Markov chains stuck near that mode regardless of how little probability that mode captures. Custom initializations can avoid these extraneous modes but again itā€™s up to the user to verify that they are in fact extraneous.

I think I am actually encountering this right now in a model discussed here. At least, in that case I find that initialization at/near the MLE of the parameter space keeps things from going haywire, but Iā€™m still working out why they go haywire specifically in the case of having more data in the first place.

IID data just amplifies the shape of the individual likelihood function. If each individual likelihood function is mildly multimodal ā€“ but not so much that the sampler canā€™t move between the modes and find all of the probability mass ā€“ then the product of likelihood functions can be extremely multimodal ā€“ such that the sampler gets stuck in bad modes due to initialization. In other words peaks get amplified and the valleys between the peaks get suppressed.

1 Like