The negative correlation in the centered case is the classic funnel geometry. The global posterior mode in this parameterization sits in the neck of the funnel, but the neck is narrow enough that little posterior mass sits there.
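For intuition, here's a minimal sketch in R of why the mode sits down the neck. This is not the actual re_mod; I'm assuming a toy centered model of the same flavor, y_i ~ normal(theta_i, 1), theta_i ~ normal(0, sigma), sigma ~ normal+(0, 1), and walking down the neck with the random effects pinned at zero:

# toy centered parameterization (assumed model, not the actual re_mod)
set.seed(1)
n <- 1000
y <- rnorm(n, 0, sqrt(0.5^2 + 1))  # fake data consistent with sigma = 0.5

lp_centered <- function(theta, log_sigma, y) {
  sigma <- exp(log_sigma)
  sum(dnorm(y, theta, 1, log = TRUE)) +        # likelihood
    sum(dnorm(theta, 0, sigma, log = TRUE)) +  # centered random-effects prior
    dnorm(sigma, 0, 1, log = TRUE) +           # half-normal prior on sigma (up to a constant)
    log_sigma                                  # Jacobian for the log-sigma transform
}

# walk down the neck: theta fixed at 0, log(sigma) shrinking
sapply(c(0, -2, -5, -10), function(ls) lp_centered(rep(0, n), ls, y))

The log density keeps climbing as we slide down the neck, but the neck is a vanishingly thin sliver of the 1000-dimensional random-effects space, which is the mode-versus-mass tension described above.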
In the NCP, my instinct is to search for the cause of the pinch outside of the typical set, because initializations in or near the typical set don't manifest the pinch. It seems that the key issue somehow involves the interaction between sigma, the random effects vector, and the data. Above, @avehtari showed that initializing all unconstrained parameters on the interval [-.1, .1] avoids the pinch. A key thing to notice is that the resulting initial values for sigma fall in a range that still manifests the pinch when all parameters are instead initialized on the interval [-2, 2]. I think we can conclude that a key determinant of whether or not early warmup manifests the pinch is not necessarily the initialization of sigma, but rather the initialization of the random effects vector, which accounts for 1000 of the 1003 parameters in this model.
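(As an aside, for anyone wanting to reproduce the narrow initialization: in cmdstanr it can be requested via the init argument, which, when given a single positive number x, draws every unconstrained parameter uniformly from [-x, x]. A sketch against the same model and data objects used below; fit_narrow is just my label:)

fit_narrow <- re_mod$sample(
  data = re_data,
  chains = 4,
  save_warmup = TRUE,
  init = 0.1  # all unconstrained inits drawn uniformly from [-0.1, 0.1] instead of the default [-2, 2]
)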
If I had to make a guess about the pinch, here’s what I’d guess currently:
- When the random effects vector is initialized to nonsense, it's easy to see why early exploration might explore small values of sigma. That is, from the point of initialization, the partial derivative with respect to sigma is negative.
- As we move towards smaller values of sigma (for the sake of argument, holding the random effects vector fixed), the gradient gets much smaller but never becomes positive. Therefore, we might expect the Hamiltonian trajectory to fall down towards small sigma and then continue skating along the flat towards negative-infinite log(sigma) until a stopping criterion (either a u-turn or a treedepth limit) is triggered.
- Of course we aren't holding the random effects vector fixed; it's also evolving along the trajectory. But maybe for high-dimensional random effects vectors, this evolution is slow enough that sigma reaches its tiny values before the random effects vector can take on a configuration that would actually favor larger sigma. Note that this would explain why the pinch only shows up when the random effects vector is high-dimensional.
- Once we reach the tiny-sigma region, all gradients are weak. Changing unconstrained sigma doesn't do much of anything when sigma is tiny, and changing the raw random effects vector also doesn't do much when sigma is tiny.
- This also explains why initializing the random effects vector in a much narrower range also avoids the pinch: in these cases, the initial gradients with respect to sigma aren't so steeply negative. This also helps to explain why the pinch only shows up for high-dimensional random effects vectors, because only when the vector is high-dimensional would the gradient at a nonsense initialization be (very) steeply negative with respect to sigma (see the gradient sketch just below this list).
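To put some rough numbers on the first and last bullets, here's a minimal sketch (again a toy non-centered model, not the actual re_mod: y_i ~ normal(sigma * z_i, 1), z_i ~ normal(0, 1), sigma ~ normal+(0, 1)) comparing the gradient of the unconstrained log density with respect to log(sigma) at a wide "nonsense" init of the raw vector versus a narrow init, with sigma initialized identically in both cases:

# toy non-centered parameterization (assumed model, not the actual re_mod)
lp_ncp <- function(z, log_sigma, y) {
  sigma <- exp(log_sigma)
  sum(dnorm(y, sigma * z, 1, log = TRUE)) +  # likelihood with theta = sigma * z
    sum(dnorm(z, 0, 1, log = TRUE)) +        # standard-normal raw effects
    dnorm(sigma, 0, 1, log = TRUE) +         # half-normal prior on sigma (up to a constant)
    log_sigma                                # Jacobian for the log-sigma transform
}

# central finite difference for the partial with respect to log(sigma)
grad_log_sigma <- function(z, log_sigma, y, eps = 1e-5) {
  (lp_ncp(z, log_sigma + eps, y) - lp_ncp(z, log_sigma - eps, y)) / (2 * eps)
}

set.seed(1)
n <- 1000
y <- rnorm(n, 0, sqrt(0.5^2 + 1))

z_wide   <- runif(n, -2, 2)      # "nonsense" init, like Stan's default range
z_narrow <- runif(n, -0.1, 0.1)  # narrow init, like the [-.1, .1] range above
log_sigma0 <- 0                  # identical sigma init in both cases

grad_log_sigma(z_wide,   log_sigma0, y)   # steeply negative (order of -1000)
grad_log_sigma(z_narrow, log_sigma0, y)   # only weakly negative

The wide-init gradient is dominated by a -sum(z^2) term coming from the likelihood, which scales with the dimension of the raw vector, which is (I think) exactly why the pinch is a high-dimensional phenomenon.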
If the above is right, then it should be the case that specifying a much lower treedepth lets chains fall into the pinch, but avoids having them skate out to absurdly small values for sigma, because at low treedepth we will terminate the trajectory and resample the momentum before a trajectory gets a chance to do much “skating” across the flat towards minuscule sigma. I just ran this experiment, and the results are consistent with my hunch.
fit <- re_mod$sample(
  data = re_data,
  chains = 8,
  parallel_chains = 4,
  save_warmup = TRUE,      # keep the warmup draws so we can inspect them
  max_treedepth = 1,       # terminate every trajectory almost immediately
  iter_warmup = 20000,
  iter_sampling = 1,
  init_buffer = 19000      # step-size-only adaptation until iteration 19000; metric adaptation after
)
In this case, no chain ever explores sigma lower than 0.001. The chains all drop down into the region where sigma is between about 0.01 and 0.05, which I suppose is the best-fitting region for sigma before the random effects vector has a chance to equilibrate. In the traceplot below, the chains don't manage to start recovering until the metric adaptation kicks in after iteration 19000, but note that these 19000 iterations represent only about as many leapfrog steps as 19 iterations that saturate the default max treedepth of 10.
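For anyone who wants to reproduce the check and the traceplot from the fit above, here is a sketch using bayesplot (the warmup shading and log scale are just my choices):

library(bayesplot)

# warmup draws were saved above, so we can pull them back out
sigma_draws <- fit$draws(variables = "sigma", inc_warmup = TRUE)
min(sigma_draws)  # smallest sigma any chain visited during warmup + sampling

# traceplot of sigma with the 20000 warmup iterations shaded, on the log scale
mcmc_trace(sigma_draws, n_warmup = 20000) +
  ggplot2::scale_y_log10()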
