Intuition for why sampling is much slower when starting at max likelihood?

Just a curiosity really… I work with some complex models that can be tricky for both optimization and sampling approaches. Both as a robustness check on an optimization solution and as a way to speed up sampling, I have an option in ctsem to start sampling at a (relaxed-tolerance) maximum likelihood / maximum a posteriori solution. In some cases this is far, far slower than a random initialization (watching the log posterior plot, it takes a long time to drop down to the 'typical set' level), but the problem goes away when I initialize the chains at the maximum plus a tiny bit of noise, such that the log posterior starts at, or a little below, the level that sampling eventually settles at. I can't come up with a picture in my head for why this occurs, and I'd like one, please ;) By the way, this is not because the parameter values at max likelihood are a long way from typical values: the problems are constrained and integrated in a way that keeps them well behaved in this regard, and the parameter values at max likelihood are usually a good approximation of the centre of the posterior distribution.
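For a concrete picture of the lp__ gap itself, here is a toy numpy sketch with a d-dimensional standard normal standing in for the real posterior (purely illustrative, not the ctsem model): the log density at the mode sits roughly d/2 above the level that draws from the distribution actually attain, and the gap grows with dimension even though the mode is exactly at the centre in parameter space.

```python
import numpy as np

# Toy stand-in for the posterior: a d-dimensional standard normal.
rng = np.random.default_rng(1)

def lp(x):
    # Unnormalised log density of a standard normal
    return -0.5 * np.sum(x**2, axis=-1)

for d in (10, 100, 1000):
    mode = np.zeros(d)
    draws = rng.standard_normal((4000, d))   # draws from the target itself
    typical_lp = lp(draws).mean()            # lp level of the typical set
    print(f"d={d:5d}  lp at mode = {lp(mode):8.1f}   "
          f"typical lp ~ {typical_lp:9.1f}   gap ~ {lp(mode) - typical_lp:8.1f}")

# The gap is about d/2: starting at the mode means starting far above
# (in log density) where the sampler spends its time, even though the mode
# is a good approximation of the centre of the posterior in parameter space.
```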


I'm not 100% sure this is what you're seeing, but note that if a chain starts in a low-probability region, its tendency is to fall down towards the typical set. If a chain starts near the maximum, the only way for it to climb up to the typical set is for the initial momentum to be large enough to carry it that far. I wonder whether the momenta are too small in magnitude for your chains initialized near the maximum to climb quickly; it might take many iterations to accumulate enough energy to reach the typical set.
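To put rough numbers on that intuition, take a d-dimensional standard normal target with a unit mass matrix as a caricature of the real posterior:

$$
U(q) = \tfrac{1}{2}\, q^\top q, \qquad U(q_{\text{mode}}) = 0, \qquad \mathbb{E}\big[U(q)\big] = \tfrac{d}{2} \ \text{over the typical set},
$$

$$
K(p) = \tfrac{1}{2}\, p^\top p, \qquad p \sim \mathcal{N}(0, I_d) \ \Rightarrow\ K \sim \tfrac{1}{2}\chi^2_d, \qquad \mathbb{E}\big[K(p)\big] = \tfrac{d}{2}.
$$

Because H = U + K is conserved along a trajectory, the furthest a single trajectory launched from the mode can climb in potential energy is the kinetic energy it was given, so in the ideal case a fresh momentum draw supplies on average just enough energy (about d/2) to reach the typical set within one or a few iterations. If each trajectory only converts a small fraction of that kinetic energy into potential energy (for example because the step size and metric are still badly adapted early in warmup, so trajectories are short or badly scaled), the climb becomes a slow random walk in energy, and on the trace plot that looks like a log posterior taking many iterations to drop to the typical-set level.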

I've been playing in this space lately too (generating fast draws from the prior and then computing lp__ for each given the data), but what I do is rank my initialisation candidates by lp__ and grab those that rank closest to the median, which (assuming the candidate generation isn't pathological) should be closer to the typical set than the MAP. A sketch of the idea is below.
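Something like this, where `log_prob` and `draw_from_prior` are placeholders for however you evaluate the model's log posterior and generate candidate parameter vectors (the toy definitions here are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def log_prob(theta):
    # Placeholder: in practice this is the model's log posterior at a candidate.
    return -0.5 * np.sum(theta**2)

def draw_from_prior(dim):
    # Placeholder for fast prior draws of the model parameters.
    return rng.normal(scale=2.0, size=dim)

n_candidates, n_chains, dim = 200, 4, 50
candidates = [draw_from_prior(dim) for _ in range(n_candidates)]
lps = np.array([log_prob(c) for c in candidates])

# Rank candidates by lp__ and take the ones closest to the median rank,
# which should sit nearer the typical set than either the tails or the MAP.
order = np.argsort(lps)
mid = len(order) // 2
picks = order[mid - n_chains // 2 : mid - n_chains // 2 + n_chains]
inits = [candidates[i] for i in picks]
print("lp__ of chosen inits:", lps[picks])
```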

Could it also be that the gradient at the MAP is very different from the gradient in the typical set, making for bad initial estimates of the mass matrix? Ditto the step size.

It could be! But the gradient is also quite different in the tails of the posterior, far from the typical set, and yet chains started there still fall down into the typical set relatively quickly. Recall also that the mass-matrix adaptation is memoryless from window to window, and that the step-size adaptation is generally pretty fast. In fact, early on we want the step size to adapt to the local region of parameter space where we find ourselves: the common pattern when we initialize way out in the tails is for the step size to crash down to a pretty small value early in warmup, which lets the chains wander off and find the typical set without diverging all the time. The step size comes back up as we find the typical set, and especially as we find better estimates for the mass matrix.
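For intuition, here is a very rough sketch of what "memoryless from window to window" means in practice (not Stan's actual adaptation code; the window sizes and regularisation are made up): each window estimates a fresh diagonal metric from only its own draws and discards what the previous window learned.

```python
import numpy as np

def windowed_metric_estimates(warmup_draws, window_sizes=(25, 50, 100, 200)):
    """Rough sketch of memoryless windowed metric adaptation: each window
    estimates a fresh diagonal metric from only its own draws."""
    metrics, start = [], 0
    for w in window_sizes:
        window = warmup_draws[start:start + w]
        # Per-parameter variance plus a little regularisation so no
        # direction collapses to zero scale.
        metrics.append(np.var(window, axis=0) + 1e-3)
        start += w
    return metrics

# Fake warmup draws just to show the mechanics (two parameters, very
# different posterior scales).
draws = np.random.default_rng(3).normal(scale=[1.0, 10.0], size=(400, 2))
for i, m in enumerate(windowed_metric_estimates(draws)):
    print(f"window {i}: diagonal metric ~ {m}")
```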