Using output from optimization algorithms to initialize sampler

I’m trying to speed up model convergence, and am considering finding the posterior mode using Stan’s LBFGS routine and then initializing the sampler chains at the mode, or at a small perturbation of it.

At first glance, I would expect this to be a relatively standard thing to do. However, I couldn’t find any information about people doing that (in this forum or elsewhere online).

Are there in fact advantages in initializing the sampler close to the mode? What are the drawbacks and caveats one should be aware of?
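
For concreteness, the workflow I have in mind is roughly the following (sketched with CmdStanPy; the file names are placeholders, and I rely on L-BFGS being Stan’s default optimizer rather than selecting it explicitly):

```python
# Sketch only: find the posterior mode with Stan's optimizer (L-BFGS by
# default), then pass it to the sampler as the initial point for every chain.
# "model.stan" and "data.json" stand in for the real files.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")

mle = model.optimize(data="data.json", seed=1)  # posterior mode / penalized MLE

inits = mle.stan_variables()  # dict: parameter name -> value at the mode
fit = model.sample(data="data.json", chains=4, inits=inits, seed=2)
```

To perturb each chain’s starting point instead, `inits` could be a list of per-chain dictionaries with a bit of noise added to the optimized values.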

I would say it is not worthwhile. The drawback that a lot of people overlook is that for models that are so complicated that the default initial values do not work, the mode can be very far from the mean and median. Indeed, the region around the mode can have essentially zero posterior probability.

If you are having problems initializing, then post the model and maybe someone can help you overcome them directly.


What Ben said, but for some models it works in the Craigslist-used-car sense, and it might give you a hint if you’re stuck on why your initial values are not working. So there’s no reason not to try it if you can’t find alternatives.

I think you are both misunderstanding my question. It’s more fundamental, not about a specific model. The default values work; what I’m missing is why they would work any better than the mode. Why isn’t the mode the default?

The particular case that made me think about this: I have a model that is so large I can only do about 100 HMC iterations, so I want to squeeze as much as possible out of those 100.

When @bgoodri said that the mode can be very far from the mean and median, he was referring to something called concentration of measure. And you should heed his warning: if your model is “big” (I understood this as high-dimensional; correct me if I’m wrong), chances are that starting from the mode is not going to do much to speed up convergence.
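
A quick toy illustration of what concentration of measure does to the mode (plain NumPy, nothing Stan-specific):

```python
# Toy illustration: for a d-dimensional standard normal the mode is at the
# origin, yet draws concentrate on a shell of radius ~sqrt(d), so in high
# dimensions the mode is far from where the posterior mass actually sits.
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 100, 1000):
    draws = rng.standard_normal((10_000, d))
    radii = np.linalg.norm(draws, axis=1)
    print(f"d = {d:4d}: mean distance from mode = {radii.mean():6.2f}, sqrt(d) = {np.sqrt(d):6.2f}")
```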

Because the added effort of finding it doesn’t pay off in high dimensions, and in low dimensions it usually doesn’t matter.


In terms of advantages to initialising the sampler close to the mode, it seems to work well for clustering models.

In the Stan User Guide, under the Clustering Models -> Multimodality section, it mentions: “the advice often given in fitting clustering models is to try many different initializations and select the sample with the highest overall probability. It is also popular to use optimization-based point estimators such as expectation maximization or variational Bayes, which can be much more efficient than sampling-based approaches.”

For the clustering model I’m working on at the moment, I’m using LBFGS init (in addition to Michael Betancourt’s advice about order constraining https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html), and it works very well at keeping the samples in one mode.
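
In case it helps, here is a rough sketch of how I combine the “try many initializations, keep the best” advice with mode-based inits (CmdStanPy; the file names are placeholders and the error handling is minimal):

```python
# Rough sketch: run the optimizer (L-BFGS by default) from several random
# starts, keep the run with the highest lp__, and use that optimum to
# initialize every chain. "mixture.stan" and "data.json" are placeholders.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="mixture.stan")  # e.g. declares `ordered[K] mu;`

best = None
for seed in range(20):
    try:
        run = model.optimize(data="data.json", seed=seed)
    except RuntimeError:
        continue  # this start failed to converge; try the next one
    if best is None or run.optimized_params_dict["lp__"] > best.optimized_params_dict["lp__"]:
        best = run

# Initialize all chains at the best optimum found.
fit = model.sample(data="data.json", chains=4, inits=best.stan_variables(), seed=1234)
```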

I’m a bit late to this, but you might want to browse this thread where I proposed the same idea. There’s some good discussion there: