Using output from optimization algorithms to initialize sampler

I’m trying to speed up model convergence, and am considering finding the posterior mode using Stan’s LBFGS routine and then initializing the sampler chains at the mode, or at a small perturbation of it.

At first glance, I would expect this to be a relatively standard thing to do. However, I couldn’t find any information about people doing that (in this forum or elsewhere online).

Are there in fact advantages in initializing the sampler close to the mode? What are the drawbacks and caveats one should be aware of?
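
For concreteness, the workflow I have in mind is roughly the following (sketched with CmdStanPy; the file names are placeholders, and I rely on L-BFGS being Stan’s default optimizer rather than selecting it explicitly):

```python
# Sketch only: find the posterior mode with Stan's optimizer (L-BFGS by
# default), then pass it to the sampler as the initial point for every chain.
# "model.stan" and "data.json" stand in for the real files.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")

mle = model.optimize(data="data.json", seed=1)  # posterior mode / penalized MLE

inits = mle.stan_variables()  # dict: parameter name -> value at the mode
fit = model.sample(data="data.json", chains=4, inits=inits, seed=2)
```

To perturb each chain’s starting point instead, `inits` could be a list of per-chain dictionaries with a bit of noise added to the optimized values.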

I would say it is not worthwhile. The drawback that a lot of people overlook is that for models that are so complicated that the default initial values do not work, the mode can be very far from the mean and median. Indeed, the region around the mode can have essentially zero posterior probability.

If you are having problems initializing, then post the model and maybe someone can help you overcome them directly.


What Ben said, but for some models it works in the Craigslist-used-car sense, and it might give you a hint if you’re stuck on why your initial values are not working. So there’s no reason not to try it if you can’t find alternatives.

I think you are both misunderstanding my question. It’s more fundamental, not about a specific model. The default values work; what I’m missing is why they would work any better than the mode. Why isn’t the mode the default?

The particular case that made me think about this: I have a model that is so large I can only do about 100 HMC iterations, so I want to squeeze as much as possible out of those 100.

When @bgoodri said that the mode can be very far from the mean and median, he was referring to something called concentration of measure. And you should heed his warning: if your model is “big” (I understood this as high-dimensional; correct me if I’m wrong), chances are that starting from the mode is not going to do much to speed up convergence.
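
A quick toy illustration of what concentration of measure does to the mode (plain NumPy, nothing Stan-specific):

```python
# Toy illustration: for a d-dimensional standard normal the mode is at the
# origin, yet draws concentrate on a shell of radius ~sqrt(d), so in high
# dimensions the mode is far from where the posterior mass actually sits.
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 100, 1000):
    draws = rng.standard_normal((10_000, d))
    radii = np.linalg.norm(draws, axis=1)
    print(f"d = {d:4d}: mean distance from mode = {radii.mean():6.2f}, sqrt(d) = {np.sqrt(d):6.2f}")
```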

Because the added effort of finding it doesn’t pay off in high dimensions, and in low dimensions it usually doesn’t matter.


In terms of advantages to initialising the sampler close to the mode, it seems to work well for clustering models.

In the Stan User Guide, under the Clustering Models -> Multimodality section, it mentions: “the advice often given in fitting clustering models is to try many different initializations and select the sample with the highest overall probability. It is also popular to use optimization-based point estimators such as expectation maximization or variational Bayes, which can be much more efficient than sampling-based approaches.”

For the clustering model I’m working on at the moment, I’m using LBFGS init (in addition to Michael Betancourt’s advice about order constraining https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html), and it works very well at keeping the samples in one mode.
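
In case it helps, here is a rough sketch of how I combine the “try many initializations, keep the best” advice with mode-based inits (CmdStanPy; the file names are placeholders and the error handling is minimal):

```python
# Rough sketch: run the optimizer (L-BFGS by default) from several random
# starts, keep the run with the highest lp__, and use that optimum to
# initialize every chain. "mixture.stan" and "data.json" are placeholders.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="mixture.stan")  # e.g. declares `ordered[K] mu;`

best = None
for seed in range(20):
    try:
        run = model.optimize(data="data.json", seed=seed)
    except RuntimeError:
        continue  # this start failed to converge; try the next one
    if best is None or run.optimized_params_dict["lp__"] > best.optimized_params_dict["lp__"]:
        best = run

# Initialize all chains at the best optimum found.
fit = model.sample(data="data.json", chains=4, inits=best.stan_variables(), seed=1234)
```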

I’m a bit late to this, but you might want to browse this thread where I proposed the same idea. There’s some good discussion there: