Handling multimodality in HMC/Stan - tempering & NUTS?

I’m working on a embedding model that seems to suffer from a large number of isolated local maxima. I’m currently trying to recover the original coordinates of a dataset generated synthetically, but chains initialized randomly get stuck in local maxima when compared to a chain initialized at the ground truth. At the moment, the solutions I obtain are not very good, but I expect that in some datasets there could be multiple decent embeddings and thus I would like the sampler to better explore the landscape.

For that reason, I’m currently diving deeper in the technical details of HMC and all its variants to understand how I could help the sampler cross low probability regions. Namely, I have come across the idea of tempering which looks helpful in my situation. However, the basic approach of increasing the mass during the first leap frog steps and then decreasing it doesn’t seem compatible with the dynamic adaptation performed with NUTS (there is no way to know in advance how many steps to do). Is there a way to mix NUTS and tempering?

I’m interested in any way to handle isolated modes in Stan, but if it’s not feasible I’m willing to work out the code myself.

Thank you in advance for your time.

From your description it sounds like you might be talking about the tempering during a trajectory approach described by Radford Neal in Section 5.7 of his MCMC using Hamiltonian Dynamics chapter in the Handbook of Markov chain Monte Carlo? If so I think you are correct that it would problematic using this scheme within a no-U-turn sampler (NUTS) or more specifically the dynamic multinomial HMC scheme currently used in Stan (which includes several improvements to the original NUTS algorithm), both due to the number of integrator steps in each trajectory not being fixed or known at the start of each trajectory, but also as the trajectory integrate both forwards and backwards in time.

There are a variety of other related ‘tempering’ schemes that can be used with a dynamic HMC algorithm however. Simulated tempering augments the original state the target distribution is defined on with an auxiliary (inverse) ‘temperature’ variable with a Markov chain then simulated by alternating transitions which update the original state for a fixed value of the inverse temperature variable with transitions which update the inverse temperature for a fixed value of the original state. A dynamic HMC algorithm can be used to implement the former transitions (of the original state given inverse temperature). This could potentially be performed using Stan if you included the inverse temperature variable in the data block and used an loop in the language you are using to interface with Stan to alternate sampling a single (non-adaptive) chain transition using Stan with the current value of the inverse temperature passed in as data, with transitions updating the inverse temperature variable (which in the original simulated tempering algorithm is discrete).

A potentially simpler alternative is the continuously tempered Hamiltonian Monte Carlo algorithm (disclosure: I am an author on the paper so I my viewpoint might be a bit biased here!). Rather than augmenting the original state with a discrete inverse temperature variable, the key idea here is to instead use a continuous inverse temperature variable. We can then jointly update both the original state and inverse temperature using a dynamic HMC algorithm targeting the augmented joint distribution. This means the sampling can be performed entirely within Stan without the need to manually deal with separately updating the inverse temperature variable.

A key issue in both the original simulated tempering algorithm and continuous tempering variant, is the need to define a suitable ‘base’ distribution that corresponds to the targeted distribution when the inverse temperature variable is equal to zero (with the original target distribution corresponding to the inverse temperature variable being one). A simple approach is to use the prior distribution as this base distribution, however in the common situation that the posterior distribution significantly differing from the prior this can lead to poor mixing of the chains along the auxiliary inverse temperature variable axis. An approach we suggest in the continuously tempered HMC paper is to instead using a simple approximation to the posterior, for example a Gaussian approximated fitted using variational inference, as the base distribution, with this often significantly improving performance.

Another related approach that I am less familiar with is Adaptive Path Sampling and Continuous Tempering - this shares some similarities with the continuously tempered HMC approach, but has a clever adaptive strategy for tuning the marginal distribution on the inverse temperature variable that (I think) helps address the issue I allude to above. @yuling may be able to give more details! Example code using Stan within this framework is available at GitHub - yao-yl/path-tempering: continuous tempering by path sampling.

An alternative class of Monte Carlo methods which are often well suited to working with multimodal distributions are sequential Monte Carlo samplers (SMCS) - the recent paper An invitation to sequential Monte Carlo samplers by Dai et al. is a very nice and comprehensive introduction. SMCS methods involve sequentially updating an ensemble of particles, on each iteration generating an approximation to one of a sequence of target distributions. Each iteration requires updating each particle by applying a Markov transition leaving the current target distribution in the sequence invariant, computing weights for the particles based on the ratio of their densities under the current and previous target distributions, and then resampling the ensemble based on these weights. The sequence of target distributions can be specified to correspond to a ‘tempering’ like scheme which geometrically bridges from a base distribution to a target posterior distribution of interest, but other ‘paths’ of distributions can also be used - for example by sequentially introducing likelihood terms corresponding to individual observations.

The Markov transitions leaving each target distribution in the path invariant can use dynamic HMC algorithms such as that implemented in Stan, an in fact there is an example in the invitation to SMCS paper of using Stan within a SMCS to target the posterior distribution of a susceptible-infected-recovered ordinary differential equation model (code available at GitHub - pierrejacob/smcsamplers). By requiring an ensemble of particles to be simulated SMCS methods can be somewhat computationally expensive compared to running a few chains with a standard MCMC method, but compared to the simulated / continuous tempering approaches described above, they are often much more robust to mismatch between the base and target distributions used, particularly if an adaptive scheme is used for choosing the next distribution on the path to use based on maintaining an lower bound on the estimated effective sample size.


matt: very nice summary!
just to echo matt on “adaptive path sampling”: it uses nuts x tempering with a continuous temperature. The linked GitHub link has a semi-automated implantation, which basically runs an augmented stan file iteratively.

I would also add another practical solution is stacking of multiple chains, which is to sample each mode and then average over them.


Potentially also relevant:


Thanks to all of you, this gives me a lot of approaches to try!