Pseudo-extended MCMC

Does this community have an opinion on pseudo-extended MCMC? It seems like a simple tool that could be of use to this community, but I have yet to see anyone mention it — the paper’s examples are even written in Stan.

The results are quite encouraging, with the sampler navigating seemingly degenerate geometries with ease. Would this also help with geometries that induce divergences and the like?

If I had to voice some concerns, the main one would be the increased number of parameters, but maybe also the amount of bookkeeping required; it's fairly straightforward, but it would be nice if it could somehow be automated as part of the fitting process instead of bloating the Stan file.

Abstract:

Sampling from the posterior distribution using Markov chain Monte Carlo (MCMC) methods can require an exhaustive number of iterations to fully explore the correct posterior. This is often the case when the posterior of interest is multi-modal, as the MCMC sampler can become trapped in a local mode for a large number of iterations. In this paper, we introduce the pseudo-extended MCMC method as an approach for improving the mixing of the MCMC sampler in complex posterior distributions. The pseudo-extended method augments the state-space of the posterior using pseudo-samples as auxiliary variables, where on the extended space, the MCMC sampler is able to easily move between the well-separated modes of the posterior.
We apply the pseudo-extended method within a Hamiltonian Monte Carlo sampler and show that by using the No U-turn algorithm (Hoffman and Gelman, 2014), our proposed sampler is completely tuning free. We compare the pseudo-extended method against well-known tempered MCMC algorithms and show the advantages of the new sampler on a number of challenging examples from the statistics literature.

The post-hoc correction in Proposition 3.1 seems dangerous: if I really knew a good proposal q, naive importance sampling would already work. Of course, increasing N to infinity would overcome a bad proposal choice, but that is also impractical.

1 Like

I don’t think q necessarily needs to be a good proposal distribution in the IS sense. You apply HMC to the auxiliary distribution in (4) and then Proposition 3.1 tells you that if your sample from (4) is correct, sampling one of the particles/copies according to the probabilities given by 3.1 yields a sample from your original density. So if you can sample correctly from (4), proposition 3.1 seems like a harmless step?
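
To spell out how I read the construction (a rough R sketch; the pseudo-marginal form of the extended target and the function names are my own guesses from the paper's notation, not their code):

    # Assumed form of the extended target (my reading, possibly wrong):
    #   pi~(x_1,...,x_N) ∝ (1/N) * sum_i gamma(x_i)/q(x_i) * prod_j q(x_j),
    # with Proposition 3.1 then resampling one copy with probability ∝ gamma(x_i)/q(x_i).
    log_sum_exp <- function(v) max(v) + log(sum(exp(v - max(v))))

    extended_lpdf <- function(xs, log_gamma, log_q) {
      lg <- sapply(xs, log_gamma)   # log gamma(x_i) for each pseudo-sample
      lq <- sapply(xs, log_q)       # log q(x_i) for each pseudo-sample
      sum(lq) + log_sum_exp(lg - lq) - log(length(xs))
    }

    resample_copy <- function(xs, log_gamma, log_q) {
      lw <- sapply(xs, log_gamma) - sapply(xs, log_q)
      w  <- exp(lw - max(lw))
      xs[[sample(length(xs), 1, prob = w / sum(w))]]  # one draw from the original target
    }

If that reading is right, the resampling step only reshuffles probability mass among the copies within a single draw from the extended target, which is why it looks harmless to me.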

We do not need a perfect proposal, but when the ratio γ(x)/q(x) is far from 1, efficiency becomes a concern.

To be fair, I do not doubt it will work perfectly well in some low- or moderate-dimensional problems. We already have many solutions to multimodality, such as tempering, annealing, or adiabatic Monte Carlo. But my impression is that they do not adapt well to really hard problems involving multimodality; otherwise we would already be using Stan as a deep learning platform. It will be interesting to test where the limit is.

The difference is that in IS the weights are normalized over all samples, so mass can concentrate on a few values. Here, you normalize across the copies within each sample, so the copies collectively correspond to one true sample (assuming the HMC worked). If the ratio is far from 1 (roughly uniformly so across the copies), the particles will just be weighted roughly equally in that iteration, but not discarded.
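
To make the distinction concrete, here is a toy R sketch with made-up numbers (not the paper's code):

    set.seed(1)
    S <- 1000   # MCMC iterations
    N <- 2      # pseudo-samples (copies) per iteration
    log_ratio <- matrix(rnorm(S * N, sd = 3), S, N)  # stand-in for log gamma(x_i) - log q(x_i)

    # Pseudo-extended style: normalise within each iteration and pick one copy,
    # so every iteration still contributes exactly one draw.
    pick <- apply(log_ratio, 1, function(lr) {
      w <- exp(lr - max(lr))
      sample(N, 1, prob = w / sum(w))
    })

    # Plain IS style: normalise over all S*N values; mass can concentrate on a few draws.
    w_all <- exp(log_ratio - max(log_ratio))
    w_all <- w_all / sum(w_all)
    1 / sum(w_all^2)  # IS effective sample size, typically far below S when ratios vary a lot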

Given that (a) there are a ton of these alternative MCMC tweaks coming out, and (b) none of the ones we’ve tried in the past have worked in any generality, we are basically just waiting for someone to try these new proposed algorithms on a larger test suite of examples.

Multimodality isn’t really an issue that we can solve with sampling—there’s just no way to visit all the modes and properly integrate over the posterior. Andrew and Yuling and crew have been looking at doing things like stacking multiple runs that get trapped in modes, but they’re still working on that.

1 Like

From the abstract I see where the focus on multimodality comes from, but it strikes me that this might be good for more than that? In their later examples they demonstrate good mixing on some complicated but not obviously multimodal (albeit low-dimensional) geometries; my impression is that they provide auxiliary dimensions/directions along which the density is smoothed, making it easier to navigate. It might also help with cusp- or funnel-like features?

Are there other tweaks that improve the geometry of the problem via a similar augmentation scheme? I would be interested to hear about them, and, for that matter, also interested in hearing why it would not work.

The challenge is that MCMC is the best way we have to estimate a high-dimensional posterior. Most of the proposals I see don’t work well in high dimensions. It’s difficult to bolt something onto MCMC that efficiently learns the posterior geometry. So it becomes a chicken-and-egg problem, or a gradual improvement problem, like the adaptation we apply in Stan (now that’s something I could easily believe could be improved with a little work). It’s challenging to estimate posterior covariance, and even harder if it varies across the posterior, as in a hierarchical model.

If we could estimate the geometry well, we could adjust for it. But that’s a very tricky thing to do in even moderately high dimensions.

I don’t know anything about this particular paper, but have you noticed that all papers show examples where their system worked? The key qualification here is “low dimensional”. Even Metropolis and Gibbs and ensemble methods work in low dimensions.

This paper says it’s tuning-parameter free in the abstract, then later qualifies this by saying “aside from choosing the number of pseudo-samples, the pseudo-extended method is completely tuning free.” This kind of choice can be critical for efficiency and robustness, and they back that up later in the paper, saying “Figure 4 suggests that the number of pseudo-samples may be problem specific.”

What you need to be super careful about in general is making sure the results aren’t achieved by grad student labor on tuning parameters. For example, static HMC can look good in papers if you send your lab off hunting for good parameters for it (very compute intensive); but when you find them, it mixes almost as well as NUTS in a lot of cases. I’m not saying that’s a concern here, just pointing out it’s a common problem.

What we’re looking for is algorithms that run well without manual intervention on tuning parameters for the real-world applied stats problems our users are interested in. If someone could show this working better than NUTS (not just HMC) without human tuning on our dozen or so canonical models where we believe we know the correct answer, I’d take it more seriously as a candidate for Stan.

The real problem is that several of these improved HMC papers show up on arXiv every month and we (the Stan project) don’t even have time to evaluate them. One of the things we need to work on is a better test set to make it easier for others to test their algorithms in ways that will convince us to include them in Stan.

Also, nobody’s going to be able to tackle the combinatorially intractable multimodality, like that of general Ising models or multi-normal mixtures (aka K-means clustering), at least not without constructively proving P = NP. Even mentioning these problems is a red flag for me as nobody’s going to be able to do full Bayes for them. As expected, the authors only tried them in low-ish dimensions (28 is still pretty high dimensional).

In this specific paper, I was also disappointed that the authors compared vs. poorly tuned HMC rather than NUTS.

I’m not trying to be negative here and I only just scanned this paper to try to answer questions about why we’re usually skeptical.

4 Likes

Thanks for taking the time to respond. I’m admittedly a bit of an arXiv addict, but with that I am also fairly quick to discard papers that are incremental or overblown, so I only bring this paper up because I think (1) it doesn’t ring any of my arXiv alarm bells, (2) they implemented it in Stan, which makes it relevant to this community, and (3) my remaining concerns are best answered by this community.

If you are right that they are comparing with standard, poorly tuned HMC, I would agree that is a concern; since it’s implemented in Stan, I admit I assumed that they meant NUTS. They would have to be going out of their way not to use it.

The number of pseudo-samples is a hyperparameter, true, but my impression is that the gain in efficacy comes from each pseudo-sample smoothing the density geometry at the cost of increasing the dimensionality of the sampling space. NUTS/HMC does mix more slowly in high-dimensional spaces, right?

Edit: they are using NUTS throughout:
"We use the NUTS (Hoffman and Gelman, 2014) tuning algorithm for HMC as implemented within STAN (Carpenter et al., 2017) for both standard HMC and pseudo-extended HMC, with N = 2 pseudo-samples and both algorithms ran for 10,000 iterations, with the first half of the chain removed as burn-in."

Edit 2: with respect to multimodal navigation being NP-hard, I agree that this obviously doesn’t resolve that. Generalizing the intuition from Figure 1 to higher dimensions, I think the effect of the method is to connect the modes via a trellis-like structure; since that trellis is combinatorial in nature, navigating it is still NP-hard.

Yes, but not as much slower as you might think. The problem is more when the curvature gets tricky or there’s lots of correlation.

They must be taking the Betancourt position of just calling NUTS “HMC” then.

Sure. But the point is more whether you can estimate the geometry well in higher dimensions without an exponential increase in the number of samples required. That’s the usual killer for these algorithms.

But as I said, if someone can compare this on some real problems we care about in high dimensions, then we’d be more interested. As I keep trying to emphasize, we just don’t have time to evaluate all these algorithms.

My thoughts on this:

  • The examples really only use Stan and code written in the Stan language, so the approach is already available for anyone to use (Python is only used to call PyStan and for plotting).
  • They modify the target and add more variables (called pseudo-samples). Stan’s dynamic HMC is used in the normal way to sample from the modified target. They make the correction in generated quantities using importance-weighted resampling.
  • It seems that the usual Pareto k diagnostic could be used to diagnose the goodness of the post-hoc importance resampling (see the sketch after this list).
  • The increase in dimensionality equals the number of pseudo-samples, which in the 2D Stan examples was 2. Since Stan’s dynamic HMC scales well to higher dimensions, it shouldn’t be a problem to have a few dimensions more. (In some cases, due to the NUTS criterion, the sampling can even be more efficient with more variables, even if they are independent of the interesting ones. For example, sampling from a 4D unit Gaussian gives a roughly 50% higher effective sample size than sampling from a 2D unit Gaussian.)
  • It would have been useful to report the step sizes and treedepths with and without pseudo-extension (but the code is available, so it would be trivial to check these and learn more)
  • It seems the approach could also be useful for funnel-shaped posteriors. If I had a student in need of a project, I could ask them to test it with 8 schools.
  • Before trying to make something like this more automated, we would need many more different types of examples to learn whether it really works in general. Since the manual work of changing the code doesn’t look that bad, this could be one thing to try when you have a model with severe mixing problems due to funnels or multiple modes (and then please report the results here, too).
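
A hypothetical sketch of what I mean by the Pareto k diagnostic, assuming the per-pseudo-sample log ratios γ(x_i)/q(x_i) are saved in a generated quantity (I call it log_ratio here; that name and the fit object are mine, not from the paper):

    library(rstan)
    library(loo)
    lr <- extract(fit, pars = "log_ratio")$log_ratio  # iterations x N matrix of log ratios
    ps <- psis(as.vector(lr), r_eff = NA)             # Pareto-smoothed importance sampling fit
    pareto_k_values(ps)                               # k > 0.7 would flag unreliable ratios

Whether to pool the ratios over iterations like this or look at them per iteration is something that would need thought, but high k values would at least flag trouble.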
1 Like

I wonder what’s meant by “incremental”.

If a paper looks too incremental or too overhyped, it won’t get published. I’m sad about the too incremental looking ones getting rejected, because that’s the problem I usually have. Stan was certainly incremental—I designed the first version of the language by working out how I’d have to interpret the BUGS syntax in order to define a log density I could autodiff. It was only then that I realized it’d be easy to turn the whole thing into a general imperative programming language. But I didn’t have to invent that concept, either. Just glue a few things together.

On a less engineering note, NUTS is just an incremental improvement on HMC that autotunes some parameters. It’s clever and fairly subtle to prove it’s correct, but it wouldn’t have been possible without slice sampling and HMC as a baseline and a whole lot of work on autocorrelation and reversibility.

The big jumps are few and far between. Relativity theory may seem like a huge conceptual leap, but Riemann had already laid down the mathematical foundations for the differential geometry of curved spaces. You might trace the idea of a reference frame back to the Copernican revolution that put the sun at the center of the solar system.

And where would Riemann have been without Gauss? Hard to say.

1 Like

Your definition of “incremental” is in tune with mine. If @Bonnevie’s is the same, I struggle to see how that’d be reason enough to “discard” a piece of work. Stan’s discourse forum is hardly the place for a philosophical discussion on what good science is or isn’t, so I won’t extend this much further. But to my mind, there’s nothing wrong with “incremental” papers: I will welcome any solid contributions to scientific knowledge, however small, as long as they are, well, solid. It’s my personal, unsubstantiated opinion that the megalomania behind trying to make every paper look like a breakthrough is responsible for a lot of the pollution of the published record with things that don’t work.

Oops, it seems I poked a beehive by accident. Incremental might have been the wrong term to use, since it implies that we are building towards something, but it’s a term that gets tossed around when discussing publication standards in machine learning (deep learning in particular), where there seem to be a lot of papers making small unsubstantiated tweaks without properly documenting their effects. So in the language of @maxbiostat, replace it with “not solid” :) Again, based on my perusal of arXiv, I can only agree that hype is a significant problem, which I think falls under my “overblown” qualifier. We should probably support venues or formats (blogs? letters?) that encourage presentation of smaller contributions, to give researchers an outlet and to disincentivize cramming a small result into a conference/journal paper format.

1 Like

Since you doubt the expansion of the space is going to have a big efficiency impact on its own, I guess something must be going wrong with the geometry when employing more pseudo-samples, as you argue. Odd, since there definitely seems to be some support in the demos for it making the geometry better?

Yes, I definitely wouldn’t expect you to look into every small innovation, I appreciate the time you have already spent on this small digression - as noted, I only posted it here because I thought it was of particular concern to Stan.

Does the resampling incur any risks? Since it occurs on the pseudo-sample level, it seems we avoid the issues of e.g. IS where the samples can have very uneven influence.

One practical concern of mine is what to do in high-dimensional inference problems. In hierarchical or multivariate models the parameter count will be a lot higher; do you see any way the method could be applied to a subset of the parameters instead of the whole set?

I agree that it likely needs further validation, just not sure what problems aside from e.g. 8 schools would make for a convincing demonstration.

2 Likes

I think the resampling has the usual importance sampling risk. They use an implicit proposal which adapts a bit, so I guess it can work in a large number of dimensions for some models, but it may fail with just 20-odd dimensions for others. I’m not able to say more without seeing more experimental results.

I don’t understand the question; I guess there is a typo in there.

  • Mixture models would be a natural example, having both multimodality and funnels.
  • Gaussian process models with non-Gaussian likelihoods (I think James Hensman mentioned they have tested this in GPflow after finishing the paper).
  • Horseshoe and regularized horseshoe priors on a linear model with large p and small n.
  • Any model which gets a lot of divergences.

But in IS we draw samples from a distribution different from the target and then reweight to mimic the target, while here we end up with unweighted samples? I’m failing to understand how the risk creeps in.

The second bit was simply a question about whether this could be extended to bigger models without replicating the entire parameter set, i.e. whether it could be applied to a block of parameters instead. But I doubt it.

Top of page 7 of the article:

Samples from the original target of interest, π(x), are given by using a post-hoc correction step, where the samples from the extended target are weighted with weights proportional to γ(x_i)/q(x_i), for i = 1, …, N.

In the Stan generated quantities block:

  ratio = num - denom;                 // log ratios, presumably log gamma(x_i) - log q(x_i) for the N pseudo-samples
  weights = exp(ratio - max(ratio));   // exponentiate, subtracting the max for numerical stability
  weights = weights/sum(weights);      // self-normalise over the pseudo-samples
  index = categorical_rng(weights)-1;  // resample one pseudo-sample per iteration
  z = x[index+1];                      // z is the post-hoc corrected draw

This is resampling importance sampling. If one of the normalized weights is 1 and the others are 0, then all draws are just copies of one dynamic HMC draw from the extended distribution. Do you see the risk now?

1 Like

Oops, I got confused by the non-standard notation in the paper, and only now noticed that the weights are over the pseudo-samples within each MCMC iteration and not over the MCMC iterations, so it’s not RIS over the whole chain as I first thought, but it is still RIS within each iteration. I tested running their banana example and now I’m much more worried (and disappointed by the quality of the experiment in the paper):

fit2 <- stan(file = 'pseudo-extended_banana.stan', data = data)
Warning messages:
1: There were 1238 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: There were 1956 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 

Not good. Also, the Rhats are large and several n_eff < 10. The weights are mostly 0 or 1. The main focus variables have Rhat 1.27 and 1.09 and n_eff 6 and 22. Running the pseudo-extended code they provide on GitHub is not sampling from the target distribution.

The banana distribution with the specified parameters is difficult, and the sampling without the pseudo-extension fails too, IF the default adapt_delta is used and the divergence warnings are ignored! Running with the default adapt_delta gives the warning which recommends increasing adapt_delta. After increasing adapt_delta to 0.99, Stan is able to sample very well from the banana target distribution. Sampling time is less than 1s.

Increasing adapt_delta also helps the pseudo-extended sampler get rid of divergences, but the max treedepth exceedences remain, the Rhats are large, and the n_effs are small. The main variables seem to behave better, but the sampling time is more than 20 times longer.
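
For anyone wanting to reproduce this, the rerun is just the same call with a control argument, something like:

    fit3 <- stan(file = 'pseudo-extended_banana.stan', data = data,
                 control = list(adapt_delta = 0.99))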

Conclusion: the banana distribution experiment was badly done and doesn’t give support for the usefulness of the pseudo-extension. Other experiments may say something else, but for now I’ll work on something else. I’m happy to see more experiments made by others.

5 Likes