Adaptive step-size method in place of Leapfrog

Hi all,

Has an adaptive step-size numerical method been considered in place of Leapfrog? There is little in the literature on this, but there is one paper on the topic (https://www.jstage.jst.go.jp/article/jsiaml/9/0/9_33/_article), based on Hairer and Söderlind's 2005 paper ("Explicit, Time Reversible, Adaptive Step Size Control"), which showed some interesting results. I think applying this method to an arbitrary problem is perhaps non-trivial, but I'm interested to know whether it has been considered for Stan.
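For concreteness, the flavor of that scheme, as I understand it, is roughly the following. This is a hand-wavy sketch only: the control function, constants and safeguards follow the paper loosely at best, and `sigma` / `grad_sigma` are illustrative monitor functions I have made up rather than anything the paper prescribes.

```python
import numpy as np

def reversible_adaptive_leapfrog(q, p, grad_U, sigma, grad_sigma, eps, n_steps):
    """Loose sketch of explicit, time-reversible step-size control
    (in the spirit of Hairer's scheme) wrapped around leapfrog.

    sigma(q) > 0 is a monitor: we want steps to behave like eps * sigma(q),
    i.e. small where curvature is high. Setting h = eps * sigma(q) directly
    would destroy time-reversibility, so a reciprocal step size
    rho ~ 1/sigma is instead evolved alongside the state, driven by
    G = d/dt (1/sigma) = -(grad_sigma(q) . p) / sigma(q)**2,
    which is odd in p: negating the momentum retraces both the states
    and the step sizes. Positivity safeguards on rho are omitted.
    """
    rho = 1.0 / sigma(q)
    for _ in range(n_steps):
        G = -np.dot(grad_sigma(q), p) / sigma(q) ** 2
        rho = rho + eps * G          # time-reversible controller update
        h = eps / rho                # step size actually used this step
        p = p - 0.5 * h * grad_U(q)  # half step in momentum
        q = q + h * p                # full step in position
        p = p - 0.5 * h * grad_U(q)  # half step in momentum
    return q, p
```

One structural caveat that already hints at the difficulty: a time-reversible controller cannot be asymptotically stable (damping a disturbance would break reversibility), so errors in rho are never forgotten, which I suspect is part of why applying this to arbitrary problems is non-trivial.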

Many thanks

Lee


Yes, and in initial experiments such methods almost always prove extremely fragile: the adaptation routine requires careful tuning before it can handle the specific range of curvatures needed to work reasonably well on any given non-trivial model.

Interesting, thanks for the insight Michael.

Most of the existing methods for simulating Hamiltonian dynamics come from the molecular dynamics literature, where implementations are highly tuned to specific problems. "Adaptive step size" doesn't mean "adapt the step size to any problem"; it means "adapt the step size to the particular distribution of curvatures encountered in this problem". Applying algorithms tuned for specific problems to other problems results in fragility, where the adaptation isn't able to generalize and yields poor performance.
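To make "the particular distribution of curvatures" concrete: on a quadratic potential U(q) = k q^2 / 2, leapfrog is stable only when the step size stays below 2 / sqrt(k), so a single global step size is hostage to the largest curvature the sampler encounters. A toy demonstration:

```python
def leapfrog_diverges(k, eps, n_steps=1000):
    """Leapfrog on U(q) = 0.5 * k * q**2 blows up once eps exceeds 2/sqrt(k)."""
    q, p = 1.0, 0.0
    for _ in range(n_steps):
        p -= 0.5 * eps * k * q
        q += eps * p
        p -= 0.5 * eps * k * q
        if abs(q) > 1e6:  # energy has exploded; stop before overflow
            return True
    return False

print(leapfrog_diverges(k=1.0, eps=0.1))  # False: 0.1 is well below 2/sqrt(1) = 2
print(leapfrog_diverges(k=1e4, eps=0.1))  # True: 0.1 exceeds 2/sqrt(1e4) = 0.02
```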

Because the nature of Stan is to let users specify an arbitrary problem while abstracting as much of the computation as possible, these methods are typically unsuccessful or would require way too much input from the user.

Leapfrog is nice because it's simple and yet symplectic (it preserves phase-space volume in a certain sense). Currently there are no well-established adaptively-stepped symplectic integrators, as few of the proposed ones have been battle-tested the way leapfrog has.
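For reference, one leapfrog step is tiny (a minimal sketch, assuming a unit mass matrix and a user-supplied gradient of the potential):

```python
def leapfrog(q, p, grad_U, eps):
    """One step for H(q, p) = U(q) + p.p / 2: half kick, drift, half kick.
    The map is symplectic and time-reversible: negate p, take the same
    step, negate p again, and you are back where you started."""
    p = p - 0.5 * eps * grad_U(q)  # half step in momentum
    q = q + eps * p                # full step in position
    p = p - 0.5 * eps * grad_U(q)  # half step in momentum
    return q, p
```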

Interesting. For me, the real issue is that MCMC puts constraints on the proposal (largely stemming from detailed balance, as I see it) that make life harder than we would like: there's nothing wrong with using leapfrog and, as Yi Zhang highlights, its simplicity is very appealing, but that doesn't mean we shouldn't aspire to do even better. More specifically, I believe it is possible to define adaptively-stepped symplectic integrators that address the fragility issue highlighted by both Lee and Michael, i.e. that can be readily applied to general problems without any problem-specific tuning.

However, I think we will need to move away from MCMC towards alternative numerical Bayesian frameworks to accommodate integrators that are not strictly time-reversible. It is probably not a coincidence that we (a team that includes Lee and me) are working towards integrating one such framework (SMC samplers, which don't require detailed balance and so can accommodate a richer set of numerical integrators) alongside Stan's existing back-end: watch this space!

Simon

Sequential Monte Carlo has its own problems, so it's not a good choice for generic inclusion (we don't include pure MH either!). You might have better luck with this discussion if you pointed to a specific algorithm you are interested in including. You might also want to check the mailing lists for the discussion around the inclusion of ADVI.

Thanks. We’ll happily read up on ADVI.

I'm intrigued to better understand the problems you feel SMC has. Note that "SMC" can refer both to particle filters, which are not generically applicable since they demand time dependence, and to SMC samplers, which can be configured as a drop-in replacement for MCMC and which, very interestingly I think, can remove the need for detailed balance. We've got a few papers out there on SMC samplers as a generic alternative to MCMC; see, for example: https://arxiv.org/abs/1905.10252. However, the specific algorithm (which uses an alternative to leapfrog) is something we are working on, rather than something we can point at now. Note that if it's not appropriate to post discussions related to potential long-term Stan development here, we'll happily not do so again: we're new here!
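To make the distinction concrete, here is a minimal sketch of an SMC sampler in the likelihood-tempering style. Every concrete choice below (the linear schedule, resampling at every step, the random-walk move) is a placeholder for illustration, not the configuration used in the paper:

```python
import numpy as np

def smc_sampler(log_post, x0, n_temps=50, step=0.5, seed=0):
    """Sketch of an SMC sampler on the tempered sequence
    pi_t(x) proportional to exp(beta_t * log_post(x)), beta: 0 -> 1.

    With a move kernel that leaves pi_t invariant and the L-kernel chosen
    as its time reversal, the incremental weight collapses to
    pi_t(x) / pi_{t-1}(x) = exp((beta_t - beta_{t-1}) * log_post(x)).
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)          # (N, d) array of particles
    N = len(x)
    logw = np.zeros(N)
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        lp = np.array([log_post(xi) for xi in x])
        logw += (b - b_prev) * lp          # reweight
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(N, size=N, p=w)   # resample (usually ESS-triggered;
        x, logw = x[idx], np.zeros(N)      # done every step here for brevity)
        for i in range(N):                 # move: one Metropolis step per particle
            prop = x[i] + step * rng.standard_normal(x[i].shape)
            if np.log(rng.uniform()) < b * (log_post(prop) - log_post(x[i])):
                x[i] = prop
    return x
```

Note that the "time" index here is just the tempering schedule: no sequential structure in the model itself is needed.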

Lee’s initial query was really aimed at understanding what appetite there was for an alternative to leapfrog and what experience the community has of trying such things. It seems our concerns about generic applicability and the challenges we’ve encountered with using alternative integrators are mirrored by those posting here. In some sense, that’s reassuring. That said, I would welcome views on the appetite for alternative integrators (assuming we could get them to “work”).

Cheers
Simon

Hi Simon,

Discussing possible alternatives here on Discourse is definitely the right place; there's no problem with doing that. There's plenty of appetite for alternative implementations that might be faster, but you're also unlikely to get much interest without a demonstration of robust, scalable behavior on a variety of problems, including non-toy problems. The computation in Stan's HMC/NUTS algorithm already provides lots of opportunities for parallelization by pushing the computational burden within-iteration, so demonstrating an improvement using a specialized SMC algorithm may be difficult.

Thanks for the clarity and steer.

In fact, we have already had some helpful input suggesting that we demonstrate performance on 8 specific problems, each of which Stan is currently configured to tackle. We will do that, using the large computers we have access to, with a view to showing off what's possible: using NUTS or NUTS-like proposals within an SMC sampler, so that we capitalise on the hard work that's been done to date, and offering an alternative backend that provides a sibling algorithm to ADVI and NUTS for those people who want to use an SMC sampler instead.

You are also correct to infer (if I read between the lines of your post correctly) that between-sample parallelization has been a focus of some of our work to date: that's mostly because such parallelization is problem-agnostic, because there are specific use cases that motivate (and are funding) our work which relate to exploiting such parallelism at scale, and because people perceive it to be harder than we have shown it is. I do think we could also exploit within-sample parallelization within an SMC sampler if that were of interest, though that hasn't been our focus to date.

For me, another interesting aspect is that, relative to MCMC, the SMC sampler has an additional design parameter, the L-kernel, which makes it possible to shrug off the need for detailed balance. This in turn opens the door to improved performance for a given proposal, and to improved proposals. Coming full circle to the initial post in this thread, one thing we are confident this would let us do is use an adaptive numerical integrator. There are other potential avenues to explore, and we're really interested in exploring them in collaboration with people involved in Stan's development.
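For anyone following along, the generic weight update in an SMC sampler (in the Del Moral, Doucet and Jasra 2006 formulation, with forward kernel $K_t$ and backward L-kernel $L_{t-1}$) is

$$
w_t \propto w_{t-1}\,\frac{\pi_t(x_t)\,L_{t-1}(x_{t-1}\mid x_t)}{\pi_{t-1}(x_{t-1})\,K_t(x_t\mid x_{t-1})},
$$

and it is valid for essentially any choice of $L_{t-1}$: the L-kernel affects the variance of the weights, not the correctness of the estimator, which is exactly what frees $K_t$ from detailed balance.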

We want to integrate what we are doing with Stan so that the user community can capitalise on our advances. We recognise the importance of doing that in a way that complements the extensive developer community’s historic, ongoing and future activities. However, we’re also coming to things afresh and therefore likely to ask “silly” questions and/or be behind the curve on certain thought processes. Hopefully we can get up to speed and become helpful quickly.

Many thanks again,
Simon

We recommend starting with the examples in the MCMC benchmarks repo. Those are the ones where we're confident we're getting the right answer with HMC. The evaluations aren't even particularly stringent. I'm not even sure they evaluate both X and X^2, whose sampling efficiency can differ wildly in HMC due to antithetic sampling.
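For concreteness, by evaluating both X and X^2 I mean something like the following crude single-chain check (an initial-positive-sequence ESS estimate, not the estimator Stan actually uses):

```python
import numpy as np

def crude_ess(x):
    """ESS from the initial positive sequence of the autocorrelations.
    Deliberately crude; Stan's ESS estimator is more careful."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    tau = 1.0
    for k in range(1, n):
        if acf[k] < 0:
            break
        tau += 2.0 * acf[k]
    return n / tau

# A negatively autocorrelated (antithetic-like) chain: draws of X mix
# beautifully while draws of X^2 are strongly positively correlated.
rng = np.random.default_rng(0)
z = np.zeros(10_000)
for t in range(1, len(z)):
    z[t] = -0.9 * z[t - 1] + rng.standard_normal()
print(crude_ess(z))       # ~ n (this estimator truncates at the first negative lag)
print(crude_ess(z ** 2))  # an order of magnitude smaller
```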

Am I understanding this abstract correctly (my emphasis)?

In this paper, we first propose a parallel MPI version of the SMC Sampler, including an optimised implementation of the bottleneck, and then compare it with single-core Metropolis-Hastings. The goal is to show that SMC Samplers may be a promising alternative to MCMC methods with high potential for future improvements. We demonstrate that a basic SMC Sampler with 512 cores is up to 85 times faster or up to 8 times more accurate than Metropolis-Hastings.

Given that Metropolis-Hastings run as 512 parallel chains should be roughly 512 times faster (as measured by ESS/second), and that on the usual 1/sqrt(ESS) scaling of Monte Carlo error "8 times more accurate" corresponds to only about 64 times the effective sample size, it seems like SMC is only about an eighth as fast as Metropolis-Hastings per flop.

NUTS is orders of magnitude faster than random-walk Metropolis per flop in even moderately high dimensions.

Thanks for asking—you’re in the right place for these discussions.

Yes. Those are the ones we will benchmark performance on.
Simon

PS: I don't know how to cut and paste content to make clear which subthread I am responding to… help welcome!

Hi Bob,

Re: speed-up… The focus of the paper is exactly on ESS per second, and on showing, for the first time that we have seen, and in the hardest setting we could find for achieving a speed-up, that generic SMC samplers can beat generic MCMC in terms of ESS/second. We don't discuss per-flop speed-up in that paper, and we don't explore realistic problems or decent proposals; we think we know how to integrate NUTS into an SMC sampler within Stan, but that's not a paper I can point at now (and, somewhat strangely, using complicated proposals actually improves the SMC sampler's efficiency). That will all come later, along with a more detailed comparative analysis; the specifics of how and when one algorithm wins are complicated to explain and seem to depend on a number of factors. Today, I can only say: "watch this space".

My feeling is that the paper is a point on a trajectory. It certainly isn't the end. We want to engage now so you can help us hone our argument and demonstrate our forthcoming advances in a way that enables the Stan community to see the merits of what we have done. Perhaps most usefully, the paper describes the configuration of the SMC sampler we are using and makes clear that we don't need any sequential problem structure, for example.

Best
Simon

Select the text you want to comment on and a “quote” button pops up. Clicking it inserts a quoted message like the one above.

If you run 512 parallel Metropolis chains, you should get roughly 512 times the effective sample size. That would beat your 512-core SMC implementation, which you report is only 85 times faster.

The only thing I can conclude from the abstract is that your SMC implementation is much slower than your Metropolis implementation per flop.

Yes, for today's worst case. Our near-term focus for Stan integration is on realistic problems and on exploiting, for example, NUTS.

More specifically, the SMC sampler in the paper would be much slower than parallel MCMC chains in the worst case we were focusing on, where you are probably correct that n parallel MCMC chains would give an n-fold speed-up. That assumes you can jump straight to the typical set and have a proposal with no burn-in issues (which we don't think is always the case). The specific "toy" problem in the paper is such that any proposal can do that. However, with more complicated proposals and problems (including those where efficient data-driven proposals are hard to come by, e.g. involving reversible jump), the picture changes. Explaining that in detail was something we considered out of scope for this particular paper.

Simon

PS: Note also that the paper describes an MPI-only implementation; using shared memory, we can do better, though the picture seems to get more complicated!

Symplectic or not, to me it's very hard to come up with an adaptive integrator that's universally applicable across the range of models Stan allows. There are models with only a few degrees of freedom that appear simple and yet can defeat adaptive RK45. So to me, an adaptive scheme hidden inside the Stan core, with knobs to turn, would make model diagnosis very hard.

I’m not sure I am parsing your post correctly, but I can assure you that we are working to develop something that is generically applicable and has no knobs to turn.

That’d be brilliant! I’ll be excited to try it out.

Great. That is exactly why we want to post here!
