Parallel dynamic HMC merits

I think the biggest problem we face is balancing this

with

I don’t think it’s an obvious “of course” here. We could quite sensibly decide to prioritize high-end applications on clusters to solve hard problems.

Like me, you have 4 cores until you get a new laptop. Here’s how to check in R:

> library(parallel)
> detectCores(logical = TRUE)
[1] 8
> detectCores(logical = FALSE)
[1] 4

There are 8 “logical” cores because Intel’s 4-core chips present themselves as 8 cores via “hyperthreading”. Alas, that’s largely marketing, so you can’t run 8 jobs and expect 8-core performance.

This is the biggest question I have.

This is the kind of granularity that would be nice to control, or better yet, automate.

Macs lag behind their Windows and Linux counterparts in core count. I have a mid-2012 15" Retina MacBook Pro on which I’ve done 100% of my Stan development. It has 4 cores. Apple only very recently (as in months, not years) released an 8-core equivalent. I’ll probably upgrade.

< Ironic comment >
We need a sponsor for the Stan Dev Team to supply 8-core laptops! That way, more threading work will go into Stan. Intel provides the TBB; we should ask them to give devs laptops like this for free.
</ Ironic comment >

What is there to be cannibalized now? Are you talking about future approaches to parallelization that aren’t here yet?

Currently, just our map_rect function when implemented with threading (it’s also supported with MPI and with a serial implementation). With parallel HMC, we’d then be running the forward and backward trajectories in time in parallel, and within those simulations running parallel map functions.
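As a toy illustration of the kind of parallelism map_rect provides (a Python analogue with made-up names, not Stan’s actual C++ internals): the same function is mapped over rectangular shards of data on a thread pool, and the results are concatenated. These are the threads that speculative HMC would be competing with for cores.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def map_rect(f, shared_params, shards, max_workers=4):
    # Apply f to each rectangular data shard on a thread pool,
    # then stitch the per-shard results back together.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda shard: f(shared_params, shard), shards)
    return np.concatenate(list(results))

# Example: per-shard log-density contributions for a toy normal model.
shards = np.array_split(np.random.randn(1000), 4)
out = map_rect(lambda mu, x: -0.5 * (x - mu) ** 2, 0.0, shards)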

But there are also proposals to parallelize distribution code and bits of autodiff by employing program transforms in the language before code generation.

So I’m interested in how we’re going to control all this parallelization going forward.

1 Like

Hi everyone. I believe TensorFlow Probability has a working version of this idea:


If anyone is interested, Junpeng Lao and I (the creators) would be happy to present our approach and findings to you all.

One thing that is possibly different is that in addition to parallelization, we leverage numpy broadcasting semantics to some profit. Although our design is geared toward vector processing devices such as GPUs, it is parallelizable in a manner similar to the discussion here.
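As a small illustration of the broadcasting point (a generic numpy sketch, not TFP’s actual code): a leading chain dimension lets one vectorized update serve every chain at once.

import numpy as np

num_chains, dim = 8, 3
positions = np.random.randn(num_chains, dim)  # one row per chain
momenta = np.random.randn(num_chains, dim)
step_size = np.full((num_chains, 1), 0.1)     # per-chain step size

# Broadcasting stretches the (num_chains, 1) step size across the dim
# axis, so one vectorized leapfrog position update serves all chains.
positions = positions + step_size * momenta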

Bob: to your last point, we synchronize on evaluations of the user-supplied target log-prob function (and its gradient). This means that within a batch of chains, each chain may be at a different point in the NUTS algorithm (tree, to be precise). In so doing we avoid stragglers.
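For anyone who wants to try it, here’s a minimal sketch of batched NUTS in TFP; the toy target and tuning values are mine, not a recommendation.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
num_chains = 128

# Toy target: a standard normal in 3 dimensions.
target = tfd.MultivariateNormalDiag(loc=tf.zeros(3))

kernel = tfp.mcmc.NoUTurnSampler(
    target_log_prob_fn=target.log_prob,
    step_size=0.1)

# The leading batch dimension runs all chains together: every chain's
# log-prob (and gradient) is evaluated in one synchronized, batched call.
samples = tfp.mcmc.sample_chain(
    num_results=1000,
    current_state=tf.zeros([num_chains, 3]),
    kernel=kernel,
    trace_fn=None)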

7 Likes

Thanks Josh. I would also like to add that we did some extensive analysis of the recursive tree-building algorithm (there are a bit more details here). It turns out it is not necessary to pre-sample all the forward and backward steps up to the maximal tree depth; you only need to pre-sample the forward direction at the max tree depth and read from that instruction (you need to be careful to flip the sign in the backward tree). Also, you can achieve the same memory footprint as the recursive version by doing memory access read/write analysis.
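To sketch what “pre-sample at the max tree depth and read from the instruction” might look like (a schematic with my own variable names, not the TFP implementation):

import numpy as np

rng = np.random.default_rng(0)
max_tree_depth = 10

# One pre-sampled direction per doubling: +1 = forward, -1 = backward.
instruction = rng.choice([1, -1], size=max_tree_depth)

for depth in range(max_tree_depth):
    direction = instruction[depth]
    # Each doubling adds 2**depth leapfrog steps; when the instruction
    # says backward, the sign of the time direction is flipped.
    num_new_steps = 2 ** depth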

5 Likes

I agree. This puts me pretty firmly in the not-implement camp.

I did look at the code – the TBB code here looks pretty sweet for what it’s worth.

It seems pretty clear that if you are using the thread pool for something else, such as map_rect, then you shouldn’t be using it for speculative HMC. But that isn’t a reason not to implement/use speculative HMC when you are not using things like map_rect, particularly since map_rect can currently only be used with rectangular data structures.

1 Like

Also, for large models and users with a large number of CPU cores, this (speculative HMC vs. map_rect) probably won’t be a big problem, and if the TBB can manage the resources, this would be a great addition.

Yeah, but looking over the parallelism thread (https://github.com/stan-dev/design-docs/pull/5), plus the state of closures and the state of the new compiler, I think we’re getting close to some easy-to-use parallelism.

2 Likes

Thanks! I’m particularly interested in what you found by doing this.

Meaning you run a bunch of chains in parallel? Neat that they don’t have to be at the same point in the NUTS algorithm. I always thought massively parallel NUTS didn’t make sense the way massively parallel HMC does because of the difficulty of synchronizing across stages of the algorithm. But you can still get some chains finishing slower than others; it’s just that all the log density evals get parallelized, right?

Yep, and you identify the difficulty precisely: within a single NUTS step, some chains will finish before others, which makes parallelizing them difficult. Our strategy is to let the chains that have already finished (a U-turn or divergence appeared) keep doing leapfrogs along with the chains that are still going; it’s just that we discard those leapfrogs. Note that we still do the U-turn check and multinomial sampling, as these operations are also parallelized, but we don’t update the states.
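In numpy terms, the idea is something like this (a generic sketch with made-up names, not the TFP code):

import numpy as np

def masked_update(state, proposal, still_going):
    # Chains that hit a U-turn or divergence still pay for the leapfrog
    # that produced `proposal`, but the mask discards the result for
    # them, keeping every chain in lockstep.
    return np.where(still_going[:, None], proposal, state)

state = np.zeros((4, 3))
proposal = np.ones((4, 3))
still_going = np.array([True, False, True, True])
state = masked_update(state, proposal, still_going)  # row 1 stays put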

1 Like

Yes, please. I’m running a model now with 6 chains. Five of the chains completely finished sampling within 400s. The sixth chain is still doing warmup (fewer than 100 iterations) a few minutes after the other chains were done.

Hi! Thanks for posting, I’d definitely be interested in attending some kind of virtual talk about your and @junpenglao’s work.

Are you saying that you vectorize and parallelize across the forward and backward trajectories within a single chain as well as across chains?

We did not parallelize across the forward and backward trajectories within a single chain (is that even possible? I have never thought about it, but it might result in an interesting new dynamic termination criterion).

I think that’s what’s being described in the original post for this thread, right @wds15?

Yes. Forward and backward sweeps run simultaneously.

Right, but if I understand correctly, that’s kind of a workaround, as you don’t actually want forward and backward simultaneously, right? As you mentioned:

If the aim is for each chain to still follow the current tree doubling as in NUTS, the optimal way would be pretty close to what the unrolled NUTS in TFP is doing. Instead of simultaneously building both the forward and backward trees (2x more expensive in memory and computation), you only build a tree in one direction, but you choose the initial state and momentum depending on whether it is a forward or a backward tree, and “glue” the tree back onto the correct end of the trajectory after each tree doubling.
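A minimal sketch of that one-direction doubling (with a hypothetical leapfrog helper, not TFP’s implementation):

import numpy as np

rng = np.random.default_rng(0)

def leapfrog(state, direction):
    # Hypothetical stand-in for one leapfrog step in the given time direction.
    return state + direction * 0.1

def tree_doubling(trajectory, depth):
    direction = rng.choice([1, -1])
    # Build a single subtree from the trajectory end matching the
    # sampled direction...
    state = trajectory[-1] if direction == 1 else trajectory[0]
    subtree = []
    for _ in range(2 ** depth):
        state = leapfrog(state, direction)
        subtree.append(state)
    # ...then glue it back onto the correct end of the trajectory.
    return trajectory + subtree if direction == 1 else subtree[::-1] + trajectory

trajectory = [0.0]
for depth in range(3):
    trajectory = tree_doubling(trajectory, depth)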

Possibly relevant detail: we use tf.while_loop, which automatically parallelizes all independent calculations. (This is “easy” for TF to do since we’re building a static computation graph that can be analyzed.)
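A minimal sketch of that mechanism (a generic loop, not TFP’s actual NUTS loop): with parallel_iterations > 1, TF may overlap independent work from different iterations because the dataflow graph shows what depends on what.

import tensorflow as tf

num_chains = 4

def cond(i, positions):
    return i < 100

def body(i, positions):
    # One vectorized update for all chains at once.
    return i + 1, positions + tf.random.normal([num_chains, 3])

_, final = tf.while_loop(
    cond, body,
    [tf.constant(0), tf.zeros([num_chains, 3])],
    parallel_iterations=10)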