There are 8 “logical” cores because Intel’s 4-core chips present themselves as 8 cores via “hyperthreading”. Alas, it’s largely marketing: you can’t run 8 jobs and expect 8-core performance.
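(For anyone curious what their own machine reports, here’s a quick sketch; it assumes the third-party psutil package, since the standard library only exposes the logical count.)

```python
# Logical vs. physical core counts; psutil is a third-party package.
import os
import psutil

print(os.cpu_count())                   # logical cores, e.g. 8
print(psutil.cpu_count(logical=False))  # physical cores, e.g. 4
```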
This is the biggest question I have.
This is the kind of granularity that would be nice to control or better yet, automate.
Macs lag behind their Windows and Linux counterparts in core count. I have a mid-2012 15" Retina MacBook Pro on which I’ve done 100% of my Stan development. It has 4 cores. Apple only very recently (as in months, not years) released an 8-core equivalent. I’ll probably upgrade.
< Ironic comment >
We need a sponsor to supply the Stan Dev Team with 8-core laptops! That way, more threading features will go into Stan. Intel provides the TBB; we should ask them for laptops like this for free for the devs.
</ Ironic comment >
Currently, just our map_rect function when built with threading (it also has MPI and serial implementations). With parallel HMC, we’d then be running the forward and backward trajectories in time in parallel, and within those simulations running parallel map functions.
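To make the shape concrete: map_rect is essentially an embarrassingly parallel map over independent shards whose results are concatenated. Here’s a rough Python analogue (not Stan code; shard_log_lik is a made-up stand-in for the user-supplied mapped function):

```python
# Rough Python analogue of the map_rect pattern: apply the same function
# to independent shards of data in parallel and collect the results.
from concurrent.futures import ProcessPoolExecutor

def shard_log_lik(args):
    # Hypothetical per-shard computation over (params, x_int, x_real).
    params, x_int, x_real = args
    return sum(p * r for p, r in zip(params, x_real)) + sum(x_int)

def map_rect(shards, max_workers=4):
    # Shards are independent, so this map is embarrassingly parallel.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(shard_log_lik, shards))
```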
But there are also proposals to parallelize distribution code and bits of autodiff by employing program transforms in the language before code generation.
So I’m interested in how we’re going to control all this parallelization going forward.
Hi everyone. I believe TensorFlow Probability has a working version of this idea:
If anyone is interested, Junpeng Lao and I (the creators) would be interested in presenting our approach and findings to you all.
One thing that is possibly different is that in addition to parallelization, we leverage numpy broadcasting semantics to some profit. Although our design is geared toward vector-processing devices such as GPUs, it is parallelizable similarly to the discussion here.
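As a minimal sketch of what that batching looks like (the toy target, shapes, and step size are my own illustrative assumptions): giving the state a leading chain dimension and reducing the log-prob over the event axis lets every leapfrog evaluate all chains in one broadcasted call.

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Standard-normal target in 3 dimensions; reducing over the last axis
# means a [num_chains, 3] state broadcasts to num_chains log-probs,
# so each leapfrog step evaluates all chains in one vectorized call.
def target_log_prob(x):
    return -0.5 * tf.reduce_sum(tf.square(x), axis=-1)

num_chains = 64
kernel = tfp.mcmc.NoUTurnSampler(target_log_prob, step_size=0.1)
samples = tfp.mcmc.sample_chain(
    num_results=500,
    current_state=tf.zeros([num_chains, 3]),  # leading batch of chains
    kernel=kernel,
    num_burnin_steps=200,
    trace_fn=None,
)  # shape [500, num_chains, 3]
```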
Bob: to your last point, we synchronize on evals of the user-supplied target log-prob function (and its gradient). This means that, across a batch of chains, they can be at different points in the NUTS algorithm (in the tree, to be precise). In so doing we avoid stragglers.
Thanks Josh. I would also like to add that we did some extensive analysis of the recursive tree-building algorithm (there are a bit more details here). It turns out it is not necessary to pre-sample all the forward and backward directions up to the maximal tree depth; you only need to pre-sample the forward directions at the max tree depth and read from that instruction (being careful to flip the sign in the backward tree). Also, you can achieve the same memory footprint as the recursive version by doing a read/write analysis of the memory accesses.
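Here’s my paraphrase of the direction pre-sampling as a sketch (all names are mine, and this compresses a lot of detail):

```python
import numpy as np

rng = np.random.default_rng(0)
max_tree_depth = 10

# One pre-sampled direction per tree doubling for the whole iteration,
# rather than draws for every node of both the forward and backward trees.
directions = rng.integers(0, 2, size=max_tree_depth) * 2 - 1  # +1 fwd, -1 bck

for depth in range(max_tree_depth):
    direction = directions[depth]
    num_leapfrogs = 2 ** depth
    # Build the next subtree of num_leapfrogs steps in this one direction;
    # a backward subtree re-reads the stored forward instruction with its
    # sign flipped instead of drawing fresh directions for every node.
```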
It seems pretty clear that if you are using the thread pool for something else, such as map_rect, then you shouldn’t be using the thread pool for speculative HMC. But that isn’t a reason not to implement/use speculative HMC when you are not using things like map_rect, particularly since map_rect currently only works with rectangular data structures.
Also, for large models and users with a large number of CPU cores, this (speculative HMC vs. map_rect) probably won’t be a big problem, and if the TBB can manage the resources, this would be a great addition.
Yeah, but looking over the parallelism design thread (https://github.com/stan-dev/design-docs/pull/5) and considering the state of closures and the new compiler, I think we’re getting close to some easy-to-use parallelism.
Thanks! I’m particularly interested in what you found by doing this.
Meaning you run a bunch in parallel? Neat that they don’t have to be at the same point in the NUTS algorithm. I always thought massively parallel NUTS didn’t make sense the way parallel HMC does, because of the difficulty of synchronizing across stages of the algorithm. But you can still get chains finishing slower than others; it’s just that all log density evals get parallelized, right?
Yep, and you identify the difficulty precisely: for NUTS within a single step, some chains will finish before others, which makes parallelizing them difficult. Our strategy is to let the chains that have already finished (a U-turn or divergence appeared) keep doing leapfrogs along with the chains that are still going; it’s just that we discard those leapfrogs. Note that we are still doing the U-turn check and multinomial sampling, since these operations are also parallelized, but we don’t update the states.
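In numpy terms, the discard trick might look like this (a sketch with names of my own invention):

```python
import numpy as np

def masked_leapfrog_step(state, momentum, done, leapfrog):
    # All chains take the leapfrog step, so the batched log-prob and
    # gradient evaluations stay fully vectorized with no stragglers...
    new_state, new_momentum = leapfrog(state, momentum)
    # ...but chains that already terminated (U-turn or divergence) simply
    # discard the result, keeping their frozen state.
    keep = done[:, None]  # broadcast the per-chain flag over dimensions
    return (np.where(keep, state, new_state),
            np.where(keep, momentum, new_momentum))
```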
Yes, please. I’m running a model now with 6 chains. Five of the chains completely finished sampling within 400 s. The sixth chain is still in warmup (fewer than 100 iterations) a few minutes after the other chains finished.
We did not parallelize across the forward and backward trajectories within a single chain (is it even possible? I have never thought about it, but it might result in an interesting new dynamic termination criterion).
Right, but if I understand correctly, that’s kind of a workaround, as you don’t actually want forward and backward simultaneously, right? As you mentioned:
If the aim is for each chain to still follow the current tree doubling as in NUTS, the optimal way would be pretty close to what the unrolled NUTS in TFP is doing. Instead of simultaneously building both the forward and backward trees (2x more expensive in memory and computation), you only build the tree in one direction, but you pick the initial state and momentum depending on whether it is a forward or a backward tree, and “glue” the tree back onto the correct end of the trajectory after each tree doubling.
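A sketch of that gluing step, under my own naming (build_subtree stands in for running 2^depth leapfrogs from one endpoint):

```python
import numpy as np

rng = np.random.default_rng(1)

def one_doubling(left, right, depth, build_subtree):
    """Extend the trajectory by one doubling in a single sampled direction."""
    if rng.integers(0, 2) == 1:              # grow forward in time
        # build_subtree runs 2**depth leapfrogs from the right endpoint.
        return left, build_subtree(right, +1, 2 ** depth)
    # Otherwise grow backward: start from the left endpoint, then glue
    # the new subtree back onto the left end of the trajectory.
    return build_subtree(left, -1, 2 ** depth), right
```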
Possibly relevant detail: we use tf.while_loop, which automatically parallelizes all independent calculations. (This is “easy” for TF to do since we’re building a static computation graph that can be analyzed.)
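For reference, the basic tf.while_loop shape looks like this (the body here is a toy of my own; parallel_iterations bounds how many loop iterations TF may run concurrently when the data dependencies allow it):

```python
import tensorflow as tf

i0 = tf.constant(0)
x0 = tf.zeros([64, 3])  # e.g. a batch of 64 chain states

def cond(i, x):
    return i < 100

def body(i, x):
    # Ops on independent slices of x have no cross-chain data
    # dependencies, so TF's dataflow executor can run them concurrently.
    return i + 1, x + 0.01 * tf.random.normal(tf.shape(x))

_, x_final = tf.while_loop(cond, body, [i0, x0], parallel_iterations=10)
```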