Within-chain parallelization idea (maybe crazy)

Had a first pass at this, not compiling everything yet but just trying to sort out a few things. Instead of having a mutex around the uniform rng sampler I just have 1 uniform rng sampler per thread. That and I pulled this out to be it’s own class (base_parallel_nuts) and am writing the service API route rn.

@maedoc once we have the cmdstan API layer working for this could I ping you to try it out? (also if you can use a lot of vectors/ matrices and do lots of vectorized ops in your model code turning on --O1 in the latest Stan compiler may speed things up for you as well). Do you know how deep your average tree depth is per transition? The scheme here mostly help when you have a deep-ish tree depth ( > 6). Running multiple chains within one stan program using the cmdstan num_chains argument may help out as well since that will let each gradient calculation across all the chains share data.

@wds15 once I get the below passing some unit tests I’ll put up a PR and then make the cmdstan API

1 Like

sure I’d be up to try it out. Thanks for the tip.

Yes, 8 to 10.

Awesome!

I remember that I struggled with the rng. I came to the conclusion that the mutex does not hurt performance, since we very rarerly draw random numbers during sampling. The solution to have one rng per thread is not ideal in terms of reproducibility. Maybe sampling the left/right turns in advance is a better choice here… but we can revisit that at a later stage. How about adding a note to the source that we need to think about rng number handling a bit more?

Really cool to see this moving!

1 Like

If you want a model (+R code to generate data) that meets these criteria, see here (it’s fundamentally the SUG1.13 model, but highly optimized)

1 Like

For (1), we’ve found the key is breaking the data into chunks each of which requires a lot of work without requiring a lot of communication overhead past the data. The best example of this is differential-equation based models, such as the PK/PD models used in pharmacology, where we have to solve a differential equation for each data point. The number of parameters of the ODE systems is relatively small, so this shows nearly linear speedup with cores (up to around 32 cores, which is what @wds15 was testing with if I recall correctly).

I don’t understand point (2). A long Hamiltonian trajectory just evaluates the likelihood (and prior) a bunch of times. Is there a reason it wouldn’t have exactly the same relative speedup as for short chains?

1 Like

Hi Bob, for (2), in my experiments, the overhead of the internal machinery of HMC started to increase with the trajectory length. This results in the achievable speedup of likelihood parallelization diminishing as a consequence of Amdahl’s law.

@Red-Portal: That’s interesting. I don’t think we’ve ever measured per-iteration time as a function of tree depth. Nor am I really sure how to do that since the tree depth is controlled automatically by NUTS. Intuitively, each step is the same trajectory computation and the only additional overhead for NUTS is the RNG to determine direction of next expansion, recursion depth and the extra memory of \log_2 d for tree depth d. This is also making me wonder if there’s some kind of memory bug in our implementation of NUTS. Presumably you wouldn’t find the same thing with basic HMC, where there’s no intermediate state storage and thus no additional overhead for longer trajectories. Going backward and forward in time will also mess up memory locality compared to basic HMC.

@Bob_Carpenter That makes a lot of sense. I think it’s also an important research topic since NUTS/Dynamic HMC is pretty much industry standard now. I plan on investigating the various (empirical?) scalability issues with variants of HMC in more detail very soon. I’ll share if I find anything of interest to the Stan devs.

1 Like

@maedoc alrighty give this a whirl! From cmdstan you should be able to just do

git pull
git checkout feature/parallel-diag-e-adapt

# I can't remember if git is smart enough to grab the new 
# submodule so may need to
# cd ./stan 
# git pull 
# git checkout review1/speculative-nuts
# cd ..
make clean-all

# If you don't have Stan threads on run the below
# echo "STAN_THREADS=true" > make/local
# echo "O=3 -march=native -mtune=native" >> make/local

make -j4 build
make ./examples/bernoulli/bernoulli
./examples/bernoulli/bernoulli sample parallel_tree=1 \
 num_chains=4 data file="./examples/bernoulli/bernoulli.data.R" num_threads=8

That should work (at least it runs locally for me). Though note right now the only path that is supported is adapt diag_e. Once we have adapt diag_e working we can build out the service APIs for the other nuts options.

Part of me gets this could be true but the other part of me is sus about being able to ballpark how good/bad the mutex is. Since we have two branches now we can try out something like @mike-lawrence 's code with both branches to see if there is any noticable difference.

Even with the mutex though is there any hope of exact reproducibility between runs? By that I mean as we keep calling the rng when we build the tree, we cant decide for each execution whether thread 1, 2, or 3 goes off first for a forward or backward iter. Since that can change between runs idt there’s a way to be exactly reproducible.

As I understood it, we spend our computational time on calculating the gradient of the log likelihood. Nothing in that is random such that we only need randomness very rarely. Thus, during tree building of the NUTS sampler drawing random numbers is nothing you need a lot.

I am on board with letting reproducibility be a sacrifice for speed… as long as I have options. Ideally this option does not force me to not use multi-core speedups, but rather give a speed penalty. Much like reduce_sum will never be reproducible, but reduce_sum_staticis a bit less optimal in terms of speedup yet it is a huge gain in speed. In any case, we should for now just move ahead and make progress. I am the one asking for reproducibility so I am happy to think about it as we progress.

@stevebronder thanks! I tried it out locally, everything compiles and runs with the Bernoulli example. I then tried the following trivial high tree depth model, with the options sample parallel_tree=1 num_threads=2,

data { int N; }
parameters { vector[N] theta; }
model { target += sum(sin(theta)); }

Is it expected that (for a model with no use of map_rect or reduce_sum) CPU usage stays at 100%? My understanding was that for higher tree depths, top should show >100% CPU usage or I’ve misunderstood ? Is it then the case that the model/grad evals are serialised? Apologies in advance if this is just my naïveté.

can u try something with more cpu need? Maybe some log-gamma calls?

Yeah I think your like 100% right. tbh I think I might have cargo culted being anti-mutex, though I think the impl looks reasonable as it is.

Does the mutex scheme allow reproducibility though? My Q above is that if we have say 4 threads and are doing 4 depth searches of the tree, we can’t guarantee that any thread is going to call the uniform rng in any particular order across the parallel depth searches. Is that right? Though tbc I’m not saying the scheme I wrote is reproducible either somehow

No, the mutex does not give us reproducibility, that’s right. At the time I wanted the simplest thing to do in order to test speculative NUTS. I am only saying we should think about this problem down the road. For now need to settle on if we want this feature for Stan (community perspective) and then we decide about details like these.

@maedoc Oh sorry there was a bug! Can you try running the below? I also setup your example in the examples directory for easier testing. I’m getting 165% CPU usage with the below example so I believe it’s working now

git pull
git checkout feature/parallel-diag-e-adapt

# I can't remember if git is smart enough to grab the new 
# submodule so may need to
# cd ./stan 
# git pull 
# git checkout review1/speculative-nuts
# cd ..
make clean-all

# If you don't have Stan threads on run the below
# echo "STAN_THREADS=true" > make/local
# echo "O=3 -march=native -mtune=native" >> make/local

make -j3 ./examples/deep_ex/deep_ex
./examples/deep_ex/deep_ex sample parallel_tree=1  num_chains=1 data file="./examples/deep_ex/deep_ex.data.R" num_threads=2 output diagnostic_file="./diagnostics.log"

(it’s enough to git pull && git submodule update to get the ./stan submodule on the right commit, tho it doesn’t put it on the corresponding branch)

Thanks, that’s working and I see the same usage as you mentioned and a faster overall time for running the model. Increasing N leads to increased CPU usage on average, which is what I had understood intuitively. Two more questions: (a) this branch implements the parallel tree for the default adaptive metric, right? (Not sure with the e in diag-e means) (b) would parallel calls to an external user function be able to provide a thread id or something so that the external user function would know which thread the call corresponds to? Or perhaps in the C++ impl, there’s a thread local trick to use? (I can look myself later if it isn’t obvious)

1 Like

Why would you need to know that? Asking this question sounds like a dangerous thing to me! Your code should be written in a way that it executes in a running thread without such requirements. The user function called gets a thread to use and there is no more the function should need to know.

… but yeah… you can query a thread id from the Intel TBB which is used to manage the work load. So if you need that really, then look there. But be warned that this is likely a bad idea…maybe it’s not in your case, of course.

1 Like

I agree with your general sentiment. The implementation (for which I’m still sketching out the details) will push the parameters over MPI to several nodes for gradient evaluation. If those gradient evaluations are partly parallel instead of sequential, then I would double the number of nodes, and depending on the thread id, push to a different set of nodes, in order to distribute work correctly. Does that sound less strange? I would assume there’s a similar mechanism for sharding map_rect work as well.

Ok…this sounds plausible…but this way the thread sending out the work must wait in place for the results to get back. As such you loose the threads ability to work for you or is this taken care of? The tbb can suspend idle tasks an run other things in the meantime if you let the tbb do that.

Map rect is not so clever here. It was written before the threading stuff was included and as such won’t work well with threading. Essentially one thread will lock the mpi resources and give back control over it once the job is done in the context of that thread. All other threads will try to get a hold of the mpi resources, fail to get it and fall back to execute things serially.

We should have a mpi job dispatcher which takes work from any thread, maintains a queue and hands back results once they become available. Given the ever increasing cpu cores on a given machine, the need for that diminishes (but it would be cool, of course).