Threaded AD

Hi!

… I still can’t believe it, but I think my idea from the last meeting to use thread_local storage for the AD stack simply works. Using it I have implemented the map_rect function in a threaded manner with the C++11 async and future facilities. See this branch.

To actually see it running you can run the unit tests in test/unit/math/rev/mat/functor/map_rect_serial_test.cpp.

Again… all I am using at this point are plain C++11 language features which are part of the official standard. This will work even on Windows!

So much for the good news… the bad news is that Apple’s clang++ compiler produces a binary which fails the unit test. I had to use g++ version 6 to make it work.

On preliminary examples I am seeing the usual speedups as seen with MPI. However, what is left to be done is to come up with a good thread pool implementation, as first tests make it clear that there is a lot of friction from thread creation and the like.

If someone else could confirm that what I did is sound… that would be great; I still don’t trust it (even though I have seen exactly matching results, etc…).

Best,
Sebastian

Yes, this sounds right! It just avoids OpenMPI completely, right?

Correct. No OpenMP or MPI at all. Only C++11 language features are used.

Are you saying this would work across machines or just for multi core?

Threading is typically just multicore, not cross-machine. I don’t know if there’s a way to compile straight C++11 threads to use OpenMPI.

It’s still 🔥🔥🔥!

That’s my understanding too!

Everything was designed with that in mind, so it should work. It worked originally, and we haven’t really changed any of the low-level autodiff pieces.

That’s really great.

Is thread spin-up and tear-down expensive enough to warrant pools? I thought those were largely out of fashion because of lower-level efficiency gains. I may have gotten details wrong or just been misled.

How are speedups compared to MPI? Of course, the nice thing is that this won’t need anything in the way of extra installs.

No. We still need the MPI facility to scale across machines. However, we should combine the techniques and use MPI to link different machines while using threads on a given machine. Threads should imply a lot less overhead in comparison to MPI.

Threaded multi-core processing should improve the performance scaling even more than MPI does. Currently MPI gives you linear speedups with cores for ODE problems, while hierarchical (non-ODE) models scale well only up to a certain point; with threads we should be able to improve scaling here (given AD is still the bottleneck).

For the time being I wanted to get a first prototype working and out there. See below for the code which I used. This is to get us going, but there are a few flaws with the async facility:

  • no control over the number of threads (you can only control how many async calls you generate)
  • the actual implementation is platform/compiler specific, so I have no control over whether a thread pool is used or not.
  • (not async specific) as I understand it, exceptions should not be emitted from the sub-threads. This could be the reason why Apple’s clang is failing and g++ 6 simply does a better job here. Following up on this and understanding the right behavior is one of my todos.

Again, if this works, then we should use this technique in a few more places:

  • we can almost double the speed of Stan by parallelizing the forward and backward time integration of NUTS across two processes
  • we should give users a much easier construct than map_rect, which needs packing + unpacking. My thinking here is that we can come up with a special target += parallel_add(... any expression ...); notation (see the sketch after this list). parallel_add would be translated into a lambda functor, so that in effect the user chunks his model with a series of parallel_add calls, and these define the blocks of work which are dispatched and calculated asynchronously. The beauty of concentrating on target is that we know all additions to it are fully independent of one another. This syntax would completely avoid the need for packing + unpacking, and it should be straightforward to implement since we can use the full flexibility of C++11. While this does not scale across machines like MPI, it would enable a much larger user base to apply this technique, since packing + unpacking is a real burden.
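To make the idea concrete, here is a rough, hypothetical sketch of what the C++ generated for two parallel_add blocks could look like. parallel_add does not exist yet, the log_lik_block_* functions are made up, and a real version would have to accumulate vars on the AD tape rather than plain doubles:

  #include <future>
  #include <vector>

  double log_lik_block_1() { return -0.5; }  // stand-ins for independent
  double log_lik_block_2() { return -1.5; }  // blocks of the model

  double accumulate_target() {
    std::vector<std::future<double>> blocks;
    // each parallel_add(...) would become a lambda dispatched asynchronously
    blocks.push_back(std::async(std::launch::async,
                                []() { return log_lik_block_1(); }));
    blocks.push_back(std::async(std::launch::async,
                                []() { return log_lik_block_2(); }));
    double target = 0;
    // all additions to target are independent, so order does not matter
    for (auto& block : blocks)
      target += block.get();
    return target;
  }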

So let’s cross fingers that I have done things right here.

Here is the code excerpt from map_rect_async which does all the magic:

  std::vector<std::future<matrix_d>> futures;

  // queue jobs: one asynchronous call per job
  for (int i = 0; i < num_jobs; i++) {
    futures.push_back(std::async(std::launch::async, ReduceF(),
                                 shared_params_dbl, value_of(job_params[i]),
                                 x_r[i], x_i[i], msgs));
  }

  matrix_d world_output(0, 0);
  std::vector<int> world_f_out(num_jobs, -1);

  // collect results; calling get() on a future blocks execution until
  // the corresponding job's result is available
  for (int offset = 0, i = 0; i < num_jobs; offset += world_f_out[i], ++i) {
    const matrix_d& job_output = futures[i].get();
    world_f_out[i] = job_output.cols();
    ...

I could not resist doing preliminary tests. ODE problems get the usual linear speedups. For my usual model (2cmt oral dosing hierarchical popPK) with J=100 patients and 200 iterations I am getting some neat speedups:

  • when solving with the matrix exponential I get ~6.5x speed increase
  • when solving with an analytical solution I get a ~4.8x speed increase

I have asked the scheduler to make 10 cores available, but I don’t think these were used ideally. One should probably block the execution somewhat: right now I create one async call per patient, and these should probably be grouped.

I don’t think that MPI scales quite as well; I will do that comparison once I am convinced that the threading performs well (and others have independently confirmed to me that it works).

The checks I have done so far:

  • cross-checking numerical results
  • running the Stan program in diagnose mode

all good so far.

I just pushed to stan-math a version of map_rect_async which blocks the work into chunks instead of submitting each job individually. The number of chunks is roughly equal to the number of threads which are going to fire up, and that can be controlled with the STAN_THREADS environment variable.
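For reference, here is a minimal, self-contained sketch of that chunking scheme; the helper names are made up, and the real map_rect_async of course differs in its job signature and result handling:

  #include <algorithm>
  #include <cstdlib>
  #include <future>
  #include <string>
  #include <vector>

  // read the intended thread count from STAN_THREADS, defaulting to 1
  int num_threads() {
    const char* env = std::getenv("STAN_THREADS");
    return env ? std::max(1, std::stoi(env)) : 1;
  }

  // split num_jobs jobs into roughly num_threads() async chunks instead
  // of launching one async call per job
  void run_chunked(int num_jobs) {
    const int chunks = num_threads();
    const int chunk_size = (num_jobs + chunks - 1) / chunks;  // ceiling
    std::vector<std::future<void>> futures;
    for (int start = 0; start < num_jobs; start += chunk_size) {
      const int end = std::min(start + chunk_size, num_jobs);
      futures.push_back(std::async(std::launch::async, [start, end]() {
        for (int i = start; i < end; ++i) {
          // ... run job i ...
        }
      }));
    }
    for (auto& f : futures)
      f.get();  // blocks; also rethrows any exception from a chunk
  }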

From first tests I am getting very nice speedups, I must say. Even an analytically solvable problem runs ~3.7x faster when using 4 threads.

There are still some todos (exceptions, tests), but let’s cross fingers this lands soon.

The main advantages are that direct in-memory access is faster than communicating through serialization and we don’t have to copy data.

Is this the kind of thing where you’re imagining a job queue with associated thread pool? The primitives should be sufficient to build a thread pool.

The no exceptions condition is problematic. We’d have to wrap the thread code in something to catch exceptions and communicate failure back up to the map_rect call.
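As a sketch of that wrapping (assuming raw std::thread rather than async, which already does this for us via the future; the function name here is made up): catch inside the worker and hand the failure back with std::exception_ptr:

  #include <exception>
  #include <stdexcept>
  #include <thread>

  void run_job_with_error_reporting() {
    std::exception_ptr worker_error;
    std::thread worker([&worker_error]() {
      try {
        throw std::runtime_error("job failed");  // stand-in job body
      } catch (...) {
        worker_error = std::current_exception();  // capture, don't escape
      }
    });
    worker.join();
    if (worker_error)
      std::rethrow_exception(worker_error);  // surface it to the caller
  }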

Not quite, given the way the doublings work. Half the iterations are on the last doubling in a single direction. And the comparisons to stop extending are made with the opposite ends, not just locally within the doublings. So I think that’d be tricky.

Also, I wouldn’t worry about this so much, because I think a single machine can be kept busy just executing the log likelihood in parallel.

Right. Unlike with MPI, we aren’t shipping data out to the processor, so we can just bind all the data in the closure (lambda abstraction).

Do you think that can work with the parameters, too?

I don’t know much about load balancing. It’s going to be a balance of keeping all the threads busy while minimizing communication and synchronization overhead.

This is great because I’d expect problems like race conditions to show up right away with this kind of thing.

What’s that comparing with? Is that against the base MPI implementation that’s already a bit faster than the non-MPI version?

I’m trying to figure out what the penalty is for using thread local qualifiers on the autodiff memory. It was around 20% when I first measured it ages ago. So I’m wondering whether there’s an offsetting speedup from the checkpointing (partial evaluation of nested partials).

By now I think we can very likely go ahead and merge this threading work soon. I would like to run two more models which I trust, and then I am good with it. In the meantime I read up on a few more details:

  • async, when launched with the std::launch::async policy, will always start a new thread with all thread_locals correctly initialized; see the first point under 2) in the reference.

  • async wraps the return value into a future; when that future is queried for its result using get(), it will either return the result or, if an exception was raised in the task, rethrow the exception at the point where get() is called; see the Exceptions point in the reference (and the small check sketched below).
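Here is a small self-contained check of that second point, independent of Stan: the exception thrown in the async task is stored in the future and rethrown on the main thread by get():

  #include <future>
  #include <iostream>
  #include <stdexcept>

  int main() {
    std::future<int> f = std::async(std::launch::async, []() -> int {
      throw std::domain_error("failure in sub-thread");
    });
    try {
      f.get();  // rethrows the stored exception here, on the calling thread
    } catch (const std::domain_error& e) {
      std::cout << "caught on main thread: " << e.what() << "\n";
    }
    return 0;
  }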

Moreover, the unit tests for map_rect do involve exceptions and these just pass for the async implementation.

We can go with just async and chunk the incoming work into as many pieces as the number of threads we want at once. This will not give us fine-grained queuing control, but it ensures that we utilize no more threads than we intend to. My preference is to stay with the high-level async facility, which seems to be hassle-free (managing threads directly yourself is a lot trickier, for example with respect to managing exceptions).

OK. map_rect_async is a great start.

Yes, we can also pack parameters into the lambda, as long as they are kept only as references there. In the implementation I have done so far, the sub-threads call value_of(some var) inside the sub-threads, and that works just fine. All you have to make sure is to create new var instances only where you want them to end up, since var creation puts them on the AD stack of the thread which makes the call.
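A minimal sketch of that pattern, with some_function standing in for the actual work and ignoring the gradient bookkeeping that map_rect actually does: the sub-thread only reads double values out of the shared var, and the var holding the result is created on the main thread so it lands on the main thread's AD tape:

  #include <future>
  #include <stan/math.hpp>

  using stan::math::value_of;
  using stan::math::var;

  double some_function(double x) { return 2 * x; }  // stand-in for the job

  var threaded_eval(const var& shared_param) {
    // capture the var by reference; the sub-thread only reads its double
    // value via value_of and never creates vars of its own
    std::future<double> f = std::async(std::launch::async, [&shared_param]() {
      return some_function(value_of(shared_param));
    });
    double result_dbl = f.get();
    // the var is created here, on the main thread's thread_local AD tape
    return var(result_dbl);
  }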

This is comparing single-core vs. 4-core use for the threaded AD case. I will do comparisons with MPI speedups in a bit. Juggling all these branches is tedious, and I first wanted to make sure that the threaded AD works correctly and fast, which it now does.

I haven’t seen any noticeable slowdowns yet. We simply have an AD tape per thread. All the C++ implementation has to do is use the thread-specific AD tape for var creations; that’s it. The only overhead is due to saving all results (including the gradients) in doubles, which we inject into the AD stack only in the main thread. This last step could even be sped up a bit more: right now big matrices are generated which store everything. What would be more elegant is to create operands_and_partials in the sub-threads and then call the build operation in the main thread (a delayed call of the build function). That may be even faster, but this memory management stuff is not really taking up a lot of time; at least that is my impression.

I think we should replace map_rect_serial with the asynchronous one. I may have a pull request for this ready soon.

It might be too early to answer, but do the interfaces need to do something? @ariddell, could we switch to threading instead of multiprocessing, or is this just in-chain parallelism? Will this land in 2.19?

Thanks
-Ari

This is in-chain parallelism.

This is the striking beauty of it: we only require standard C++11 features. Thus, the interfaces do not need to provide anything to make it work (much easier than MPI). These are the considerations I am aware of for the interfaces:

  • The compiler needs to be able to digest C++11… and run it fine. Apple’s clang seems to be an exception here, but I do plan to detect Apple’s clang and then ifdef-out the asynchronous part (see the sketch after this list). So programs will still work with Apple’s clang; they just won’t run with parallelism.

  • Right now the number of threads is controlled via the environment variable STAN_THREADS. This needs to be set, and it controls the maximum number of threads which are spawned simultaneously.
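A hedged sketch of the compiler detection mentioned in the first bullet; __APPLE__ and __clang__ are standard predefined macros, while the STAN_THREADS_DISABLED name is made up for illustration:

  // Apple's clang currently miscompiles the thread_local AD stack,
  // so detect it and fall back to the serial code path.
  #if defined(__APPLE__) && defined(__clang__)
  #define STAN_THREADS_DISABLED 1
  #endif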

If you ask me, it should be possible to land this in 2.18. However, given that threaded AD was long thought to be impossible, I do expect that we are going to thoroughly evaluate what I have done here. I am good with it, but others on the stan-core team need to come to the same conclusion.

However, if you want hassle-free parallelism on a single machine, then by now I would recommend to anyone interested the threading over the MPI approach (if a single machine is sufficient). The amazing thing is that this is rather easy to get going: just check out cmdstan/stan develop and my stan-math branch with the threads. For me things work with g++ 6.3, while Apple clang segfaults on me right now.

Long answer cut short: No, the interfaces don’t really need to worry about anything.

I’m going to mention a related issue here: it would be great if we could get basic parallelism working in addition to (or before) within-chain parallelism.

Models are not threadsafe. More specifically, autodiff in a single model is not threadsafe. This is tracked in https://github.com/stan-dev/math/issues/551. If this could be fixed (and there’s no major obstacle to it being fixed, just developer hours), this would help PyStan tremendously because we could support parallel sampling on Windows with ease.

For me, as an interface maintainer, fixing this is my #2 priority for Stan (#1 is getting ess, rhat, and stansummary into stan::services).

Thanks for the ref to the ticket; I wasn’t aware of it. Using the thread_local keyword of C++11 is exactly what I am doing, and it seems to just work fine… with gcc. For clang the situation is weird and I do not yet understand what is going on. Here are my findings from researching it and trying things out:

  • Apple clang and clang 6.0.0 (non-Apple) both show the same behavior
  • The tests which use threaded AD quit with an “illegal instruction: 4” message
  • However, if I compile with -O0 then things work! Starting with -O1 I am getting segfaults
  • People seem to do weird stuff: https://github.com/wang-bin/ThreadLocal
  • Or look here: https://reviews.llvm.org/D42742 (people say that __cxa_thread_atexit is not implemented by clang yet, only by gcc)

I think that I have done things according to the standard (though thinking so does not mean it really is so). Currently clang is not able to deal with this, but gcc is doing just fine.

Solutions I see:

  • ignore clang for now until they get this right
  • debug more
  • use something like Boost’s thread_specific_ptr or that ThreadLocal thing which I linked above (Mozilla-licensed… no idea if that is OK or not)

In any case, the performance gain is really cool.

@ariddell: What is actually the motivation for making Stan model classes thread-safe? I mean, you can just start different processes for different models or chains. To me the real value is in in-chain threads. Is this a PyStan-specific thing?

However, in any case, you may want to try out the branch I linked above on stan-math. As far as I am aware, the limiting thing for threading is (or rather, was) the AD stack. This should be solved now, as the stack is thread_local.

Good news: The branch now compiles and runs with g++ and clang++.

I noticed that the clang project itself uses thread_local techniques, like here:

So basically what this means for us is to add a static function context() to the AutodiffStackStorage. This function holds a static thread_local variable which is an instance of the AD stack, and all the function does is return a reference to this stack.

The rest of our code base then needs to access the stack through this function: instead of using ChainableStack:: as the prefix, we now use ChainableStack::context(). as the prefix.
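In code, the pattern looks roughly like this (a minimal sketch; the real AutodiffStackStorage of course carries the actual tape members):

  struct AutodiffStackStorage {
    // ... the actual AD tape members live here ...

    // returns the calling thread's own AD stack; the thread_local
    // instance is constructed on first use within each thread
    static AutodiffStackStorage& context() {
      static thread_local AutodiffStackStorage instance_;
      return instance_;
    }
  };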

This seems to do the trick. Everything looks like it is working. I also did run the tests in test/unit/math/rev/core and all is good.

From very basic performance tests I do not see any difference in speed. So the extra function call is likely inlined away by the compiler, such that there is no overhead incurred, I think (but confirmation from others would be great).

Should we continue discussion in a pull directly? Or a WIP pull?

@wds15 Launching separate threads to do parallel sampling turns out to be easier than launching separate processes (what we do now). I think this is true for most languages, not just Python. And having these threads share a single Stan model (and associated user-provided data) is completely natural.

One reason why this is the case: there’s a lot more operating-system overhead associated with launching a new process than launching a new thread.

There may be additional overhead, but one gains a huge plus in terms of robustness, as any failure will only affect the given process rather than all the threads. To me that is well worth the additional operating-system overhead.

Things may look different if you want to do httpstan or other similar stuff, I agree.