Threaded AD

math

#21

At least with PyStan, one failing process will take down the others.

Also, multiprocessing uses pickle with the “wrong” protocol, so it will fail on large data.

Also, multiprocessing needs special attention, which is not a good thing.

In short, CPU/GPU handling should go to the C++ side and filesystem stuff to the Python side.


#22

Isn’t this controlled by the users’ invocation of map_rect? That is, aren’t all the map jobs run in parallel?

I think map_rect_concurrent or map_rect_threaded would be a better name, as it’s not fully asynchronous. We’re waiting on a future at some point.

The algorithms need to know what the parameters are, at least the ode_integrate does in order to be able to create the coupled system. How do we do that with closure-bound parameters?

Yup. I usually just wind up with multiple copies of cmdstan floating around, then never know where to find the one I want since I’m not used to multiple copies.

That’s all it ever did, but there used to be a pretty big slowdown just for doing that.

Yes, and I hope we get there in the design. I think getting the API right and picking the low hanging fruit first is the right strategy for moving forward.

I would prefer to have this configurable, too. We want people who run four parallel chains in parallel on a four-core machine to not also try to multi-thread four ways within each of those chains.

Sounds like the right thing to do.

Is the idea that if we have STAN_THREADS = 4 and a map_rect call with 20 jobs, we make sure we only run 4 threads at a time?

Not at all. I designed for this from the very beginning. The reason we never turned it on before is that there wasn’t an easy way to do multi-threading cross-platform before C++11, we didn’t have the map_rect design with all the back-end work you did on a parallelizable function, and we were seeing big slowdowns while running multi-threaded. Then when I saw thread local was a thing in C++11 along with multi-threading, I made this issue before I really knew the details for C++11. Sebastian then sorted it all out!

Actually, the models are threadsafe now other than the line number counter. So that’s going to have to be made thread local, too, if we want reasonable error messages. If there is negligible overhead on this, then no big deal.

I’d still like all the thread-local declarations to be conditional on running multi-threaded code unless we can show there’s very low overhead to making thread-local declarations.

This is often a sign that you’re doing something like accessing illegal memory. The higher optimization levels are much better at reusing memory layout.

I’d rather avoid MPL if possible, as it’s a bit more restrictive than BSD.

If they’re not, running multi-threaded won’t work. They’ve been kept immutable after construction with one exception I’m aware of other than the line number.

That’s great news.

This is a better approach to singletons in C++11 anyway. I’ve been meaning to reconfigure everything in math to do this. Does thread_local work on local function static variables?

The compilers are very good at inlining simple function calls.

Absolutely. Multiple threads are way better when possible.

Exceptions that aren’t caught or seg faults are going to cause unrecoverable crashes in whatever we do unless we start engineering for failover.


#23

The way I am suggesting to implement this new map_rect_concurrent is to divide the N jobs from the user into T chunks. All T chunks are then sent off to be calculated in parallel at once using the async facility. So N will usually be a large number (N >> T), and T is something close to the number of cores.

Like usual: Using the type of the variable. We can refer to var instances in other threads as long as we just read from them. That works fine.

But do we really need to make it a compile-time choice? What I am proposing is to make this a runtime choice. So if STAN_THREADS is not defined or is set to <=1, then only a single thread will ever run (so no sub-threads). Whenever STAN_THREADS > 1, we would start to queue multiple threads and run things concurrently. I haven’t noticed any slowdowns, though we should check that thoroughly. However, if we can avoid keeping a serial and a concurrent version around, we should just go with the concurrent version. It makes maintenance easier.
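A sketch of how that runtime switch could look, assuming the setting arrives via a STAN_THREADS environment variable; `get_num_threads` is a hypothetical name, not the actual Stan API:

```cpp
#include <cstdlib>

// Read STAN_THREADS at runtime: unset, unparsable, or <= 1 all mean
// "run serially"; anything larger enables that many concurrent threads.
int get_num_threads() {
  const char* env = std::getenv("STAN_THREADS");
  if (env == nullptr)
    return 1;  // variable not defined: single-threaded
  const int n = std::atoi(env);  // atoi returns 0 on garbage input
  return n > 1 ? n : 1;
}
```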

Correct.

All I did was find out what others do here. The thread_local thing is really magic, and having seen some of the thread-nightmare code of others, I think there was a good reason to wait until C++11 brought all these wonderful high-level facilities (despite their limitations).

Yes, the code right now relies on a thread_local variable declared static inside a static function. This is what seems to work for g++ and clang++.

As discussed, let me file a pull with the threaded AD stuff first. Once we get that merged we can turn map_rect_serial into map_rect_concurrent.


#24

It’s not that bad :)


#25

That should be reasonable. The only concern is load balancing if one chunk winds up taking longer than the others.

So you mean walking over the expression graph in the AST and fishing out any uses of variables and building up a container of those automatically to pass in?

That’s great. I just saw an environment variable and thought it’d be static.

That’s one of the reasons the autodiff code’s set up the way it is. Back when we started Stan, I thought multi-threading would be very important. We just never had the map_rect infrastructure or portable C++11 threading lib before to make it useful.

Posix threads for serious applications? It’s pretty rough. It’s like numerical analysis in that it’s full of fussy side conditions and it’s like going to higher dimensions in that none of your single-threaded intuitions hold any more. Oh, and it’s really really really hard to trap and debug race conditions and other memory corruptions.


#26

I was thinking a single threadpool for limited applications like we currently are planning with futures.


#27

Uhm… I think the answer is yes, but I am not sure. To my understanding, if you pass const var& into sub-threads, then you will be safe. Thus, you can read in the sub-threads from the AD stack of the main process, but nothing more.

No, static makes no sense here. We should probably go one step further and define a value of “0” to imply the use of all available cores. Once the threaded-AD pull is in, I will file the threaded version of map_rect.
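A hedged sketch of that “0 means all cores” convention using std::thread::hardware_concurrency(), which itself may return 0 when the core count cannot be detected; `resolve_num_threads` is a hypothetical name:

```cpp
#include <thread>

// Map the requested thread count to an effective one: 0 means "all
// available cores", anything <= 0 otherwise (or 1) means serial.
int resolve_num_threads(int requested) {
  if (requested == 0) {
    const unsigned hw = std::thread::hardware_concurrency();
    return hw > 0 ? static_cast<int>(hw) : 1;  // guard against unknown count
  }
  return requested > 1 ? requested : 1;
}
```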

Yeah. It is funny that I had to embark on MPI to get a facility which works splendidly with threads. A nice “side-product” which I never thought of before.

We are not going for a threadpool to start with. A threadpool will probably be a good thing to implement at a later point. For the time being, I am suggesting to go with the async facility of C++11. What you get exactly with that is compiler/platform specific and is not under our control.
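A small illustration of that compiler/platform-specific point: std::launch::async forces the task onto another thread, while the default policy (async | deferred) lets the implementation decide whether to spawn a thread at all:

```cpp
#include <future>
#include <thread>

// Returns true when the task demonstrably ran on a different thread than
// the caller, which std::launch::async guarantees.
bool runs_on_other_thread() {
  auto id = std::async(std::launch::async,
                       [] { return std::this_thread::get_id(); })
                .get();
  return id != std::this_thread::get_id();
}
```

With the default policy, the same task could have run lazily on the calling thread at the point of `.get()`, which is exactly the part that is not under our control.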


#28

In no way am I suggesting complicating the current PR. I’ve implemented a threadpool a few times, and it’s something I can do later to replace the futures.


#29

Threading just got merged, so you can try it for PyStan.


#30

Can you point me to the PR # or the commit? I didn’t spot it immediately.


#31

This?

So if PyStan uses n_jobs to choose how many cores are available for processes, should n_threads be the same or not?


#32

The n_jobs parameter name is taken from scikit-learn. I don’t think we need to change the name.

Everything should be easy to change. We just stop using multiprocessing.Pool and start using the thread-based equivalent (maybe concurrent.futures.ThreadPoolExecutor?).

It’s wonderful to be able to support all platforms. This is a great change.


#33

@wds15 Can we use threads to fire off chains (that may or may not themselves use map_rect)?


#34

Yes!

We should be able to write C++ wrappers at the service level to do just that. I’m guessing this is the way we want to control threading going forward for both RStan and PyStan?


#35

It saves many copies of the data, so if there is no downside, then I would think so.


#36

Just to confirm: Yes, we can use threading for chains and this will work nicely together with a parallel map_rect version based on threads (although the MPI version should not be used in chains which are threaded).

One word of caution though: I have performed quick benchmarks for myself which confirmed that there was no performance penalty due to threading. However, before we roll this into all the interfaces, someone other than me should confirm that we do not see any performance hits. If we want to make the transition from one process per chain to threaded chains smooth, then we should develop the threaded-chain facility in parallel to what we have… otherwise our design choice to keep the no-thread option around (and as default) makes no sense.


#37

The map_rect_concurrent pull is up! Within-chain parallelization is very close to landing in develop.


#38

Just thinking that we might run into interactions with GPU. We’ll cross that bridge when we get there.


#39

The issue is that we need to know which parameters are being used so that we can construct the appropriate coupled system. I don’t know how else we could get it other than reading over the AST.

That seems like an odd convention. Given that it’s underlyingly a string, can we choose something else, or is that going to be too awkward for the interface type checks? I take it we can dynamically check the number of threads, as we do in RStan.

This isn’t threading at the algorithm level, it’s at the map function level. But I can see there’s interest for the future in making this multi-threaded. That was my original design intent, but then we found that there was a heavy overhead from thread local autodiff and then found that we were almost never memory bound, so we just went with a much simpler design.

I’m all for opening up threading to the chains level. Mainly because that’ll open up the option for better cross-chain adaptation methods. I’d really like to massively parallelize adaptation. It’s all about exploring the typical set well to estimate mass matrices (metrics).


#40

I think it will be fine. To my understanding, if a user ran a GPU function using map_rect_concurrent, then that function plus its chunk of data is going to be placed in the GPU job queue and execute normally. It may actually be a pretty good way to make sure we keep the GPU busy.