Thanks to @sakrejda, the PR for the map_rect_concurrent implementation is reviewed and ready to be merged. However, before hitting “merge” I wanted to get everyone informed and on board with the decisions taken. In particular:
map_rect_serial is dropped as the concurrent version will simplify to a serial version if STAN_THREADS is not defined during compilation.
The number of threads created is (roughly) controlled by the environment variable STAN_NUM_THREADS (just as OpenMP uses OMP_NUM_THREADS).
Threading is only ever used if the code is compiled with STAN_THREADS defined
Not setting STAN_NUM_THREADS or setting it to a non-numeric value results in a single thread being used
Setting STAN_NUM_THREADS to a positive value means to use that many threads (at most)
Setting STAN_NUM_THREADS to -1 means to use as many threads as we have CPUs on the machine
Having fewer jobs to execute than threads means the number of threads is reduced to the number of jobs
Anything else will just use 1 thread
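For concreteness, here is a minimal sketch of how these conventions could translate into code. This is not the actual stan-math implementation; the function name and details are hypothetical:

#include <algorithm>
#include <cstdlib>
#include <string>
#include <thread>

// Hypothetical helper illustrating the STAN_NUM_THREADS conventions above.
int resolve_num_threads(int num_jobs) {
  int num_threads = 1;  // unset variable => single thread
  const char* env = std::getenv("STAN_NUM_THREADS");
  if (env != nullptr) {
    try {
      const int requested = std::stoi(env);
      if (requested > 0)
        num_threads = requested;  // use at most this many threads
      else if (requested == -1)
        num_threads = std::thread::hardware_concurrency();  // all CPUs
      // any other value falls back to 1 thread
    } catch (const std::exception&) {
      // non-numeric value => 1 thread
    }
  }
  // never run more threads than there are jobs
  return std::max(1, std::min(num_threads, num_jobs));
}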
I hope everyone is fine with these conventions. I suggest discussing this on Thursday (unless everyone is happy with this right away).
Note that because I am using the C++11 async facility, I can only control how many chunks of work I generate. This should translate into how many threads are created, but I have no control over that, since it is decided by the implementation. This is why the number of threads quoted above corresponds to the maximal number of threads (maybe we can improve on that with a thread pool in the future).
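To illustrate the chunking idea, here is a small standalone sketch; this is not the actual map_rect_concurrent code and the names are made up. It splits the jobs into chunks and hands each chunk to std::async; whether each chunk actually runs on its own thread is up to the C++ implementation:

#include <future>
#include <iostream>
#include <vector>

// Toy stand-in for the per-job work done inside map_rect.
double run_job(int job) { return 0.5 * job; }

int main() {
  const int num_jobs = 10;
  const int num_chunks = 4;  // roughly what STAN_NUM_THREADS controls

  std::vector<std::future<double>> futures;
  for (int c = 0; c < num_chunks; ++c) {
    const int begin = c * num_jobs / num_chunks;
    const int end = (c + 1) * num_jobs / num_chunks;
    // default launch policy: the runtime decides whether this chunk gets
    // its own thread or is evaluated lazily in the calling thread
    futures.push_back(std::async([begin, end]() {
      double sum = 0.0;
      for (int job = begin; job < end; ++job)
        sum += run_job(job);
      return sum;
    }));
  }

  double total = 0.0;
  for (auto& f : futures)
    total += f.get();  // blocks until the chunk is done
  std::cout << "total = " << total << "\n";
}

(Compile with -std=c++11 -pthread on g++/clang.)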
… and yes, within-chain parallelization in Stan is now super close to landing.
@Bob_Carpenter, @syclik, @seantalts … comments on this or can we merge? Let me know before the weekend if possible so that the PR can go in. I am really looking forward to that.
Do you have any guidance on what sort of speedup this can give and what sort of overhead it has? How slow does each leapfrog step need to be before this is worth using? How many threads can I start before the overhead is no longer worth it? Even one or two examples would be helpful in determining if this is appropriate for my models.
I have posted a few case studies on the forum which used MPI instead of threading. I would guess that threading has similar properties, but we still need to find that out.
So maybe search the forum for MPI / speedup / wds15 and you should find some material.
My guess is that whenever your model evaluation time crosses the 1h mark, this approach is helpful if the structure lends itself to it. By that I mean that you can parallelize over some natural unit in your model and the likelihood contribution per unit is costly to calculate. A generalized linear model with lots of data is an example where I would guess that you do not get much of a gain… although I would not rule it out in case you work out a good blocking of the data.
I have worked out examples where ODEs are solved per patient and you get almost linear speedups in the number of CPUs. For analytically solvable pharmacokinetic problems you also get very nice speedups. When it’s just about evaluating a generalized linear model, then things are different.
Yes, it’s about the time it takes to calculate the log likelihood and its gradient. This is practically the time per leapfrog step, yes.
Since it is a bit of a pain to use map_rect due to the need for packing and unpacking of the parameters and the data, I was referring to the total time, which is what matters to a user.
I’ve built and tested the develop branch (namely the map_rect_concurrent_test) with g++ 4.9.3 on Windows, and successfully compiled and run the mapped logit example from the manual, included below.
However when I set STAN_NUM_THREADS to anything but 1, CmdStan does not give the correct output. Instead the program prints the CmdStan control arguments, terminates, and saves them to the output file. There are no samples, progress messages or other errors.
Any ideas where I might be going wrong?
Cheers,
Andrew
// Mapped logit example
functions {
  vector lr(vector beta, vector theta, real[] x, int[] y) {
    real lp = bernoulli_logit_lpmf(y | beta[1] + to_vector(x) * beta[2]);
    return [lp]';
  }
}
data {
  int y[12];
  real x[12];
}
transformed data {
  // K = 3 shards
  int ys[3, 4] = { y[1:4], y[5:8], y[9:12] };
  real xs[3, 4] = { x[1:4], x[5:8], x[9:12] };
  vector[0] theta[3];
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  target += sum(map_rect(lr, beta, theta, xs, ys));
}
Now this should work on Windows, but I did not have a chance to test this. Let me try to get this going on Windows so that I can see the issue. In the meantime… is a newer compiler available to you on Windows? I guess you use RTools for Windows, is that right?
Since we are using vanilla C++11, this has to work somehow on Windows.
Ok, now I know what you mean. So it looks like the RTools MinGW compilers DO NOT support threads. It seems that there are MinGW distributions which support threading out of the box. However, we should probably try to get RTools for Windows running with this. A solution could be to get this here up and running:
If you want to try that, then please report back your progress. I will look into it when I find a calm moment. Let’s see. Windows… argh!
It wasn’t immediately clear how best to use these headers, but I tried adding -isystem /path/to/mingw-std-threads to the CmdStan makefile and rebuilding, which did not work. I’m using Rtools 3.4, which looks like the latest stable release (3.5 is available though).
I can report that threading works as intended on g++ 5.4.0 on Ubuntu 16.04 (which is probably the more important use case for me, Windows was a distraction). I did have to add -pthread to the CXX_FLAGS for it to compile properly.
I’ve been playing with this on our server and it’s working great. A dummy ODE model goes from ~60 sec on one thread to ~15 sec with 10 threads.
One question, one comment:
Is there a rule of thumb for splitting up the data? With MPI it makes sense to have # shards = # cores, but I’m not sure if threading shares the same overheads.
The elapsed time counters are truly out of whack when doing parallel computation. The 15 sec run reported a total runtime of 97 sec (I used a stopwatch).
My advice is to have the data split align with the structure of your model. Hierarchical models are easy in this sense (split by the unit over which your hierarchical model is defined; if you have multiple levels, then choose the level of your hierarchical model at which to split). Internally, the split-up data set is chunked over as many workers as you give to Stan. Thus, the ordering of your data will determine how the work distributes over the chunks. On average the work per chunk should be the same if your data is not ordered in a particular manner. So keep an eye on the order of the jobs if you want it super efficient (I would start with a random order).
We have not combined MPI with threading yet and I haven’t tested it. However, if you want to, you can actually do that already now using nested map_rect. MPI will always be the outer parallelization engine, and if you nest a map_rect call into that, you will split by threads. However, I would not bother with that at the moment, but let me know what happens should you try.
The time reported by CmdStan is CPU time, not wallclock time. So on a single core things take 60s of CPU time, while the accumulated CPU time is 97s when using 10 threads. This shows the additional overhead due to threading (ok, it only roughly aligns, but this is my understanding of it). We will probably need to change the reported time from CPU time to wallclock time. When I run benchmarks, I use the command “time” to get those numbers.
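To make the CPU vs. wallclock distinction concrete, here is a small standalone illustration (this is not CmdStan’s timing code) of why the reported number grows with the thread count:

#include <chrono>
#include <ctime>
#include <future>
#include <iostream>
#include <vector>

// Busy-work so that every thread burns CPU time.
double burn(long n) {
  double x = 0.0;
  for (long i = 0; i < n; ++i)
    x += 1e-9 * i;
  return x;
}

int main() {
  const std::clock_t cpu_start = std::clock();               // process CPU time
  const auto wall_start = std::chrono::steady_clock::now();  // wallclock time

  std::vector<std::future<double>> futures;
  for (int t = 0; t < 4; ++t)
    futures.push_back(std::async(std::launch::async, burn, 50000000L));
  for (auto& f : futures)
    f.get();

  const double cpu_s = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
  const double wall_s = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - wall_start).count();

  // On Linux/macOS std::clock() sums the CPU time of all threads, so cpu_s
  // will be roughly 4x wall_s here; the same effect as 97s CPU vs 15s wall.
  std::cout << "CPU: " << cpu_s << "s, wallclock: " << wall_s << "s\n";
}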
Thanks for pointing that out. @syclik — did you update the doc? Mind updating the Wiki? I’d do it, but I’m not entirely sure how this works or if this doc is still relevant.
We need to push this doc up to the Stan reference manual, too, if it’s going to be specified commonly among the interfaces with the same environment variables (and I sure hope it is, since this is going to the math lib).
Oh… I can update the wiki. I indeed did not mention how to specify the number of threads to use via the runtime environment variable STAN_NUM_THREADS.
Right now the details for MPI and threads live on the wiki as they are both stan-math specific (we link to the wiki from the cmdstan manual). It is not clear to me where this should land in the stan reference manual. It would be good to avoid duplication of the documentation if possible.
… I hope it’s clear, @andre.pfeuffer, how you set the number of threads?
@wds15
I suppose I can use threading without installing MPI. Is this true?
And what happens if both threading and MPI are enabled at the same time?
Are they conflicting?
To use threading one has to set CXXFLAGS += -DSTAN_THREADS
in cmdstan-2.18.0/make/local and set STAN_NUM_THREADS in the environment.
Then recompile and map_rect will work in parallel?
I’m using Linux; I’d put the following line in my .bashrc file: export STAN_NUM_THREADS=4
or set it directly in the shell before running the model.