Map_rect concurrent about to land




Thanks to @sakrejda, the PR for the map_rect_concurrent implementation is reviewed and ready to be merged. However, before hitting “merge” I wanted to get everyone informed and on board with the decisions taken. In particular:

  • map_rect_serial is dropped, as the concurrent version simplifies to a serial version if STAN_THREADS is not defined during compilation.
  • The number of threads created is (roughly) controlled by the environment variable STAN_NUM_THREADS (just like OpenMP, which uses OMP_NUM_THREADS).
  • Threading machinery is only ever used if things are compiled with STAN_THREADS.
  • Not setting STAN_NUM_THREADS, or setting it to a non-numeric value, results in a single thread being used.
  • Setting STAN_NUM_THREADS to a positive value means use (at most) that many threads.
  • Setting STAN_NUM_THREADS to -1 means use as many threads as there are CPUs on the machine.
  • If there are fewer jobs to execute than threads, the number of threads is reduced to the number of jobs.
  • Anything else will just use 1 thread.
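To make the conventions above concrete, here is a small sketch of thread-count logic consistent with those rules. This is a hypothetical helper for illustration only, not the actual function in the Stan math library (names and details are made up):

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <thread>

// Hypothetical sketch of the STAN_NUM_THREADS rules listed above;
// the real implementation in Stan may differ.
int get_num_threads(int num_jobs) {
  int num_threads = 1;  // default: single thread
  const char* env = std::getenv("STAN_NUM_THREADS");
  if (env != nullptr) {
    try {
      const int requested = std::stoi(env);
      if (requested > 0) {
        num_threads = requested;  // use at most that many threads
      } else if (requested == -1) {
        // use as many threads as there are CPUs on the machine
        num_threads = std::max(1u, std::thread::hardware_concurrency());
      }
      // anything else: keep the single-thread default
    } catch (const std::exception&) {
      // non-numeric value: keep the single-thread default
    }
  }
  // never use more threads than there are jobs
  if (num_jobs < num_threads)
    num_threads = num_jobs;
  return num_threads;
}
```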

I hope everyone is fine with these conventions. I suggest discussing this on Thursday (unless everyone is happy with it right away).

Note that because I am using the C++11 async facility, I can only control how many chunks of work are generated. This should translate into how many threads are created, but I have no direct control over that, since it is decided by the implementation. This is why the thread counts quoted above are maximums (maybe we can improve on that with a thread pool in the future).
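As an illustration of that point (not Stan's actual code), the pattern looks roughly like this: the work is cut into a fixed number of chunks and each chunk is handed to std::async; how many threads actually run is implementation-defined, we only control the chunking:

```cpp
#include <algorithm>
#include <future>
#include <numeric>
#include <vector>

// Toy example of the C++11 async chunking pattern described above.
// We control the number of chunks; the implementation decides how
// many threads actually back them.
double sum_in_chunks(const std::vector<double>& x, int num_chunks) {
  std::vector<std::future<double>> futures;
  const std::size_t chunk_size = (x.size() + num_chunks - 1) / num_chunks;
  for (std::size_t start = 0; start < x.size(); start += chunk_size) {
    const std::size_t end = std::min(start + chunk_size, x.size());
    futures.push_back(std::async(std::launch::async, [&x, start, end] {
      // each chunk sums its own slice of the data
      return std::accumulate(x.begin() + start, x.begin() + end, 0.0);
    }));
  }
  double total = 0.0;
  for (auto& f : futures)
    total += f.get();  // blocks until that chunk is done
  return total;
}
```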

… and yes, within-chain parallelization in Stan is now super close to landing.


See the PR:


BTW the “at most” part here also corresponds to how OMP chunks work. You can’t (easily?) force it to use more threads than it wants.


Stan’s take on C++17 parallel STL.



@Bob_Carpenter, @syclik, @seantalts … comments on this or can we merge? Let me know before the weekend if possible so that the PR can go in. I am really looking forward to that.


With the changes we discussed… this is MERGED. Now we can finally do hierarchical ODE models in finite time using Stan.


Do you have any guidance on what sort of speedup this can give and what sort of overhead it has? How slow does each leapfrog step need to be before this is worth using? How many threads can I start before the overhead is no longer worth it? Even one or two examples would be helpful in determining if this is appropriate for my models.


I have posted a few case studies on the forum which used MPI instead of threading. I would guess that threading has similar properties, but we still need to verify that.

So maybe search the forum for MPI / speedup / wds15 and you should find some material.

My guess is that whenever your model evaluation time crosses the 1h mark, this approach is helpful if the model structure lends itself to it. By that I mean that you can parallelize over some natural unit in your model and the likelihood contribution per unit is costly to calculate. A generalized linear model with lots of data is an example where I would guess you do not get much of a gain… although I would not rule it out if you work out a good blocking of the data.

I have worked out examples where ODEs are solved per patient and you get almost linear speedups in the number of CPUs. For analytically solvable pharmacokinetic problems you also get very nice speedups. When it’s just about evaluating a generalized linear model, things are different.


Found it. Parallelization (again) - MPI to the rescue!

But I’m guessing it has more to do with the time per leapfrog step than the overall fitting time?


Yes, it’s about the time it takes to calculate the log likelihood and its gradient. This is practically the time per leapfrog step, yes.

Since it is a bit of a pain to use map_rect due to the need for packing and unpacking the parameters and the data, I was referring to the total time, which is what matters to a user.


Hi, this is awesome!

I’ve built and tested the develop branch (namely the map_rect_concurrent_test) with g++ 4.9.3 on Windows, and successfully compiled and run the mapped logit example from the manual, included below.

However when I set STAN_NUM_THREADS to anything but 1, CmdStan does not give the correct output. Instead the program prints the CmdStan control arguments, terminates, and saves them to the output file. There are no samples, progress messages or other errors.

Any ideas where I might be going wrong?


// Mapped logit example
functions {
  vector lr(vector beta, vector theta, real[] x, int[] y) {
    real lp = bernoulli_logit_lpmf(y | beta[1] + to_vector(x) * beta[2]);
    return [lp]';
  }
}
data {
  int y[12];
  real x[12];
}
transformed data {
  // K = 3 shards
  int ys[3, 4] = { y[1:4], y[5:8], y[9:12] };
  real xs[3, 4] = { x[1:4], x[5:8], x[9:12] };
  vector[0] theta[3];
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  target += sum(map_rect(lr, beta, theta, xs, ys));
}


Glad to see this being picked up.

Now this should work on Windows, but I did not have a chance to test it. Let me try to get this going on Windows so that I can see the issue. In the meantime… is a newer compiler available to you on Windows? I guess you use Rtools for Windows, is that right?

Since we are using vanilla C++11, this has to work somehow on Windows.


OK, now I know what you mean. It looks like the Rtools mingw compilers DO NOT support threads. There seem to be mingw distributions which support threading out of the box, but we should probably try to get Rtools for Windows working with this. A solution could be to get this here up and running:

If you want to try that, then please report back your progress. I will look into it when I find a calm moment. Let’s see. Windows… argh!


Bummer, that’s a nuisance.

It wasn’t immediately clear how best to use these headers, but I tried adding -isystem /path/to/mingw-std-threads to the CmdStan makefile and rebuilding, which did not work. I’m using Rtools 3.4, which looks like the latest stable release (3.5 is available, though).

I can report that threading works as intended on g++ 5.4.0 on Ubuntu 16.04 (which is probably the more important use case for me, Windows was a distraction). I did have to add -pthread to the CXX_FLAGS for it to compile properly.

Thanks for all your hard work!


R from conda uses msys2-mingw64. I believe threading should work with that on Windows.


I’ve been playing with this on our server and it’s working great. A dummy ODE model goes from ~60 sec on one thread to ~15 sec with 10 threads.

One question, one comment:

  • Is there a rule of thumb for splitting up the data? With MPI it makes sense to have # shards = # cores, but I’m not sure if threading shares the same overheads.

  • The elapsed time counters are truly out of whack when doing parallel computation. The 15 sec run reported a total runtime of 97 sec (I used a stopwatch).



Sounds great!

  • My advice is to have the data split align with the structure of your model. Hierarchical models are easy in this sense: split by the unit over which your hierarchical model is defined (if you have multiple levels, choose the level at which to split). Internally, the split-up data set is chunked into as many pieces as there are workers, so the ordering of your data determines how the work distributes over the chunks. That should be on average the same work per chunk if your data is not ordered in a particular manner. So keep an eye on the order of the jobs if you want it super efficient (I would start with random order).

  • We have not yet combined MPI with threading, and I haven’t tested it. However, if you want to, you can actually do that already using nested map_rect. MPI will always be the outer parallelization engine, and if you nest a map_rect call inside it, you will split by threads. However, I would not bother with that at the moment, but let me know what happens should you try.

  • The time reported by CmdStan is CPU time, not wallclock time. So on a single thread things take 60s of CPU time, while the accumulated CPU time across 10 threads is 97s. This shows the additional overhead due to threading (OK, it only roughly aligns, but this is my understanding of it). We will probably need to change the reported time from CPU time to wallclock time. When I run benchmarks, I use the command “time” to get those numbers.
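The CPU-time vs. wallclock distinction can be seen in a toy C++ program (this is an illustration, not CmdStan code): std::clock() accumulates CPU time across all threads of the process on POSIX systems, while std::chrono::steady_clock measures elapsed real time, so with several busy threads the CPU time can exceed the wall time, just like the 97s vs. 15s above.

```cpp
#include <chrono>
#include <cmath>
#include <ctime>
#include <thread>
#include <utility>
#include <vector>

// Spin the CPU; volatile keeps the loop from being optimized away.
void busy_work() {
  volatile double x = 0.0;
  for (int i = 0; i < 10000000; ++i)
    x = x + std::sqrt(static_cast<double>(i));
}

// Run num_threads busy workers and return {cpu_seconds, wall_seconds}.
// std::clock() sums CPU time over all threads (on POSIX), while
// steady_clock measures elapsed wallclock time.
std::pair<double, double> measure(int num_threads) {
  const std::clock_t cpu_start = std::clock();
  const auto wall_start = std::chrono::steady_clock::now();
  std::vector<std::thread> workers;
  for (int i = 0; i < num_threads; ++i)
    workers.emplace_back(busy_work);
  for (auto& t : workers)
    t.join();
  const double cpu_s =
      static_cast<double>(std::clock() - cpu_start) / CLOCKS_PER_SEC;
  const double wall_s = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - wall_start).count();
  return {cpu_s, wall_s};
}
```

On a multi-core machine, measure(4) typically reports a CPU time several times larger than the wall time, which is the effect seen in the CmdStan output.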