Map_rect threading


This weekend a few improvements to map_rect went into Stan-math develop, which made me curious how the current threading and MPI implementations fare against each other. Rerunning the benchmark from my StanCon 2018 contribution, I now get this:


As you can see:

  • the 1 core performance for MPI and threading is essentially the same
  • the scaling with more CPUs is really good for MPI and somewhat worse for threading

Very likely the threading case suffers from not using a thread pool: threads are re-created for each call, and the memory for the AD tree has to be reallocated along with them. The forthcoming Intel TBB will introduce such a thread pool and will hopefully provide better scaling here.
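To make the "re-creation of threads" point concrete, here is a minimal sketch of the general pattern, not the Stan-math implementation. The function name `map_chunks_fresh_threads` is made up for illustration. Every call starts new threads through `std::async` with `std::launch::async` and joins them before returning, so the start-up cost is paid on each call, which is what a thread pool would avoid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical helper, not Stan-math code: applies `f` to every element of
// `jobs`, splitting the work into at most `num_threads` contiguous chunks.
// Precondition: num_threads > 0.
std::vector<double> map_chunks_fresh_threads(
    const std::vector<double>& jobs,
    const std::function<double(double)>& f,
    std::size_t num_threads) {
  assert(num_threads > 0);
  std::vector<double> out(jobs.size());
  if (jobs.empty()) return out;

  // Ceiling division so that every job lands in exactly one chunk.
  const std::size_t chunk = (jobs.size() + num_threads - 1) / num_threads;

  std::vector<std::future<void>> pending;
  for (std::size_t begin = 0; begin < jobs.size(); begin += chunk) {
    const std::size_t end = std::min(begin + chunk, jobs.size());
    // std::launch::async asks for a new thread per chunk on every call.
    pending.push_back(
        std::async(std::launch::async, [&out, &jobs, &f, begin, end] {
          for (std::size_t i = begin; i < end; ++i) out[i] = f(jobs[i]);
        }));
  }
  for (auto& p : pending) p.get();  // wait for all chunks to finish
  return out;
}
```

Calling this once per gradient evaluation means the threads (and anything stored per thread) are set up again every time; a pool keeps the workers alive between calls.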

As a side effect of reducing the performance penalty of turning on threading, we now use a different technique to store things. As a consequence, map_rect now works with threading on Windows using the RTools gcc 4.9.3 compiler.

If people out there could actually confirm once more that map_rect now works on Windows with the gcc 4.9.3 from RTools, then that would be very reassuring. I have tested it and it’s now part of our testing suite - so it should all work now. To try this out one would need to download the current cmdstan develop branch.



Nice! What exactly from the repo should one run to check whether it works on Windows or not?

I have a Windows machine for testing that has the minimum installation (gcc with rtools) and I could run it on that. But not sure what to test exactly.

I was more referring to people on the forum who have their models up and running. I saw a while ago that some Windows users even went to the trouble of running things inside of docker. Anyway, if you want to get the above up and running you can use the files attached.

init-1.R (2.0 KB)
stan_data.R (5.4 KB)
warfarin_pd_tlagMax_2par_generated_218.stan (6.5 KB)



I was still a bit bothered by the performance offset between MPI and threading, as I was expecting threading to be just as fast as MPI with the recent, faster TLS changes. Thinking about it, I figured that the threading program may have inefficient memory allocation, since we do not use a so-called scalable memory allocator, which is strongly advised for multi-threaded applications. A scalable memory allocator essentially ensures memory locality with respect to the core that requests the memory. The nice thing is that with the Intel TBB we can simply link the Stan program against their malloc library, which replaces all malloc calls with a scalable malloc. Here is the result from that:


Given that I have not changed any code, but just linked against the tbbmalloc_proxy library this is a really nice result.

It also shows that multi-core stuff is more involved than one might think, as memory locality is more complex. Our current approach of growing a single memory pool may in fact not be adequate, since the memory pool is always in the vicinity of the “main” core but not of all the others.
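A small sketch of the locality idea, under stated assumptions: on many systems (Linux by default, as far as I know) a memory page ends up physically close to the core that first writes to it. The function below is hypothetical and not taken from Stan-math. Each worker allocates and fills its own buffer inside the worker thread, instead of the main thread preparing one big pool for everyone. Whether this really yields local memory depends on the OS and the allocator, so treat it as an illustration rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical illustration, not Stan-math code.
// Returns one partial sum per worker. Precondition: num_workers > 0.
std::vector<double> sum_with_worker_local_buffers(std::size_t num_workers,
                                                  std::size_t n_per_worker) {
  assert(num_workers > 0);
  std::vector<double> partial(num_workers, 0.0);
  std::vector<std::thread> workers;
  for (std::size_t w = 0; w < num_workers; ++w) {
    workers.emplace_back([w, n_per_worker, &partial] {
      // Allocated and first written by the worker that later reads it.
      std::vector<double> local(n_per_worker, 1.0);
      // Each worker writes only its own slot of `partial`.
      partial[w] = std::accumulate(local.begin(), local.end(), 0.0);
    });
  }
  for (auto& t : workers) t.join();
  return partial;
}
```

The contrast case would be the main thread filling one large buffer and handing slices to the workers; the arithmetic is identical, only the thread that first touches the memory differs.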

The above is based on the TLSv6 PR which will hopefully land in Stan-math soon.

These results suggest that we should push the Intel TBB into Stan-math sooner rather than later. I think it is worthwhile for the tbbmalloc_proxy alone.



Can you paraphrase the results? I didn’t understand the three different back ends.

Where is the malloc pressure? In intermediate Eigen and std vector structures? There shouldn’t be any from the autodiff stack as that just happens once.

The backends:

  • developTLS: uses threading with the TLS PR (STAN_THREADS is on)
  • developMPI: uses MPI (but STAN_THREADS is off)
  • developTLS_tbbmalloc: same as developTLS, but this one links in the libtbbmalloc_proxy - this leads to a scalable malloc being used

Where is the malloc pressure? In intermediate Eigen and std vector structures? There shouldn’t be any from the autodiff stack as that just happens once.

The malloc happens only once, which is good, but it happens close to whichever core is the main one. In a multi-threaded application the allocated memory needs to be close to the core that is actually doing the work. That’s a different thing, and it means that memory locality is more complex to achieve.

Oh, right. You have multiple thread-local autodiff stacks. Those should only get allocated once, too.
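A toy sketch of what "multiple thread-local stacks" means in C++ terms. `ToyStack` and the helper functions are hypothetical stand-ins, not the Stan-math autodiff types. A `thread_local` object is created the first time a given thread uses it and lives until that thread exits, so a long-lived thread allocates its stack once, while a freshly created thread always starts with an empty one.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for an autodiff stack; not the Stan-math type.
struct ToyStack {
  std::vector<double> values;
};

// Returns the calling thread's own stack instance.
inline ToyStack& local_stack() {
  thread_local ToyStack stack;
  return stack;
}

// Pushes `n` values on the calling thread's stack and returns its new size.
inline std::size_t push_n(std::size_t n) {
  ToyStack& s = local_stack();
  for (std::size_t i = 0; i < n; ++i) s.values.push_back(1.0);
  return s.values.size();
}

// Runs push_n(n) in a brand-new thread and returns the size that thread saw.
inline std::size_t push_n_in_new_thread(std::size_t n) {
  std::size_t seen = 0;
  std::thread t([&seen, n] { seen = push_n(n); });
  t.join();
  return seen;
}
```

If the main thread has already pushed three values, `push_n_in_new_thread(5)` still reports 5, because the new thread has its own empty stack. That also hints at why re-creating threads is costly: the per-thread storage is built up again each time.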

I don’t get it. The only one that’s even close to performant per flop is develop MPI. Why build these other things if they’re worse than our existing MPI?

MPI is a pain to program, and it will probably never work on Windows. Threading is much easier for users to set up and far easier to develop with.

And, I mean, MPI has perfectly scalable allocation…so all we need to do is take the specifics of threading into account to get good performance. I very much hope that a backend which uses the Intel TBB for every part of the parallelization will get us even closer to the MPI performance.

As I am building an R package based on Stan for third-party users, I would like confirmation of whether threading depends on OpenMPI. In other words, is some MPI-related framework needed, or is installing rstan and setting the right flags enough?

Threading does not need anything MPI-related to be installed on the target system. To my knowledge, threading is the only supported backend for RStan which enables parallelisation in Stan programs.


When two pieces of data required by different cores are in the same cache line the entire line gets copied. The articles about that point make it sound like this can devolve into thrashing at the cache level where the line is copied back and forth.
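The usual remedy for this cache-line effect is to pad per-thread data so that each piece sits on its own line. Below is a minimal sketch, assuming 64-byte cache lines (common on x86-64 but not universal). `PaddedCounter`, `kAssumedCacheLine`, `kWorkers` and `count_in_parallel` are names invented for this example.

```cpp
#include <array>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines; adjust for other hardware.
constexpr std::size_t kAssumedCacheLine = 64;
constexpr std::size_t kWorkers = 4;

// alignas pads the struct to a full (assumed) cache line, so neighbouring
// counters in an array never share a line.
struct alignas(kAssumedCacheLine) PaddedCounter {
  long value = 0;
};
static_assert(sizeof(PaddedCounter) == kAssumedCacheLine,
              "each counter should occupy exactly one assumed cache line");

// Every worker increments only its own counter `iters` times.
std::array<long, kWorkers> count_in_parallel(long iters) {
  std::array<PaddedCounter, kWorkers> counters{};
  std::vector<std::thread> workers;
  for (std::size_t w = 0; w < kWorkers; ++w) {
    workers.emplace_back([&counters, w, iters] {
      for (long i = 0; i < iters; ++i) ++counters[w].value;
    });
  }
  for (auto& t : workers) t.join();

  std::array<long, kWorkers> out{};
  for (std::size_t w = 0; w < kWorkers; ++w) out[w] = counters[w].value;
  return out;
}
```

Without the `alignas`, four `long` counters would fit in a single 64-byte line, and each increment by one core could invalidate the copy held by the others. The results are the same either way; only the cache traffic differs.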

I understand the following is required for threading:

CXXFLAGS += -DSTAN_THREADS -pthread

What set of flags does one need to set for MPI to work? And should the -DSTAN_THREADS and -pthread flags be retained alongside the MPI-specific flags?

No…if you use moo then do not use threading. Have a look at the wiki for the MPI flags.

Edit: moo is what t9 made out of mpi!!!

Essentially yeah, it’s a case of ‘true sharing’ where both cores have to do some weird things to make sure neither one is overwriting the data the other one is using.

@wds15 have you tried increasing the alignment size for the malloc? Setting it to your CPU’s L1 cache size may give you slightly better performance.

I doubt that this helps us. We are using large memory pools and I don’t think that their exact alignment is important. At least this would be my intuition.