Parallel autodiff v4

I’m making a new thread following up on: Parallel autodiff v3

In that thread we designed and implemented a function reduce_sum with the signature:

real reduce_sum(F func, T[] sliced_arg, int grainsize, T1 arg1, T2 arg2, ...)

Where func is a function with the signature:

real func(int start, int end, T[] sliced_arg, T1 arg1, T2 arg2, ...)

reduce_sum implements the functionality defined by:

int N = size(sliced_arg);
real sum = func(1, N, sliced_arg, arg1, arg2, ...)

in parallel by assuming that the work done by func can be equivalently broken up into pieces:

int N = size(sliced_arg);
real sum = func(1, M, sliced_arg[1:M], arg1, arg2, ...) +
           func(M + 1, N, sliced_arg[M + 1:N], arg1, arg2, ...)

where 1 <= M < N (and this can be repeated recursively).
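This splitting rule can be sketched in plain C++. This is a toy serial sketch of the decomposition, not the actual Math branch implementation; `partial_sum` here just sums its slice:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Toy stand-in for func: it just sums the values in its slice. A real
// partial-sum functor would also use start/end and any shared arguments.
double partial_sum(std::size_t start, std::size_t end,
                   const std::vector<double>& slice) {
  return std::accumulate(slice.begin(), slice.end(), 0.0);
}

// Recursively split [start, end] (1-based, as in Stan) at a midpoint M,
// mirroring the reduce_sum decomposition above: slices no larger than
// grainsize are evaluated directly on a copy of the relevant elements.
double reduce(std::size_t start, std::size_t end,
              const std::vector<double>& sliced_arg, std::size_t grainsize) {
  std::size_t n = end - start + 1;
  if (n <= grainsize) {
    std::vector<double> slice(sliced_arg.begin() + (start - 1),
                              sliced_arg.begin() + end);
    return partial_sum(start, end, slice);
  }
  std::size_t M = start + n / 2 - 1;  // split point, 1 <= M < N
  return reduce(start, M, sliced_arg, grainsize)
       + reduce(M + 1, end, sliced_arg, grainsize);
}
```

The recursion bottoms out once a slice is no larger than grainsize, and any split point with 1 <= M < N gives the same total, which is exactly the assumption reduce_sum relies on.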

There’s currently a math branch that implements this in C++:

And a branch of stanc3 that handles the language bits; the Linux binary is available here:

Edit: @rok_cesnovar made a branch of cmdstan that has the Linux binaries included: (Parallel autodiff v4)

Alright, and the actual thing I wanted to get to: I think I have a model that scales linearly when running separate chains in parallel, but with threading I only get a 2x speedup.

Here is an example model with data:
nbme1.stan (1.1 KB) (2.6 MB)

You can run this model with:

./nbme1 sample data

Here is a threaded version for use with the new Math branch and the new stanc3 compiler: nbme6.stan (1.7 KB)

You can run this model with:

STAN_NUM_THREADS=4 ./nbme6 sample data

where STAN_NUM_THREADS sets the number of threads.

It takes about 18 seconds for the non-reduce_sum version of the model to run, and about 22 seconds for the reduce_sum version run with a single thread.

That doesn’t bother me so much, but the scaling for the threaded version seems quite bad.

With four threads it takes about 11 seconds for the reduce_sum model to run, so four chains would take 44 seconds or so. However, if I run four chains of the serial model in parallel I get linear speedup! (They still finish in ~18 seconds.)

You can compare against running four chains with a simple run script. Copy-paste this into a file:

./nbme1 sample data output file=output1.csv
./nbme1 sample data output file=output2.csv
./nbme1 sample data output file=output3.csv
./nbme1 sample data output file=output4.csv

And then do (assuming you saved those lines to a file, say runs.txt):

cat runs.txt | parallel -j 4

To run four chains in parallel.

You can make this model harder or easier by editing the N_subset variable in the data file. As I uploaded it, it only uses 2000 data points. You can push that up to 40000.


@wds15, @Stevo15025, @rok_cesnovar

Interesting, will have a look a bit later.

For now I just prepared a cmdstan branch that has everything you need if you are on linux (linux-stanc and correct stan math submodule).

If anyone builds themselves a mac/windows binary feel free to push it up to the branch or paste it here and I can do it.


Here’s another parallel model: base.stan (9.7 KB) (682.7 KB)

Here’s the serial version: base_serial.stan (8.5 KB)

The serial single core version takes about 35 seconds.

The single core reduce_sum version takes about 45 seconds.

A 4-core run of the reduce_sum code takes about 25 seconds (so 4 chains would take about 100 seconds).

A 4-chain run of the single core code takes about 45 seconds.

grainsize and N_subset are defined in the data file. Right now it’s grainsize = 100, N_subset = 400, so make that bigger before trying more threads.

Are people really using target += with _lpdf functions? That causes a gazillion normalizing constants to be computed that we don't need.

Welp, I am actually getting

> double free or corruption (!prev)
> Aborted (core dumped)

and seg faults when threads > 1. Using Ubuntu with g++ 8.3. Don't have the debug energy tonight but will try to look tomorrow. I merged recent Math develop into the parallel Math branch and that didn't fix it.

Turns out I didn't compile with STAN_THREADS. This idiot needs to go to sleep :)

For now I can confirm I get similar behavior on Ubuntu with g++ 8.3.

base.stan and base_serial.stan are brms code. @paul.buerkner, just so you're aware.

Cool. I downloaded the VTune profiler and it told me that TBB was spending a lot of time cloning recursive_reducer objects. I changed how memory is allocated for the adjoints to make this faster. It's still unusually slow, but that profiler seems to have a lot of info in it.

Thank you! brms supports computing marginal likelihoods via bridgesampling after the model fitting and thus requires the normalizing constants.


Yeah… that's why you calculate these constants, but my suspicion is that these constants can be expensive to calculate. An example is the Poisson distribution: with the constant you end up with lots of gamma function calls which you could otherwise avoid. Making these constants optional to calculate would be really good (or calculating them later somehow).

Now, I'll hopefully have time to look into this next week. What you can try, @bbbales2, is the thread-local facility from the TBB. You can store the large containers, which we then need only once per thread instead of once per reducer object. The final reduce would then operate on these per-thread data structures… but I am not sure if this really helps.
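The per-thread storage idea (what the TBB provides as tbb::enumerable_thread_specific) could look roughly like this, sketched with plain std::thread so it stands alone. `threaded_sum` and its layout are illustrative, not the actual reducer in the Math branch:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sketch of per-thread accumulation: instead of every reducer object
// carrying its own large buffer, each thread gets exactly one accumulator
// slot, and the final reduce combines the slots once at the end.
double threaded_sum(const std::vector<double>& data, std::size_t n_threads) {
  std::vector<double> per_thread(n_threads, 0.0);  // one slot per thread
  std::vector<std::thread> workers;
  std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
  for (std::size_t t = 0; t < n_threads; ++t) {
    workers.emplace_back([&, t] {
      std::size_t begin = t * chunk;
      std::size_t end = std::min(begin + chunk, data.size());
      for (std::size_t i = begin; i < end; ++i)
        per_thread[t] += data[i];  // thread-private slot, no locking needed
    });
  }
  for (auto& w : workers) w.join();
  // Final reduce over the per-thread structures.
  return std::accumulate(per_thread.begin(), per_thread.end(), 0.0);
}
```

One caveat with a layout this naive: adjacent slots share a cache line, so a real implementation would pad or use the TBB's own container to avoid false sharing.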

Other than that, I do think that the problem size is too small. I think the example I used took 70s for 200 iterations. I would start with these larger problems first, as they are the easier cases and anyway the more interesting ones to speed up.

As a next thing you should try the deterministic TBB scheduler combined with large grainsizes. I would start with a very large grainsize and scale it down by some factor on each try. The thing is that the TBB scheduler itself becomes a bottleneck if the amount of work is not very large, and the deterministic scheduler avoids that.
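As I understand the TBB range semantics (worth double-checking against the TBB docs), a blocked range is split in half whenever it is larger than the grainsize, so final chunks land between grainsize/2 and grainsize elements. A small sketch of that splitting rule, without the TBB itself:

```cpp
#include <cstddef>
#include <vector>

// Mimic the blocked-range splitting rule: a range is divisible while its
// size exceeds the grainsize, and each split cuts it in half. Each leaf
// that lands in `chunks` is one task the scheduler has to distribute.
void split(std::size_t n, std::size_t grainsize,
           std::vector<std::size_t>& chunks) {
  if (n <= grainsize) {
    chunks.push_back(n);
    return;
  }
  split(n / 2, grainsize, chunks);
  split(n - n / 2, grainsize, chunks);
}
```

With N = 2000 and grainsize = 125 this produces 16 tasks; with grainsize = 2000 it produces a single task, which is the regime where scheduler overhead disappears. That's why starting with the grainsize equal to the number of elements and working downward is a reasonable search strategy.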

Calculating them later will hardly be possible in this generality. Please feel free to open an issue to make their computation optional.

Well I just got rid of the largest memory thing.

Both models can scale up a ton by increasing N_subset (which also allows a larger grainsize). Maybe you're right here, but I don't feel that great overall about this. (Edit: just to be clear, I'm alright with you being right; I just don't feel great about things being slow on these small problems :P)

Here’s the VTune output for one run of the nbme model with N_subset = 2000, grainsize = 125:

Here’s the VTune output for one run of the nbme model with N_subset = 16000, grainsize = 1000:

In this case the serial version took 240 seconds. So 240 vs. 70 is like a 3.5x speedup compared with the 2x speedup with the smaller problem.

Average leapfrog steps per sample was 51 and sampling took 100 seconds. That means per leapfrog step we had 100 seconds / (51 leapfrogs/sample * 1000 samples) ≈ 0.002 seconds/leapfrog.

I guess 2 milliseconds isn’t a lot of work to break up, but I still feel pretty uncertain about this.

Someone else is using 4 cores on my computer now, but I just ran one of the poisson models: poisson.stan (736 Bytes) poisson_serial.stan (396 Bytes) (6.3 KB)

Single-core poisson_serial was 8 seconds; 4 threads was 3.3 seconds. So that's like a 2.4x speedup, compared with the base.stan model which got less than a 2x speedup on 4 cores and the nbme model which got a 2x speedup on 8 cores.

I guess my question is: where is this overhead coming from? Is it associated with scheduling threads? In that case I wonder if deterministic scheduling, like you suggest, would be better.

Or is it coming from the thread-local stuff? Or from how the thread-specific stacks are managed?

I guess I agree that the scaling would get better if we upped the problem size here. I’d like it to work for these problems though, if possible, but I also agree that small problems aren’t the most important thing here. If a model takes 30 seconds to run, that’s good enough.

I thought the ~ was just syntactic sugar? Or is there some way it's not equivalent to target += normal_lpdf(y | ...)?

~ drops proportionality constants. The explicit calls to *_lpdf don’t. The compiler sets the propto template parameter that’s in all the lpdfs according to which is used. I forget which way is which.


propto = false keeps all constants, propto = true drops them.
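To make that concrete, here's a simplified scalar Poisson lpmf in the spirit of that design; an illustrative sketch, not the actual Math library code:

```cpp
#include <cmath>

// Simplified Poisson lpmf with a propto template parameter. When
// propto = true, the lgamma(y + 1) normalizing term -- constant with
// respect to the parameter lambda -- is dropped, skipping one gamma
// function call per observation.
template <bool propto>
double poisson_lpmf(int y, double lambda) {
  double lp = y * std::log(lambda) - lambda;
  if (!propto) {
    lp -= std::lgamma(y + 1.0);
  }
  return lp;
}
```

Both versions have the same gradient with respect to lambda, which is why sampling is unaffected, but anything that needs the actual log density (like bridgesampling) needs propto = false.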


The results you show make perfect sense to me. When you increase the problem size, you can get better scaling with more cores. Speeding up small problems is really, really hard. I mean, just imagine the small problem basically fits in the CPU cache… when you split that up, you always lose. If you have a problem which blows the cache size, then splitting things up onto multiple CPUs (and CPU caches) is a good thing.

I am relatively sure that for the small problems the deterministic scheduling is preferable. BTW, the TBB uses a thread pool, so no thread is spawned during execution. Instead, the TBB has to distribute tasks which are executed by the threads of the pool. So try the deterministic scheduler, which you should set up so that the grainsize is as large as possible, splitting the work into as few chunks as possible. I would start with the grainsize equal to the number of elements, as the minimal chunk size is grainsize/2 as I recall (maybe check the TBB docs). Then you can decrease the grainsize.

I will look myself early next week.

@rok_cesnovar I updated the math in the cmdstan you put together.


Cool cool. Feel free. I forgot to mention yesterday I also added your examples and data on the branch.

We need to come up with a system to create the stanc3 binaries for all three OSes for PRs.

Running the Linux binary with Docker is really not that hard. I can set up a quick-and-dirty wiki page for that if needed.

Another idea could be to create a Docker image which does the Stan-to-C++ conversion using a server like httpstan?

This is still giving me really weird performance characteristics.

First of all I’m updating the nbme models in the first post. At least the serial one was messed up.

For the nbme model with N_subset = 2000

21s for serial code
24s for 1 core, grainsize = 125
11s for 8 cores, grainsize = 125

For the nbme model with N_subset = 20000

230s for serial code
170s for 1 core, grainsize = 1250
75s for 8 cores, grainsize = 1250

So everything is 10x bigger, but the threaded scaling is still the same! Ugh!!

I checked bin/stansummary on all these outputs, so I think I'm running things right. I don't have an explanation for why the serial code is slower than the single-core reduce_sum version in the second case (presumably a cache thing), but I'm more concerned that my threading gains didn't get better.

I recoded everything with a parallel_for that should have had no clone overhead, but that wasn’t any faster.