Linear, parallel regression


#41

This is great, thank you. Not linear speedup yet, but closer! 😀

I’ll post here when I have written up a more complex model. I’m teaching this week, so might take a few days.


#42

It gets slower using 2 shards and faster using 4. How can that be?


#43

I guess there must be a fixed time cost of splitting up the data (in addition to the map_rect call itself)?

Maybe there is a nice DGP that explains the patterns in the tables above, as a function of the number of shards and the linear computing time?
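One candidate (purely an assumption sketched in Python, not fitted to the tables) is a fixed per-shard overhead plus the linear work split across however many shards can actually run in parallel:

```python
# Toy cost model (an assumption, not measured from the benchmarks):
# total time = per-shard splitting overhead + linear work divided across
# the usable cores.
def predicted_time(t_linear, n_shards, n_cores, overhead_per_shard):
    return overhead_per_shard * n_shards + t_linear / min(n_shards, n_cores)

# e.g. 4 cores, overhead 1ms/shard, 100ms of linear work:
print(predicted_time(100, 1, 4, 1))  # 101.0
print(predicted_time(100, 4, 4, 1))  # 29.0
print(predicted_time(100, 8, 4, 1))  # 33.0
```

Interestingly, under this model "slower at 2 shards but faster at 4" is impossible (2 shards losing to 1 requires overhead > half the linear time, which makes 4 shards lose as well), so the pattern in #42 would suggest the splitting overhead is not simply proportional to the shard count.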


#44

Nice to see all these benchmarks. There is an additional case you may want to follow up on: for large Stan programs it can be beneficial to use map_rect even when running on only a single core. So if you compile Stan without -DSTAN_THREADS and thus use only a single core, but still split the data into multiple shards, you should also see speedups (only for large problems). At least I saw up to a 2x speed increase when doing so, and from what I have seen this effect gets more pronounced the larger the problem is. The reason seems to be (I only have an intuitive understanding here) that the AD tree then gets evaluated in smaller chunks, which fit better into the CPU caches. So it’s better to calculate many small AD trees than a single big one.


#45

Here are some simple results. Note though that these are relatively short runs, so there might be measurement error. I attach the bash-script in case @jroon wants to give it a go.

@wds15: this is from the ordinal probit. Do you think (i) models with more operations in the likelihood evaluation, or (ii) models with higher treedepth could benefit from map_rect on a single core?
Edit: I’m thinking specifically of e.g. horseshoe-shrinkage models, which require adapt_delta=.999 and hence a high treedepth to avoid divergences. I.e. the data/parameters might not be very large, but there are lots of AD evaluations.

time_in_ms order shards NUM_THREADS time_linear ratio
1,310 2 10 -1 991 1.322
1,068 2 10 1 991 1.078
633 2 5 -1 991 0.639
1,189 2 5 1 991 1.200
641 2 4 -1 991 0.647
1,075 2 4 1 991 1.085
819 2 2 -1 991 0.826
1,207 2 2 1 991 1.218
1,212 2 1 -1 991 1.223
1,135 2 1 1 991 1.145
2,509 3 10 -1 5,686 0.441
6,242 3 10 1 5,686 1.098
1,957 3 5 -1 5,686 0.344
6,787 3 5 1 5,686 1.194
2,506 3 4 -1 5,686 0.441
7,350 3 4 1 5,686 1.293
5,490 3 2 -1 5,686 0.966
7,072 3 2 1 5,686 1.244
7,185 3 1 -1 5,686 1.264
7,179 3 1 1 5,686 1.263
23,105 4 10 -1 43,363 0.533
53,215 4 10 1 43,363 1.227
26,088 4 5 -1 43,363 0.602
54,945 4 5 1 43,363 1.267
29,206 4 4 -1 43,363 0.674
49,209 4 4 1 43,363 1.135
44,859 4 2 -1 43,363 1.034
60,593 4 2 1 43,363 1.397
73,637 4 1 -1 43,363 1.698
67,867 4 1 1 43,363 1.565

#46

Hi!

Maybe more difficult models (large treedepth + adapt_delta) are better for map_rect… at least I would hope so, and I would be very interested; especially in the Finnish horseshoe.

Just to make sure: the runs you just posted used two different binaries, right? So for STAN_NUM_THREADS=-1 you compiled a binary with -DSTAN_THREADS, while for STAN_NUM_THREADS=1 you compiled another binary without -DSTAN_THREADS… right? This is important, as different things end up in the binary depending on how it is compiled.


#47

No, they were with the same binary…

I’ll take a look at Vehtari’s shrinkage model.


#48

Sorry guys, I’m suddenly busy dealing with other stuff. I’ll tune back in in a few days!


#49

Finally weekend - and time for more benchmarking.

Attached is @avehtari’s horseshoe: both the version in the paper and a parallel implementation of the same model. I generate a small dataset (N=100, K=90) with 9 active regressors and very uninformative priors. Then I run the model several times with both the parallel and the single-core version, with adapt_delta=.9999 for all runs but with differing max_depth. The idea is that all runs will bounce against the max treedepth, and we can then see what happens to the time used by the single/parallel version with increasing tree size. I use 10 shards and STAN_NUM_THREADS=-1 for the parallel runs.

Results are not impressive for the parallel model. These are short runs, so noise and all, but overall it is slower than the single-core version. Does anyone have an idea why this seems to be the case?

minttu_horseshoe.stan (1.3 KB)
minttu_horseshoe_par.stan (2.1 KB)
gen_dat.r (568 Bytes)
start.txt (890 Bytes)

N_warmup N_samples max_tree time_parallel_ms time_linear_ms ratio
400 400 5 4,147 1,114 3.723
333 333 6 6,932 1,856 3.735
285 285 7 11,513 2,950 3.903
250 250 8 20,010 5,150 3.885
222 222 9 35,809 8,932 4.009
200 200 10 63,571 15,370 4.136
181 181 11 102,217 17,035 6.000
166 166 12 80,258 42,733 1.878
153 153 13 290,967 74,095 3.927

post.r (380 Bytes)

Edit: This is a small dataset and a simple model, so the single-core model is hard to beat. However, I had a small hope that some of the map_rect overhead would be paid once per iteration, such that higher treedepths would show an improvement for parallel runs.


#50

My guess would be that with these horseshoe models the sampler has to work a lot, and as map_rect only speeds up the log-prob and its gradient evaluation, that is no longer the major work for such Stan programs. Thus, map_rect should only pay off if the gradient evaluation takes long (it needs to be the dominating performance bottleneck) in the single-core run, so maybe start watching the gradient evaluation time printed at the beginning of each program.

Still… for large datasets, the parallel version hopefully performs better (or at least allows the performance to scale).
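The point above, that map_rect only accelerates the log-prob/gradient evaluation, is just Amdahl’s law: the serial share of each iteration caps the achievable speedup. A minimal sketch (the fractions are illustrative assumptions, not measurements from the horseshoe runs):

```python
# Amdahl's law: if only a fraction p of the per-iteration work is
# parallelised (here: the gradient evaluation inside map_rect), the
# speedup on n cores is capped at 1 / (1 - p), no matter how large n is.
def amdahl_speedup(p, n_cores):
    return 1.0 / ((1.0 - p) + p / n_cores)

# Illustrative: gradient = 95% of the work -> ~11x max on 24 threads;
# gradient = 50% of the work -> never more than 2x on any machine.
print(round(amdahl_speedup(0.95, 24), 1))     # 11.2
print(round(amdahl_speedup(0.50, 10**6), 2))  # 2.0
```

This would explain why the horseshoe, where the sampler’s own work dominates, shows no benefit while the large linear regressions do.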


#51

So I decided to try some runs with higher shard numbers. Check out these results for the linear model we were using earlier. Note I fixed K = 100 for all the data files, and all of these were run with STAN_NUM_THREADS=-1 on the 12-core/24-thread desktop.

time_in_ms order shards time_linear ratio
5,526 3 4 2,775 1.991
9,320 3 12 2,775 3.359
16,217 3 24 2,775 5.844
8,885 3 50 2,775 3.202
13,468 3 100 2,775 4.853
15,072 3 200 2,775 5.431
13,279 3 500 2,775 4.785
26,668 3 1,000 2,775 9.610
42,339 4 4 67,159 0.630
63,825 4 12 67,159 0.950
33,780 4 24 67,159 0.503
22,363 4 50 67,159 0.333
35,656 4 100 67,159 0.531
20,535 4 200 67,159 0.306
26,803 4 500 67,159 0.399
26,550 4 1,000 67,159 0.395
680,906 5 4 561,257 1.213
2,389,357 5 12 561,257 4.257
1,023,262 5 24 561,257 1.823
520,853 5 50 561,257 0.928
235,722 5 100 561,257 0.420
409,437 5 200 561,257 0.730
73,767 5 500 561,257 0.131
84,353 5 1,000 561,257 0.150

I glanced at the CPU usage the odd time while these were running, and I noted the faster runs were using more of the available CPU power, maybe 90%. The less efficient ones were at maybe 30% CPU usage (spread across all cores, of course).
Note that the data file sizes for the different orders were: order 3 = 1.9MB, order 4 = 19.3MB, order 5 = 193.4MB.


#52

Cool!!! Which model is this again?

Your results appear to show when map_rect is good… with lots of data, as I suspected. It is nice to see the scaling of Stan’s performance.


#53

…so this means shard counts WAY in excess of the number of threads give the best results!?

I’m not getting how the overhead works here.


#54

The more shards, the smaller the nested AD tree we are evaluating. This yields better cache usage. I would speculate that there is an optimal chunk size. This is obviously balanced against the need for large overall workloads. Interesting… hopefully we can understand this better and do some automatic tuning eventually.
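That automatic tuning could, in principle, look like the sketch below (hypothetical logic, not Stan’s actual map_rect implementation): treat the user’s shards as granular jobs and regroup them into chunks near some cache-friendly target size.

```python
# Hypothetical chunking scheduler: the user supplies many small shards,
# and the runtime groups consecutive job indices into chunks of at most
# jobs_per_chunk each (the last chunk may be smaller).
def make_chunks(n_jobs, jobs_per_chunk):
    return [list(range(start, min(start + jobs_per_chunk, n_jobs)))
            for start in range(0, n_jobs, jobs_per_chunk)]

print(make_chunks(10, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A real scheduler would pick jobs_per_chunk from the cache size and the per-job cost, which is exactly the tuning question raised above.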


#55

This is the linear model that Ole wrote above. Reposting here for clarity:
reg_par.stan (986 Bytes)

I think it is as much about the memory size of the shards. I’m currently running a model with 100,000 shards, i.e. one shard per row of data, just out of curiosity, and it is extremely slow; I may have to stop it. Just watching it for a moment: it is cycling between 1 thread and 24 threads in use as it runs. But even when 24 threads are going it is only using 524% CPU (out of a possible 2400%). I think that given your particular CPU and its cache, there is probably an optimal data chunk/shard size for best efficiency? But I’m just guessing really!

I’ll run the most efficient model from the data and see what it does.

Edit: yes, I just loaded one of the more efficient models (order 5, shards 500), and the % in use is cycling up to 1250%. So that’s better, but still more efficiency can maybe be gained :)


#56

Tuning wrt number of shards? Hm… interesting to see what that number will depend on! Maybe a mix of hardware and how costly likelihood evaluations are?

Also need to figure out how to pad out the data and parameters in a nice way, in order to handle data/parameter structures that don’t match shard sizes. Probably mostly an index-tinkering job, though.
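That index tinkering might look roughly like this (a Python sketch of the idea, not the eventual Stan code): pad the rows out to a multiple of the shard count, and pass along the real row count per shard so the padded entries can be skipped in the likelihood.

```python
# Sketch: split rows into n_shards equal-size shards, padding the tail.
# Returns (shards, real_counts) so the likelihood can ignore the padding.
def pad_and_shard(rows, n_shards, pad_value=0):
    per_shard = -(-len(rows) // n_shards)  # ceil(len(rows) / n_shards)
    padded = rows + [pad_value] * (per_shard * n_shards - len(rows))
    shards = [padded[i * per_shard:(i + 1) * per_shard]
              for i in range(n_shards)]
    real = [max(0, min(per_shard, len(rows) - i * per_shard))
            for i in range(n_shards)]
    return shards, real

shards, real = pad_and_shard(list(range(10)), 4)
# 4 shards of 3 rows each; the last shard holds 1 real row and 2 pads
```

In the map_rect pattern, real_counts would travel in the per-shard integer data so each shard only sums the log-likelihood over its genuine rows.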

Very promising results, though. @jroon, if you have time, it would be very interesting to see similar runs for the ordered probit (i.e. a model with a slightly more costly likelihood).

Edit:

Yes - going for linear speedup in the number of cores :-) Would be cool to be able to run arbitrarily large models on AWS.


#57

Which level of cache is most important?
Mine has L1 1.125MB, L2 6MB, L3 32MB.
It seems the optimal shard size based on the above is about 0.4MB ?

@Ole_Petter_Hansen - sure I’ll set it going now although I may not get checking it until tomorrow


#58

Ok @Ole_Petter_Hansen - here is the probit model:

time_in_ms order shards time_linear ratio
5,491 3 4 7,608 0.722
2,938 3 12 7,608 0.386
5,388 3 24 7,608 0.708
5,884 3 50 7,608 0.773
6,340 3 100 7,608 0.833
6,458 3 200 7,608 0.849
6,240 3 500 7,608 0.820
6,647 3 1,000 7,608 0.874
26,598 4 4 32,654 0.815
10,923 4 12 32,654 0.335
9,267 4 24 32,654 0.284
8,138 4 50 32,654 0.249
7,212 4 100 32,654 0.221
7,935 4 200 32,654 0.243
8,412 4 500 32,654 0.258
7,013 4 1,000 32,654 0.215
195,838 5 4 251,161 0.780
94,821 5 12 251,161 0.378
61,986 5 24 251,161 0.247
53,757 5 50 251,161 0.214
41,923 5 100 251,161 0.167
44,548 5 200 251,161 0.177
38,911 5 500 251,161 0.155
42,513 5 1,000 251,161 0.169

The data filesizes are order3 = 195kB, order4 = 1.9MB, order5 = 19.5MB.
So, by a back-of-envelope calculation, the most efficient shard size seems to be 19.5MB / 100 ≈ 200kB for this model on my machine. Unless I’m making horrific assumptions, that is, which I surely am.
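Spelling out that arithmetic (the "best" shard counts are just read off the two benchmark tables, not derived):

```python
# Approximate per-shard data size, assuming the file splits evenly.
def shard_size_kb(data_size_mb, n_shards):
    return data_size_mb * 1024.0 / n_shards

# probit, order 5: 19.5 MB over 100 shards -> ~200 kB per shard
# linear, order 5: 193.4 MB over 500 shards -> ~400 kB per shard
print(round(shard_size_kb(19.5, 100)))   # 200
print(round(shard_size_kb(193.4, 500)))  # 396
```

Both "best" shard sizes land within a factor of two of each other, which at least doesn’t contradict the cache-size hypothesis.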


#59

Your results in figures. Food for thought!

Both models stop at a speedup factor of around 6, but the more complex model gets there faster.



Edit:
Q1: Is the data size the limiting factor for speedups? Would adding a few more orders of magnitude send the speedup up towards the number of cores?

Q2: I tried above to test models with large trees. Can higher speedup factors be reached with smaller datasets when the tree size is large (and shards >> n_cores)?


#60

What should be doable is that users provide very granular jobs (many shards), and map_rect automatically schedules the right chunk size via some clever scheduler. This would make a good issue to keep track of the idea.