Linear, parallel regression


#41

This is great, thank you. Not linear speedup yet, but closer! 😀

I’ll post here when I have written up a more complex model. I’m teaching this week, so might take a few days.


#42

It gets slower using 2 shards and faster using 4. How can that be?


#43

I guess there must be a fixed time cost of splitting up the data (in addition to the map_rect call itself)?

Maybe there is a nice DGP that explains the patterns in the tables above, as a function of the number of shards and the linear computing time?
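One candidate (purely an assumption sketched in Python, not fitted to the tables) is a fixed per-shard overhead plus the linear work split across however many shards can actually run in parallel:

```python
# Toy cost model (an assumption, not measured from the benchmarks):
# total time = per-shard splitting overhead + linear work divided across
# the usable cores.
def predicted_time(t_linear, n_shards, n_cores, overhead_per_shard):
    return overhead_per_shard * n_shards + t_linear / min(n_shards, n_cores)

# e.g. 4 cores, overhead 1ms/shard, 100ms of linear work:
print(predicted_time(100, 1, 4, 1))  # 101.0
print(predicted_time(100, 4, 4, 1))  # 29.0
print(predicted_time(100, 8, 4, 1))  # 33.0
```

Interestingly, under this model "slower at 2 shards but faster at 4" is impossible (2 shards losing to 1 requires overhead > half the linear time, which makes 4 shards lose as well), so the pattern in #42 would suggest the splitting overhead is not simply proportional to the shard count.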


#44

Nice to see all these benchmarks. There is an additional case you may want to follow up on: for large Stan programs it can be beneficial to use map_rect even when running on only a single core. So if you compile Stan without -DSTAN_THREADS and thus use only a single core, but still split the data into multiple shards, you should also see speedups (only for large problems). At least I saw up to a 2x speed increase when doing so, and from what I have seen this effect gets more pronounced the larger the problem is. The reason seems to be (I only have an intuitive understanding here) that the AD tree then gets evaluated in smaller chunks, which fit better into the CPU caches. So it’s better to calculate many small AD trees than a single big one.


#45

Here are some simple results. Note though that these are relatively short runs, so there might be measurement error. I attach the bash-script in case @jroon wants to give it a go.

@wds15: this is from the ordinal probit. Do you think (i) models with more operations in the likelihood evaluation, or (ii) models with higher treedepth could benefit from map_rect on a single core?
Edit: I’m thinking specifically of e.g. horseshoe-shrinkage models, which require adapt_delta=.999 and hence a high treedepth to avoid divergences. I.e. the data/parameters might not be very large, but there are lots of AD evaluations.

time_in_ms order shards NUM_THREADS time_linear ratio
1,310 2 10 -1 991 1.322
1,068 2 10 1 991 1.078
633 2 5 -1 991 0.639
1,189 2 5 1 991 1.200
641 2 4 -1 991 0.647
1,075 2 4 1 991 1.085
819 2 2 -1 991 0.826
1,207 2 2 1 991 1.218
1,212 2 1 -1 991 1.223
1,135 2 1 1 991 1.145
2,509 3 10 -1 5,686 0.441
6,242 3 10 1 5,686 1.098
1,957 3 5 -1 5,686 0.344
6,787 3 5 1 5,686 1.194
2,506 3 4 -1 5,686 0.441
7,350 3 4 1 5,686 1.293
5,490 3 2 -1 5,686 0.966
7,072 3 2 1 5,686 1.244
7,185 3 1 -1 5,686 1.264
7,179 3 1 1 5,686 1.263
23,105 4 10 -1 43,363 0.533
53,215 4 10 1 43,363 1.227
26,088 4 5 -1 43,363 0.602
54,945 4 5 1 43,363 1.267
29,206 4 4 -1 43,363 0.674
49,209 4 4 1 43,363 1.135
44,859 4 2 -1 43,363 1.034
60,593 4 2 1 43,363 1.397
73,637 4 1 -1 43,363 1.698
67,867 4 1 1 43,363 1.565

#46

Hi!

Maybe more difficult models (large treedepth + adapt_delta) are better for map_rect… at least I would hope so, and I would be very interested; especially in the Finnish horseshoe.

Just to make sure: the runs you just posted used two different binaries, right? So for STAN_NUM_THREADS=-1 you compiled a binary with -DSTAN_THREADS, while for STAN_NUM_THREADS=1 you compiled another binary without -DSTAN_THREADS… right? This is important, as different things end up in the binary depending on how it is compiled.


#47

No, they were with the same binary…

I’ll take a look at Vehtari’s shrinkage model.


#48

Sorry guys, I’m suddenly busy dealing with other stuff. I’ll tune back in in a few days!


#49

Finally weekend - and time for more benchmarking.

Attached is @avehtari’s horseshoe: both the version in the paper and a parallel implementation of the same model. I generate a small dataset (N=100, K=90) with 9 active regressors and very uninformative priors. Then I run the model several times with both the parallel and the single-core version, with adapt_delta=.9999 for all runs but with differing max_depth. The idea is that all runs will bounce against the max treedepth, and we can then see what happens to the time used by the single/parallel version with increasing tree size. I use 10 shards and STAN_NUM_THREADS=-1 for the parallel runs.

Results are not impressive for the parallel model. These are short runs, so noise and all, but overall it is slower than the single-core version. Does anyone have an idea why this seems to be the case?

minttu_horseshoe.stan (1.3 KB)
minttu_horseshoe_par.stan (2.1 KB)
gen_dat.r (568 Bytes)
start.txt (890 Bytes)

N_warmup N_samples max_tree time_parallel_ms time_linear_ms ratio
400 400 5 4,147 1,114 3.723
333 333 6 6,932 1,856 3.735
285 285 7 11,513 2,950 3.903
250 250 8 20,010 5,150 3.885
222 222 9 35,809 8,932 4.009
200 200 10 63,571 15,370 4.136
181 181 11 102,217 17,035 6.000
166 166 12 80,258 42,733 1.878
153 153 13 290,967 74,095 3.927

post.r (380 Bytes)

Edit: This is a small dataset and a simple model, so the single-core model is hard to beat. However, I had a small hope that some of the map_rect overhead would be paid once per iteration, such that higher treedepths would show an improvement for parallel runs.


#50

My guess would be that with these horseshoe models the sampler has to work a lot, and as map_rect only speeds up the log-prob and its gradient evaluation, that is no longer the major work for such Stan programs. Thus, map_rect should only pay off if the gradient evaluation takes long (it needs to be the dominating performance bottleneck) in the single-core run, so maybe start watching the gradient evaluation time printed at the beginning of each program.

Still… for large datasets, the parallel version hopefully performs better (or at least allows the performance to scale).
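The point above, that map_rect only accelerates the log-prob/gradient evaluation, is just Amdahl’s law: the serial share of each iteration caps the achievable speedup. A minimal sketch (the fractions are illustrative assumptions, not measurements from the horseshoe runs):

```python
# Amdahl's law: if only a fraction p of the per-iteration work is
# parallelised (here: the gradient evaluation inside map_rect), the
# speedup on n cores is capped at 1 / (1 - p), no matter how large n is.
def amdahl_speedup(p, n_cores):
    return 1.0 / ((1.0 - p) + p / n_cores)

# Illustrative: gradient = 95% of the work -> ~11x max on 24 threads;
# gradient = 50% of the work -> never more than 2x on any machine.
print(round(amdahl_speedup(0.95, 24), 1))     # 11.2
print(round(amdahl_speedup(0.50, 10**6), 2))  # 2.0
```

This would explain why the horseshoe, where the sampler’s own work dominates, shows no benefit while the large linear regressions do.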


#51

So I decided to try some runs with higher shard numbers. Check out these results for the linear model we were using earlier. Note I fixed K = 100 for all the data files, and all of these were run with STAN_NUM_THREADS=-1 on the 12-core/24-thread desktop.

time_in_ms order shards time_linear ratio
5,526 3 4 2,775 1.991
9,320 3 12 2,775 3.359
16,217 3 24 2,775 5.844
8,885 3 50 2,775 3.202
13,468 3 100 2,775 4.853
15,072 3 200 2,775 5.431
13,279 3 500 2,775 4.785
26,668 3 1,000 2,775 9.610
42,339 4 4 67,159 0.630
63,825 4 12 67,159 0.950
33,780 4 24 67,159 0.503
22,363 4 50 67,159 0.333
35,656 4 100 67,159 0.531
20,535 4 200 67,159 0.306
26,803 4 500 67,159 0.399
26,550 4 1,000 67,159 0.395
680,906 5 4 561,257 1.213
2,389,357 5 12 561,257 4.257
1,023,262 5 24 561,257 1.823
520,853 5 50 561,257 0.928
235,722 5 100 561,257 0.420
409,437 5 200 561,257 0.730
73,767 5 500 561,257 0.131
84,353 5 1,000 561,257 0.150

I glanced at the CPU usage the odd time while these were running, and I noted the faster runs were using more of the available CPU power, maybe 90%. The less efficient ones were at maybe 30% CPU usage (spread across all cores, of course).
Note that the data file sizes for the different orders were: order 3 = 1.9MB, order 4 = 19.3MB, order 5 = 193.4MB.


#52

Cool!!! Which model is this again?

Your results appear to show when map_rect is good… with lots of data, as I suspected. It is nice to see the scaling of Stan’s performance.


#53

…so this means shard counts WAY in excess of the number of threads give the best results!?

I’m not getting how the overhead works here.


#54

The more shards, the smaller the nested AD tree we are evaluating. This yields better cache usage. I would speculate that there is an optimal chunk size. This is obviously balanced against the need for large overall workloads. Interesting… hopefully we can understand this better and do some automatic tuning eventually.
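That automatic tuning could, in principle, look like the sketch below (hypothetical logic, not Stan’s actual map_rect implementation): treat the user’s shards as granular jobs and regroup them into chunks near some cache-friendly target size.

```python
# Hypothetical chunking scheduler: the user supplies many small shards,
# and the runtime groups consecutive job indices into chunks of at most
# jobs_per_chunk each (the last chunk may be smaller).
def make_chunks(n_jobs, jobs_per_chunk):
    return [list(range(start, min(start + jobs_per_chunk, n_jobs)))
            for start in range(0, n_jobs, jobs_per_chunk)]

print(make_chunks(10, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A real scheduler would pick jobs_per_chunk from the cache size and the per-job cost, which is exactly the tuning question raised above.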


#55

This is the linear model that Ole wrote above. Reposting here for clarity:
reg_par.stan (986 Bytes)

I think it is as much about the memory size of the shards. I’m currently running a model with 100,000 shards, i.e. one shard per row of data, just out of curiosity, and it is extremely slow; I may have to stop it. Just watching it for a moment: it is cycling between 1 thread and 24 threads in use as it runs. But even when 24 threads are going it is only using 524% CPU (out of a possible 2400%). I think that given your particular CPU and its cache, there is probably an optimal data chunk/shard size for best efficiency? But I’m just guessing really!

I’ll run the most efficient model from the data and see what it does.

Edit: yes, I just loaded one of the more efficient models (order 5, shards 500), and the % in use is cycling up to 1250%. So that’s better, but still more efficiency can maybe be gained :)


#56

Tuning wrt number of shards? Hm… interesting to see what that number will depend on! Maybe a mix of hardware and how costly likelihood evaluations are?

Also need to figure out how to pad out the data and parameters in a nice way, in order to handle data/parameter structures that don’t match shard sizes. Probably mostly an index-tinkering job, though.
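That index tinkering might look roughly like this (a Python sketch of the idea, not the eventual Stan code): pad the rows out to a multiple of the shard count, and pass along the real row count per shard so the padded entries can be skipped in the likelihood.

```python
# Sketch: split rows into n_shards equal-size shards, padding the tail.
# Returns (shards, real_counts) so the likelihood can ignore the padding.
def pad_and_shard(rows, n_shards, pad_value=0):
    per_shard = -(-len(rows) // n_shards)  # ceil(len(rows) / n_shards)
    padded = rows + [pad_value] * (per_shard * n_shards - len(rows))
    shards = [padded[i * per_shard:(i + 1) * per_shard]
              for i in range(n_shards)]
    real = [max(0, min(per_shard, len(rows) - i * per_shard))
            for i in range(n_shards)]
    return shards, real

shards, real = pad_and_shard(list(range(10)), 4)
# 4 shards of 3 rows each; the last shard holds 1 real row and 2 pads
```

In the map_rect pattern, real_counts would travel in the per-shard integer data so each shard only sums the log-likelihood over its genuine rows.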

Very promising results, though. @jroon, if you have time, it would be very interesting to see similar runs for the ordered probit (i.e. a model with a slightly more costly likelihood).

Edit:

Yes - going for linear speedup in the number of cores :-) Would be cool to be able to run arbitrarily large models on AWS.


#57

Which level of cache is most important?
Mine has L1 1.125MB, L2 6MB, L3 32MB.
It seems the optimal shard size based on the above is about 0.4MB ?

@Ole_Petter_Hansen - sure I’ll set it going now although I may not get checking it until tomorrow


#58

Ok @Ole_Petter_Hansen - here is the probit model:

time_in_ms order shards time_linear ratio
5,491 3 4 7,608 0.722
2,938 3 12 7,608 0.386
5,388 3 24 7,608 0.708
5,884 3 50 7,608 0.773
6,340 3 100 7,608 0.833
6,458 3 200 7,608 0.849
6,240 3 500 7,608 0.820
6,647 3 1,000 7,608 0.874
26,598 4 4 32,654 0.815
10,923 4 12 32,654 0.335
9,267 4 24 32,654 0.284
8,138 4 50 32,654 0.249
7,212 4 100 32,654 0.221
7,935 4 200 32,654 0.243
8,412 4 500 32,654 0.258
7,013 4 1,000 32,654 0.215
195,838 5 4 251,161 0.780
94,821 5 12 251,161 0.378
61,986 5 24 251,161 0.247
53,757 5 50 251,161 0.214
41,923 5 100 251,161 0.167
44,548 5 200 251,161 0.177
38,911 5 500 251,161 0.155
42,513 5 1,000 251,161 0.169

The data filesizes are order3 = 195kB, order4 = 1.9MB, order5 = 19.5MB.
So, by a back-of-envelope calculation, the most efficient shard size seems to be 19.5MB / 100 ≈ 200kB for this model on my machine. Unless I’m making horrific assumptions, that is, which I surely am.
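Spelling out that arithmetic (the "best" shard counts are just read off the two benchmark tables, not derived):

```python
# Approximate per-shard data size, assuming the file splits evenly.
def shard_size_kb(data_size_mb, n_shards):
    return data_size_mb * 1024.0 / n_shards

# probit, order 5: 19.5 MB over 100 shards -> ~200 kB per shard
# linear, order 5: 193.4 MB over 500 shards -> ~400 kB per shard
print(round(shard_size_kb(19.5, 100)))   # 200
print(round(shard_size_kb(193.4, 500)))  # 396
```

Both "best" shard sizes land within a factor of two of each other, which at least doesn’t contradict the cache-size hypothesis.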


#59

Your results in figures. Food for thought!

Both models stop at a speedup factor of around 6, but the more complex model gets there faster.



Edit:
Q1: Is the data size the limiting factor for speedups? Would adding a few more orders of magnitude send the speedup up towards the number of cores?

Q2: I tried above to test models with large trees. Can higher speedup factors be reached with smaller datasets when the tree size is large (and shards >> n_cores)?


#60

What should be doable is that users provide very granular jobs (many shards), and map_rect automatically schedules the right chunk size via some clever scheduler. This would make a good issue to keep track of the idea.