Parallel autodiff v4

wds15 · March 18, 2020, 8:53am

Ugh, I had to rerun the entire thing because the single-core ODE runs blew through the resource limits I had set. I then found out that longer runs can only run on a different CPU architecture on our cluster. So the results below are now from a newer and faster CPU (Intel® Xeon® CPU E5-2640 v4 @ 2.40GHz).

And the mean results (time is now in minutes):

method	solver	J	cores	runs	mean_speedup	mean_runtime
map_rect	analytic	64	1	5	1.00	0.77
map_rect	analytic	64	2	5	1.83	0.42
map_rect	analytic	64	4	5	2.81	0.27
map_rect	analytic	64	8	5	3.75	0.21
map_rect	analytic	64	16	5	3.74	0.21
map_rect	analytic	128	1	5	1.00	1.04
map_rect	analytic	128	2	5	1.69	0.62
map_rect	analytic	128	4	5	2.96	0.35
map_rect	analytic	128	8	5	3.92	0.27
map_rect	analytic	128	16	5	3.70	0.28
map_rect	matrixExp	64	1	5	1.00	1.88
map_rect	matrixExp	64	2	5	1.90	0.99
map_rect	matrixExp	64	4	5	3.53	0.54
map_rect	matrixExp	64	8	5	6.50	0.29
map_rect	matrixExp	64	16	5	11.51	0.16
map_rect	matrixExp	128	1	5	1.00	4.03
map_rect	matrixExp	128	2	5	1.97	2.04
map_rect	matrixExp	128	4	5	3.87	1.04
map_rect	matrixExp	128	8	5	6.97	0.58
map_rect	matrixExp	128	16	5	12.03	0.34
map_rect	ODE	64	1	5	1.00	154.12
map_rect	ODE	64	2	5	1.86	82.88
map_rect	ODE	64	4	5	3.39	45.49
map_rect	ODE	64	8	5	5.67	27.21
map_rect	ODE	64	16	5	6.90	22.36
map_rect	ODE	128	1	5	1.00	286.29
map_rect	ODE	128	2	5	1.90	150.61
map_rect	ODE	128	4	5	3.65	78.59
map_rect	ODE	128	8	5	6.11	46.88
map_rect	ODE	128	16	5	7.21	39.71
reduce_sum	analytic	64	1	5	1.00	0.65
reduce_sum	analytic	64	2	5	1.83	0.36
reduce_sum	analytic	64	4	5	2.99	0.22
reduce_sum	analytic	64	8	5	4.09	0.16
reduce_sum	analytic	64	16	5	3.99	0.16
reduce_sum	analytic	128	1	5	1.00	0.87
reduce_sum	analytic	128	2	5	1.91	0.46
reduce_sum	analytic	128	4	5	3.25	0.27
reduce_sum	analytic	128	8	5	4.03	0.22
reduce_sum	analytic	128	16	5	4.19	0.21
reduce_sum	matrixExp	64	1	5	1.00	2.85
reduce_sum	matrixExp	64	2	5	2.58	1.11
reduce_sum	matrixExp	64	4	5	6.06	0.47
reduce_sum	matrixExp	64	8	5	8.62	0.33
reduce_sum	matrixExp	64	16	5	18.81	0.15
reduce_sum	matrixExp	128	1	5	1.00	3.65
reduce_sum	matrixExp	128	2	5	1.95	1.87
reduce_sum	matrixExp	128	4	5	3.78	0.97
reduce_sum	matrixExp	128	8	5	6.55	0.56
reduce_sum	matrixExp	128	16	5	11.88	0.31
reduce_sum	ODE	64	1	5	1.00	155.73
reduce_sum	ODE	64	2	5	2.18	72.22
reduce_sum	ODE	64	4	5	3.43	45.63
reduce_sum	ODE	64	8	5	5.80	26.92
reduce_sum	ODE	64	16	5	6.95	22.43
reduce_sum	ODE	128	1	5	1.00	252.76
reduce_sum	ODE	128	2	5	1.76	144.26
reduce_sum	ODE	128	4	5	3.61	71.26
reduce_sum	ODE	128	8	5	4.91	51.70
reduce_sum	ODE	128	16	5	6.34	40.73

What is really cool to see is that running times of more than 4 hours (the single-core ODE runs with J = 128) go down to well below one hour (about 40 minutes on 16 cores).

It is a bit surprising to see the speedup of the ODE runs level off (at roughly 6-7x on 16 cores), but looking at the decrease in wall time, things look sensible to me.
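To make the "speedup levels off, but wall time still drops" point concrete, here is a small Python sketch that works through the reduce_sum / ODE / J = 128 rows of the table. The numbers are copied from the table; the helper names (`parallel_efficiency`, `karp_flatt`, `summarize`) are made up for illustration and are not part of the benchmark scripts. The Karp-Flatt metric is a standard way to estimate the serial fraction from a measured speedup; using it here is my own choice, not something the benchmark itself reports.

```python
# Rows copied from the table: (cores, mean_speedup, mean_runtime in minutes)
# for method=reduce_sum, solver=ODE, J=128.
rows = [
    (1, 1.00, 252.76),
    (2, 1.76, 144.26),
    (4, 3.61, 71.26),
    (8, 4.91, 51.70),
    (16, 6.34, 40.73),
]


def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / cores


def karp_flatt(speedup, cores):
    """Experimentally determined serial fraction (Karp-Flatt metric).

    Undefined for a single core, so None is returned in that case.
    """
    if cores == 1:
        return None
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


def summarize(rows):
    """Derive per-row scaling figures relative to the single-core row."""
    base_runtime = rows[0][2]
    summary = []
    for cores, speedup, runtime in rows:
        summary.append({
            "cores": cores,
            "efficiency": parallel_efficiency(speedup, cores),
            "serial_fraction": karp_flatt(speedup, cores),
            # Absolute wall time saved compared to one core, in minutes.
            "minutes_saved": base_runtime - runtime,
            # Ratio of the mean runtimes; this need not match the reported
            # mean_speedup exactly if that column was averaged per run.
            "ratio_of_means": base_runtime / runtime,
        })
    return summary


summary = summarize(rows)
for s in summary:
    print(s)
```

On these rows the efficiency falls to about 0.4 at 16 cores, yet the run still finishes roughly 212 minutes earlier than on a single core, which is the sense in which the wall-time numbers look sensible despite the flattening speedup.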

I definitely want this in Stan soon!
