Parallel autodiff v4

Yikes, I had to rerun the entire thing because the single-core ODE runs blew through the resource limits I had set. I then found out that longer runs can only be scheduled on a different CPU architecture on our cluster. So the results below are now from a newer and faster CPU (Intel® Xeon® CPU E5-2640 v4 @ 2.40GHz).

Here are the mean results (runtime is now in minutes):

| method | solver | J | cores | runs | mean_speedup | mean_runtime (min) |
|---|---|---|---|---|---|---|
| map_rect | analytic | 64 | 1 | 5 | 1.00 | 0.77 |
| map_rect | analytic | 64 | 2 | 5 | 1.83 | 0.42 |
| map_rect | analytic | 64 | 4 | 5 | 2.81 | 0.27 |
| map_rect | analytic | 64 | 8 | 5 | 3.75 | 0.21 |
| map_rect | analytic | 64 | 16 | 5 | 3.74 | 0.21 |
| map_rect | analytic | 128 | 1 | 5 | 1.00 | 1.04 |
| map_rect | analytic | 128 | 2 | 5 | 1.69 | 0.62 |
| map_rect | analytic | 128 | 4 | 5 | 2.96 | 0.35 |
| map_rect | analytic | 128 | 8 | 5 | 3.92 | 0.27 |
| map_rect | analytic | 128 | 16 | 5 | 3.70 | 0.28 |
| map_rect | matrixExp | 64 | 1 | 5 | 1.00 | 1.88 |
| map_rect | matrixExp | 64 | 2 | 5 | 1.90 | 0.99 |
| map_rect | matrixExp | 64 | 4 | 5 | 3.53 | 0.54 |
| map_rect | matrixExp | 64 | 8 | 5 | 6.50 | 0.29 |
| map_rect | matrixExp | 64 | 16 | 5 | 11.51 | 0.16 |
| map_rect | matrixExp | 128 | 1 | 5 | 1.00 | 4.03 |
| map_rect | matrixExp | 128 | 2 | 5 | 1.97 | 2.04 |
| map_rect | matrixExp | 128 | 4 | 5 | 3.87 | 1.04 |
| map_rect | matrixExp | 128 | 8 | 5 | 6.97 | 0.58 |
| map_rect | matrixExp | 128 | 16 | 5 | 12.03 | 0.34 |
| map_rect | ODE | 64 | 1 | 5 | 1.00 | 154.12 |
| map_rect | ODE | 64 | 2 | 5 | 1.86 | 82.88 |
| map_rect | ODE | 64 | 4 | 5 | 3.39 | 45.49 |
| map_rect | ODE | 64 | 8 | 5 | 5.67 | 27.21 |
| map_rect | ODE | 64 | 16 | 5 | 6.90 | 22.36 |
| map_rect | ODE | 128 | 1 | 5 | 1.00 | 286.29 |
| map_rect | ODE | 128 | 2 | 5 | 1.90 | 150.61 |
| map_rect | ODE | 128 | 4 | 5 | 3.65 | 78.59 |
| map_rect | ODE | 128 | 8 | 5 | 6.11 | 46.88 |
| map_rect | ODE | 128 | 16 | 5 | 7.21 | 39.71 |
| reduce_sum | analytic | 64 | 1 | 5 | 1.00 | 0.65 |
| reduce_sum | analytic | 64 | 2 | 5 | 1.83 | 0.36 |
| reduce_sum | analytic | 64 | 4 | 5 | 2.99 | 0.22 |
| reduce_sum | analytic | 64 | 8 | 5 | 4.09 | 0.16 |
| reduce_sum | analytic | 64 | 16 | 5 | 3.99 | 0.16 |
| reduce_sum | analytic | 128 | 1 | 5 | 1.00 | 0.87 |
| reduce_sum | analytic | 128 | 2 | 5 | 1.91 | 0.46 |
| reduce_sum | analytic | 128 | 4 | 5 | 3.25 | 0.27 |
| reduce_sum | analytic | 128 | 8 | 5 | 4.03 | 0.22 |
| reduce_sum | analytic | 128 | 16 | 5 | 4.19 | 0.21 |
| reduce_sum | matrixExp | 64 | 1 | 5 | 1.00 | 2.85 |
| reduce_sum | matrixExp | 64 | 2 | 5 | 2.58 | 1.11 |
| reduce_sum | matrixExp | 64 | 4 | 5 | 6.06 | 0.47 |
| reduce_sum | matrixExp | 64 | 8 | 5 | 8.62 | 0.33 |
| reduce_sum | matrixExp | 64 | 16 | 5 | 18.81 | 0.15 |
| reduce_sum | matrixExp | 128 | 1 | 5 | 1.00 | 3.65 |
| reduce_sum | matrixExp | 128 | 2 | 5 | 1.95 | 1.87 |
| reduce_sum | matrixExp | 128 | 4 | 5 | 3.78 | 0.97 |
| reduce_sum | matrixExp | 128 | 8 | 5 | 6.55 | 0.56 |
| reduce_sum | matrixExp | 128 | 16 | 5 | 11.88 | 0.31 |
| reduce_sum | ODE | 64 | 1 | 5 | 1.00 | 155.73 |
| reduce_sum | ODE | 64 | 2 | 5 | 2.18 | 72.22 |
| reduce_sum | ODE | 64 | 4 | 5 | 3.43 | 45.63 |
| reduce_sum | ODE | 64 | 8 | 5 | 5.80 | 26.92 |
| reduce_sum | ODE | 64 | 16 | 5 | 6.95 | 22.43 |
| reduce_sum | ODE | 128 | 1 | 5 | 1.00 | 252.76 |
| reduce_sum | ODE | 128 | 2 | 5 | 1.76 | 144.26 |
| reduce_sum | ODE | 128 | 4 | 5 | 3.61 | 71.26 |
| reduce_sum | ODE | 128 | 8 | 5 | 4.91 | 51.70 |
| reduce_sum | ODE | 128 | 16 | 5 | 6.34 | 40.73 |

What is really cool to see is that running times of more than 4h (the single-core ODE runs with J=128, at 253 to 286 minutes) go down to well below one hour, about 40 minutes on 16 cores.

It is a bit surprising to see the speedup of the ODE runs level off at around 6x to 7x on 16 cores, but when looking at the decrease in wall time, things look sensible to me.
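To make the speedup versus wall-time comparison concrete, here is a small Python sketch that recomputes the numbers for one configuration (map_rect, ODE, J=128) from the mean runtimes in the table. The function names are made up for illustration, and it assumes speedup is the single-core mean runtime divided by the N-core mean runtime; the `mean_speedup` column may have been averaged per run, so small rounding differences are to be expected.

```python
# Mean runtimes in minutes for map_rect, ODE solver, J = 128 (copied from the table).
runtimes_min = {1: 286.29, 2: 150.61, 4: 78.59, 8: 46.88, 16: 39.71}


def speedup(runtimes, cores):
    """Single-core runtime divided by the runtime on `cores` cores (assumed definition)."""
    return runtimes[1] / runtimes[cores]


def efficiency(runtimes, cores):
    """Speedup per core; 1.0 would be perfect linear scaling."""
    return speedup(runtimes, cores) / cores


def minutes_saved(runtimes, cores):
    """Absolute wall-time reduction relative to the single-core run."""
    return runtimes[1] - runtimes[cores]


for cores in sorted(runtimes_min):
    print(
        f"{cores:2d} cores: "
        f"speedup {speedup(runtimes_min, cores):5.2f}, "
        f"efficiency {efficiency(runtimes_min, cores):4.2f}, "
        f"saved {minutes_saved(runtimes_min, cores):6.1f} min"
    )
```

Under these assumptions the per-core efficiency falls as cores are added, while most of the absolute wall-time saving is already realized by 8 cores; going from 8 to 16 cores saves only about 7 more minutes for this configuration.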

I definitely want this in Stan soon!
