Yack - I had to rerun the entire thing, as the single-core ODE runs blew the resource limits I had set. I then found out that longer runs can only go to a different CPU architecture on our cluster, so the results below are from a newer and faster CPU (Intel® Xeon® CPU E5-2640 v4 @ 2.40GHz).
And the mean results (runtime is now in minutes):
method | solver | J | cores | runs | mean_speedup | mean_runtime (min) |
---|---|---|---|---|---|---|
map_rect | analytic | 64 | 1 | 5 | 1.00 | 0.77 |
map_rect | analytic | 64 | 2 | 5 | 1.83 | 0.42 |
map_rect | analytic | 64 | 4 | 5 | 2.81 | 0.27 |
map_rect | analytic | 64 | 8 | 5 | 3.75 | 0.21 |
map_rect | analytic | 64 | 16 | 5 | 3.74 | 0.21 |
map_rect | analytic | 128 | 1 | 5 | 1.00 | 1.04 |
map_rect | analytic | 128 | 2 | 5 | 1.69 | 0.62 |
map_rect | analytic | 128 | 4 | 5 | 2.96 | 0.35 |
map_rect | analytic | 128 | 8 | 5 | 3.92 | 0.27 |
map_rect | analytic | 128 | 16 | 5 | 3.70 | 0.28 |
map_rect | matrixExp | 64 | 1 | 5 | 1.00 | 1.88 |
map_rect | matrixExp | 64 | 2 | 5 | 1.90 | 0.99 |
map_rect | matrixExp | 64 | 4 | 5 | 3.53 | 0.54 |
map_rect | matrixExp | 64 | 8 | 5 | 6.50 | 0.29 |
map_rect | matrixExp | 64 | 16 | 5 | 11.51 | 0.16 |
map_rect | matrixExp | 128 | 1 | 5 | 1.00 | 4.03 |
map_rect | matrixExp | 128 | 2 | 5 | 1.97 | 2.04 |
map_rect | matrixExp | 128 | 4 | 5 | 3.87 | 1.04 |
map_rect | matrixExp | 128 | 8 | 5 | 6.97 | 0.58 |
map_rect | matrixExp | 128 | 16 | 5 | 12.03 | 0.34 |
map_rect | ODE | 64 | 1 | 5 | 1.00 | 154.12 |
map_rect | ODE | 64 | 2 | 5 | 1.86 | 82.88 |
map_rect | ODE | 64 | 4 | 5 | 3.39 | 45.49 |
map_rect | ODE | 64 | 8 | 5 | 5.67 | 27.21 |
map_rect | ODE | 64 | 16 | 5 | 6.90 | 22.36 |
map_rect | ODE | 128 | 1 | 5 | 1.00 | 286.29 |
map_rect | ODE | 128 | 2 | 5 | 1.90 | 150.61 |
map_rect | ODE | 128 | 4 | 5 | 3.65 | 78.59 |
map_rect | ODE | 128 | 8 | 5 | 6.11 | 46.88 |
map_rect | ODE | 128 | 16 | 5 | 7.21 | 39.71 |
reduce_sum | analytic | 64 | 1 | 5 | 1.00 | 0.65 |
reduce_sum | analytic | 64 | 2 | 5 | 1.83 | 0.36 |
reduce_sum | analytic | 64 | 4 | 5 | 2.99 | 0.22 |
reduce_sum | analytic | 64 | 8 | 5 | 4.09 | 0.16 |
reduce_sum | analytic | 64 | 16 | 5 | 3.99 | 0.16 |
reduce_sum | analytic | 128 | 1 | 5 | 1.00 | 0.87 |
reduce_sum | analytic | 128 | 2 | 5 | 1.91 | 0.46 |
reduce_sum | analytic | 128 | 4 | 5 | 3.25 | 0.27 |
reduce_sum | analytic | 128 | 8 | 5 | 4.03 | 0.22 |
reduce_sum | analytic | 128 | 16 | 5 | 4.19 | 0.21 |
reduce_sum | matrixExp | 64 | 1 | 5 | 1.00 | 2.85 |
reduce_sum | matrixExp | 64 | 2 | 5 | 2.58 | 1.11 |
reduce_sum | matrixExp | 64 | 4 | 5 | 6.06 | 0.47 |
reduce_sum | matrixExp | 64 | 8 | 5 | 8.62 | 0.33 |
reduce_sum | matrixExp | 64 | 16 | 5 | 18.81 | 0.15 |
reduce_sum | matrixExp | 128 | 1 | 5 | 1.00 | 3.65 |
reduce_sum | matrixExp | 128 | 2 | 5 | 1.95 | 1.87 |
reduce_sum | matrixExp | 128 | 4 | 5 | 3.78 | 0.97 |
reduce_sum | matrixExp | 128 | 8 | 5 | 6.55 | 0.56 |
reduce_sum | matrixExp | 128 | 16 | 5 | 11.88 | 0.31 |
reduce_sum | ODE | 64 | 1 | 5 | 1.00 | 155.73 |
reduce_sum | ODE | 64 | 2 | 5 | 2.18 | 72.22 |
reduce_sum | ODE | 64 | 4 | 5 | 3.43 | 45.63 |
reduce_sum | ODE | 64 | 8 | 5 | 5.80 | 26.92 |
reduce_sum | ODE | 64 | 16 | 5 | 6.95 | 22.43 |
reduce_sum | ODE | 128 | 1 | 5 | 1.00 | 252.76 |
reduce_sum | ODE | 128 | 2 | 5 | 1.76 | 144.26 |
reduce_sum | ODE | 128 | 4 | 5 | 3.61 | 71.26 |
reduce_sum | ODE | 128 | 8 | 5 | 4.91 | 51.70 |
reduce_sum | ODE | 128 | 16 | 5 | 6.34 | 40.73 |
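For reference, the speedup column is just the single-core runtime divided by the runtime at the given core count. A minimal sketch in Python, using the reduce_sum / matrixExp / J=128 rows from the table (the recomputed values can differ slightly from mean_speedup, which I assume was averaged over the 5 individual runs rather than taken as a ratio of mean runtimes):

```python
# Speedup relative to the single-core baseline: speedup(c) = t(1) / t(c).
# Runtimes in minutes, from the reduce_sum / matrixExp / J=128 rows above.
runtimes = {1: 3.65, 2: 1.87, 4: 0.97, 8: 0.56, 16: 0.31}

baseline = runtimes[1]
for cores, t in sorted(runtimes.items()):
    speedup = baseline / t
    efficiency = speedup / cores  # fraction of ideal linear scaling
    print(f"{cores:2d} cores: speedup {speedup:5.2f}, efficiency {efficiency:.2f}")
```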
What is really cool to see is that runtimes of more than 4 hours drop to well below one hour.
It is a bit surprising to see the ODE runs hit a ceiling in terms of speedup, but the decrease in wall time still looks sensible to me.
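One way to make sense of that ceiling is Amdahl's law: if a fraction s of the work stays serial, speedup on n cores is capped at 1/(s + (1-s)/n), approaching 1/s as n grows. A purely illustrative back-of-the-envelope calculation, solving for the serial fraction implied by the observed 16-core map_rect / ODE / J=128 speedup (the real plateau could of course also come from load imbalance, since per-subject ODE solve times vary):

```python
# Amdahl's law: with serial fraction s, speedup on n cores is 1 / (s + (1 - s) / n).
observed = {2: 1.90, 4: 3.65, 8: 6.11, 16: 7.21}  # map_rect / ODE / J=128

def amdahl(s, n):
    return 1.0 / (s + (1.0 - s) / n)

# Solve observed[16] = 1 / (s + (1 - s) / 16) for s.
n = 16
s = (1.0 / observed[n] - 1.0 / n) / (1.0 - 1.0 / n)
print(f"implied serial fraction: {s:.3f}")        # roughly 0.08
print(f"implied max speedup (n -> inf): {1/s:.1f}")  # roughly 12
```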
I definitely want this in Stan soon!