Parallel autodiff v4

Can you do fake data for this model? Or just copy-paste a bunch of subjects?

I dug out some old code which I created for map_rect benchmarking. This code simulates fake data for a 2-cmt PK system with oral dosing. It supports:

  • A flexible number of subjects (J = 128 here)
  • Different methods of solving the problem: solver = 0 for the analytic solution, solver = 1 for the matrix exponential, and solver = 2 for the ODE solver

I am running things now on my 4-core laptop with 1, 2, and 3 cores to avoid competition with the OS processes. This is what I get for 25 warmup / 25 sampling iterations:

rs-solver_0-1.log:real 18.00  1 core
rs-solver_0-2.log:real 9.52    2 cores
rs-solver_0-3.log:real 7.07    3 cores
rs-solver_1-1.log:real 111.75
rs-solver_1-2.log:real 61.86
rs-solver_1-3.log:real 47.39
rs-solver_2-1.log:real 426.15
rs-solver_2-2.log:real 216.53
rs-solver_2-3.log:real 168.70

Doubling to J=256 and only doing analytic solves gives me

rs-solver_0-1.log:real 58.38
rs-solver_0-2.log:real 32.59
rs-solver_0-3.log:real 23.52

To me this is good scaling behaviour.
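(As an aside, the numbers above are just the "real" lines that time -p writes into the logs. A small helper along these lines - purely a sketch, assuming the rs-solver_<solver>-<cores>.log naming above - can pull them out and compute the speedup relative to the 1-core run:

#!/bin/bash
## Hypothetical helper, not part of the original scripts: summarise the "real"
## wall-clock times from the rs-solver_*.log files and report the speedup
## relative to the corresponding 1-core run.
for SOLVER in 0 1 2; do
    BASE=$(grep '^real' rs-solver_${SOLVER}-1.log 2>/dev/null | awk '{print $2}')
    [ -z "$BASE" ] && continue          # skip solvers without a 1-core reference run
    for CORES in 1 2 3; do
        LOG=rs-solver_${SOLVER}-${CORES}.log
        [ -f "$LOG" ] || continue
        REAL=$(grep '^real' "$LOG" | awk '{print $2}')
        printf "solver=%s cores=%s real=%ss speedup=%.2fx\n" \
            "$SOLVER" "$CORES" "$REAL" "$(echo "$BASE / $REAL" | bc -l)"
    done
done
)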

Here are the files to run this:

oral_2cmt_async.stan (15.5 KB)
oral4_stan-base.R (351 Bytes)

And here is the bash script controlling this:

#!/bin/bash

CHAIN=1

CORES=${1-1}

export STAN_NUM_THREADS=$CORES

echo "USING $CORES CPU CORES:"

#J=128       # used for the first batch of runs; overridden by the line below
J=256
NITER=25

## 0 = analytic; 1 = matrix exponential; 2 = ODE
SOLVER=${2-0}

cat <(cat oral4_stan-base.R) <(printf "J <- $J\nmethod <- 0\nsolver <- ${SOLVER}\n") > oral4_stan-$J-0.R
cat <(cat oral4_stan-base.R) <(printf "J <- $J\nmethod <- 1\nsolver <- ${SOLVER}\n") > oral4_stan-$J-1.R

MODEL=./oral_2cmt_async

## reduce_sum version
(time -p $MODEL sample num_samples=$NITER num_warmup=$NITER save_warmup=1 algorithm=hmc stepsize=0.01 adapt delta=0.9 id=$CHAIN  data file=oral4_stan-$J-1.R init=0.1 random seed=10 output refresh=10 file=samples5-$CHAIN-rect-$J-0-solver_${SOLVER}-${CORES}.csv) 2>&1 | tee rs-solver_${SOLVER}-${CORES}.log
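For completeness, the timings above came from invoking this script once per core count and solver. A driver loop along these lines would reproduce the sweep (the name run_oral4.sh is just a placeholder for wherever the script above is saved):

#!/bin/bash
## Hypothetical driver: sweep over the solvers and core counts used above.
for SOLVER in 0 1 2; do      # 0 = analytic, 1 = matrix exponential, 2 = ODE
    for CORES in 1 2 3; do   # leave the 4th core free for the OS
        ./run_oral4.sh $CORES $SOLVER
    done
done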

This looks like a good test model, thanks!

BTW, it could be that the 1-core run is faster than it should be, since processors can apply Turbo Boost when there is not too much other load on the system. Turbo Boost temporarily over-clocks the CPU when needed.

Also: Do you think it would be valuable to throw this onto a cluster and do a more systematic assessment? It’s a bit of work to code that up, but maybe it’s worthwhile.

Yeah. I think it’d be worth the time. Can you make sure before you shoot these off that the NITER=25 timings are reliable enough? Mine were starting to get noisy.

I was surprised that the base thing didn’t scale to 3x for the ODE solve even though it’s taking a lot of time (and we’d expect the overhead to be low because the analytical solution also has that overhead).

In that spirit, I guess I’d like to know what happens with:

  1. More threads running in parallel (as many cores as you have easy access to on one computer)

  2. Many single chains running in parallel (again, up to as many cores as you have on a computer). We’d expect the runtime to roughly match the single chain running alone if memory isn’t an issue

#2 should sorta act like a reference to the limits of our speedups?

Oh good point. I guess we should be comparing N threads against N parallel chains.
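As a concrete sketch of #2 (reusing the CmdStan call from the script above; N and the data file name are placeholders): pin STAN_NUM_THREADS to 1, launch N independent chains in the background, and compare the wall time against one chain using N threads.

#!/bin/bash
## Sketch only: N single-threaded chains running in parallel instead of one
## chain with N threads. N and the file names are placeholders.
export STAN_NUM_THREADS=1
N=4
for CHAIN in $(seq 1 $N); do
    ./oral_2cmt_async sample num_samples=25 num_warmup=25 \
        id=$CHAIN data file=oral4_stan-128-1.R init=0.1 random seed=10 \
        output file=samples-ref-$CHAIN.csv &
done
time -p wait   # total wall time should roughly match a single chain run alone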

Ok. I spent the day setting this up carefully so that more examples could be explored if we want to.

I tried out different grain sizes, but quickly found that grainsize=0 is best for this problem, so I did not vary it.

Here are the comparisons:

In numbers this is:

solver num_subjects grainsize cores runs mean_speedup mean_runtime
analytic 64 0 1 5 1.00 12.04
analytic 64 0 2 5 1.79 6.76
analytic 64 0 4 5 2.74 4.40
analytic 64 0 8 5 4.05 2.97
analytic 64 0 16 5 4.83 2.50
analytic 128 0 1 5 1.00 45.51
analytic 128 0 2 5 1.80 25.41
analytic 128 0 4 5 3.11 14.68
analytic 128 0 8 5 4.74 9.61
analytic 128 0 16 5 5.31 8.57
matrixExp 64 0 1 5 1.00 51.30
matrixExp 64 0 2 5 1.90 27.05
matrixExp 64 0 4 5 3.58 14.33
matrixExp 64 0 8 5 6.51 7.88
matrixExp 64 0 16 5 12.43 4.13
matrixExp 128 0 1 5 1.00 199.85
matrixExp 128 0 2 5 1.96 101.91
matrixExp 128 0 4 5 3.76 53.12
matrixExp 128 0 8 5 7.05 28.36
matrixExp 128 0 16 5 12.41 16.10

Right now a batch with 256 subjects is running - I will post that if it is interesting, but I doubt it will reveal new trends.

The J=128 problem with the matrix exponential is cool: 200s => 16s. The above is with 5 replications per case and I did run 50 warmup / 50 iterations.
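(For the record, mean_runtime in the table is just the average of the "real" wall times over the 5 replicate logs of a case; something like the one-liner below does it - the -rep<k> suffix in the file names is made up for this sketch.)

## Hypothetical: average the "real" wall times over the replicate logs of one case.
grep -h '^real' rs-solver_1-16-rep*.log | awk '{s += $2; n++} END {print s / n}'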


Yeah this is what I was more hoping for lol.

Looks good and it’s all scaling something like we’d hope.

Do we have the map_rect comparison numbers?

Do we have the ode numbers?

Give a finger and they grab the whole arm… things are still running on our cluster, as the single-core ODE runs take an enormous amount of time. Since I have to guess the total runtime in advance, I may need to restart those runs (after the jobs get killed on our cluster for exceeding their pre-specified runtime).

Anyway… in the meantime I can post what I have now. As the single-core runs were not ready for some ODE cases, I just multiplied the 2-core runtime by a factor of 2 to approximate the single-core runtime (which makes the scaling look very good for the 2- and 4-core runs) - this should already give a good idea of things.

method solver num_subjects grainsize cores runs mean_speedup mean_runtime
map_rect analytic 64 0 1 5 1.00 57.04
map_rect analytic 64 0 2 5 1.83 31.15
map_rect analytic 64 0 4 5 3.02 19.07
map_rect analytic 64 0 8 5 4.66 12.26
map_rect analytic 64 0 16 5 5.43 10.51
map_rect analytic 128 0 1 5 1.00 75.44
map_rect analytic 128 0 2 5 1.76 42.95
map_rect analytic 128 0 4 5 3.08 24.50
map_rect analytic 128 0 8 5 4.58 16.47
map_rect analytic 128 0 16 5 5.30 14.23
map_rect matrixExp 64 0 1 5 1.00 125.80
map_rect matrixExp 64 0 2 5 1.90 66.40
map_rect matrixExp 64 0 4 5 3.59 35.08
map_rect matrixExp 64 0 8 5 6.44 19.61
map_rect matrixExp 64 0 16 5 12.10 10.46
map_rect matrixExp 128 0 1 5 1.00 267.16
map_rect matrixExp 128 0 2 5 1.94 137.61
map_rect matrixExp 128 0 4 5 3.70 72.31
map_rect matrixExp 128 0 8 5 7.06 37.86
map_rect matrixExp 128 0 16 5 12.30 21.73
map_rect ODE 64 0 2 5 2.00 5999.23
map_rect ODE 64 0 4 5 3.78 3176.32
map_rect ODE 64 0 8 5 6.55 1833.21
map_rect ODE 64 0 16 5 9.53 1260.13
map_rect ODE 128 0 8 3 8.00 3241.48
map_rect ODE 128 0 16 5 11.64 2228.40
reduce_sum analytic 64 0 1 5 1.00 47.84
reduce_sum analytic 64 0 2 5 1.75 28.03
reduce_sum analytic 64 0 4 5 3.45 13.99
reduce_sum analytic 64 0 8 5 5.68 8.60
reduce_sum analytic 64 0 16 5 5.78 8.64
reduce_sum analytic 128 0 1 5 1.00 62.33
reduce_sum analytic 128 0 2 5 1.85 33.73
reduce_sum analytic 128 0 4 5 3.26 19.25
reduce_sum analytic 128 0 8 5 5.16 12.10
reduce_sum analytic 128 0 16 5 5.89 10.61
reduce_sum matrixExp 64 0 1 5 1.00 201.00
reduce_sum matrixExp 64 0 2 5 2.63 76.59
reduce_sum matrixExp 64 0 4 5 5.41 37.93
reduce_sum matrixExp 64 0 8 5 9.65 20.90
reduce_sum matrixExp 64 0 16 5 19.73 10.29
reduce_sum matrixExp 128 0 1 5 1.00 243.10
reduce_sum matrixExp 128 0 2 5 1.92 126.43
reduce_sum matrixExp 128 0 4 5 3.68 66.21
reduce_sum matrixExp 128 0 8 5 6.90 35.40
reduce_sum matrixExp 128 0 16 5 12.18 19.98
reduce_sum ODE 64 0 2 5 2.02 5937.02
reduce_sum ODE 64 0 4 5 3.87 3069.94
reduce_sum ODE 64 0 8 5 6.36 1870.26
reduce_sum ODE 64 0 16 5 8.77 1355.23
reduce_sum ODE 128 0 8 4 8.03 3037.93
reduce_sum ODE 128 0 16 5 10.95 2227.58

The ODE solver now uses a non-stiff Adams-Moulton solver.

Overall these are nice results to me. I am not sure why we get speedups above the number of cores for 64 patients with reduce_sum and the matrix exponential; maybe this is an artefact, or it may be real because things fit better into the CPU caches (I don’t mind a bias towards better speeds).

If it isn’t too much pain to get the final results and it is easy to update this post, then I will edit it later.

Yack - I had to rerun the entire thing, as the single-core ODE runs blew the resource limits I had set. Then I found out that longer runs can only run on a different CPU architecture on our cluster. Thus, the results below are from a newer and faster CPU (Intel® Xeon® CPU E5-2640 v4 @ 2.40GHz).

And the mean results (time is now in minutes):

method solver J cores runs mean_speedup mean_runtime
map_rect analytic 64 1 5 1.00 0.77
map_rect analytic 64 2 5 1.83 0.42
map_rect analytic 64 4 5 2.81 0.27
map_rect analytic 64 8 5 3.75 0.21
map_rect analytic 64 16 5 3.74 0.21
map_rect analytic 128 1 5 1.00 1.04
map_rect analytic 128 2 5 1.69 0.62
map_rect analytic 128 4 5 2.96 0.35
map_rect analytic 128 8 5 3.92 0.27
map_rect analytic 128 16 5 3.70 0.28
map_rect matrixExp 64 1 5 1.00 1.88
map_rect matrixExp 64 2 5 1.90 0.99
map_rect matrixExp 64 4 5 3.53 0.54
map_rect matrixExp 64 8 5 6.50 0.29
map_rect matrixExp 64 16 5 11.51 0.16
map_rect matrixExp 128 1 5 1.00 4.03
map_rect matrixExp 128 2 5 1.97 2.04
map_rect matrixExp 128 4 5 3.87 1.04
map_rect matrixExp 128 8 5 6.97 0.58
map_rect matrixExp 128 16 5 12.03 0.34
map_rect ODE 64 1 5 1.00 154.12
map_rect ODE 64 2 5 1.86 82.88
map_rect ODE 64 4 5 3.39 45.49
map_rect ODE 64 8 5 5.67 27.21
map_rect ODE 64 16 5 6.90 22.36
map_rect ODE 128 1 5 1.00 286.29
map_rect ODE 128 2 5 1.90 150.61
map_rect ODE 128 4 5 3.65 78.59
map_rect ODE 128 8 5 6.11 46.88
map_rect ODE 128 16 5 7.21 39.71
reduce_sum analytic 64 1 5 1.00 0.65
reduce_sum analytic 64 2 5 1.83 0.36
reduce_sum analytic 64 4 5 2.99 0.22
reduce_sum analytic 64 8 5 4.09 0.16
reduce_sum analytic 64 16 5 3.99 0.16
reduce_sum analytic 128 1 5 1.00 0.87
reduce_sum analytic 128 2 5 1.91 0.46
reduce_sum analytic 128 4 5 3.25 0.27
reduce_sum analytic 128 8 5 4.03 0.22
reduce_sum analytic 128 16 5 4.19 0.21
reduce_sum matrixExp 64 1 5 1.00 2.85
reduce_sum matrixExp 64 2 5 2.58 1.11
reduce_sum matrixExp 64 4 5 6.06 0.47
reduce_sum matrixExp 64 8 5 8.62 0.33
reduce_sum matrixExp 64 16 5 18.81 0.15
reduce_sum matrixExp 128 1 5 1.00 3.65
reduce_sum matrixExp 128 2 5 1.95 1.87
reduce_sum matrixExp 128 4 5 3.78 0.97
reduce_sum matrixExp 128 8 5 6.55 0.56
reduce_sum matrixExp 128 16 5 11.88 0.31
reduce_sum ODE 64 1 5 1.00 155.73
reduce_sum ODE 64 2 5 2.18 72.22
reduce_sum ODE 64 4 5 3.43 45.63
reduce_sum ODE 64 8 5 5.80 26.92
reduce_sum ODE 64 16 5 6.95 22.43
reduce_sum ODE 128 1 5 1.00 252.76
reduce_sum ODE 128 2 5 1.76 144.26
reduce_sum ODE 128 4 5 3.61 71.26
reduce_sum ODE 128 8 5 4.91 51.70
reduce_sum ODE 128 16 5 6.34 40.73

What is really cool to see is that running times of more than 4 hours go down to well below one hour.

It is a bit surprising to see the ODE runs hit a ceiling in terms of speedup, but looking at the decrease in wall time, things look sensible to me.

I definitely want this in Stan soon!


Can you do the plots with map_rect/reduce_sum as colors and 64/128 as x-facets? This is really nice info, thanks for putting this together.

It’s surprising to see any differences between map_rect and reduce_sum… the model is written such that for map_rect each function call just returns a vector of length 1, which is the log-likelihood.

Still, there are some minor differences if I plot it like this:


Yeah I’m happy from these plots that it’s running at least as efficiently as map_rect, which is good to know.

Maybe it’s the extra stuff @stevebronder put in to cut the overhead in the smaller cases.

Would it be difficult for you to add these examples to the cmdstan performance testing suite? It would be cool to have them easy to spin up, but I also don’t want you to waste cycles if it would be annoying.

That can be done. The model is very flexible in terms of what is being calculated, and this can be controlled by changing the data given to the model.
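Concretely, following the cat/printf pattern from the bash script earlier in the thread, a data file selecting the calculation can be built like this (method=1 is the reduce_sum variant per the script's comment; that method=0 selects the map_rect path is my assumption):

#!/bin/bash
## Sketch only: build a data file that selects the calculation, following the
## pattern from the earlier bash script; no recompilation of the model needed.
J=128        # number of subjects
METHOD=1     # 1 = reduce_sum (per the script's comment); 0 presumably map_rect
SOLVER=2     # 0 = analytic, 1 = matrix exponential, 2 = ODE
cat oral4_stan-base.R \
    <(printf "J <- $J\nmethod <- $METHOD\nsolver <- $SOLVER\n") \
    > oral4_stan-$J-$METHOD.R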

Is there some guidance/doc on how to add a model?

Nope! Just sorta go for it, I think. You can make a folder at the top level and it’s probably fine.

@bbbales2

Do you remember which features of VTune you used to investigate this?

-Andre

Profiling this code, all the “check_nan” error-checking stuff really kills performance… I know this isn’t relevant to this thread…