Parallel autodiff v4

Can you do fake data for this model? Or just copy-paste a bunch of subjects?

I dug out some old code which I created for map_rect benchmarking. This code simulates fake data for a 2-cmt PK system with oral dosing. It supports:

  • A flexible number of subjects (J = 128 here)
  • Different methods of solving the problem: solver = 0 for the analytic solution, solver = 1 for the matrix exponential, and solver = 2 for the ODE solver

I am running things now on my 4-core laptop with 1, 2, and 3 cores to avoid competition with the OS processes. This is what I get for 25 warmup / 25 sampling iterations:

rs-solver_0-1.log:real 18.00  1 core
rs-solver_0-2.log:real 9.52    2 cores
rs-solver_0-3.log:real 7.07    3 cores
rs-solver_1-1.log:real 111.75
rs-solver_1-2.log:real 61.86
rs-solver_1-3.log:real 47.39
rs-solver_2-1.log:real 426.15
rs-solver_2-2.log:real 216.53
rs-solver_2-3.log:real 168.70

Doubling to J=256 and only doing analytic solves gives me

rs-solver_0-1.log:real 58.38
rs-solver_0-2.log:real 32.59
rs-solver_0-3.log:real 23.52

To me this is good scaling behaviour.
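(As an aside, the numbers above are just the "real" lines that time -p writes into the logs. A small helper along these lines - purely a sketch, assuming the rs-solver_<solver>-<cores>.log naming above - can pull them out and compute the speedup relative to the 1-core run:

#!/bin/bash
## Hypothetical helper, not part of the original scripts: summarise the "real"
## wall-clock times from the rs-solver_*.log files and report the speedup
## relative to the corresponding 1-core run.
for SOLVER in 0 1 2; do
    BASE=$(grep '^real' rs-solver_${SOLVER}-1.log 2>/dev/null | awk '{print $2}')
    [ -z "$BASE" ] && continue          # skip solvers without a 1-core reference run
    for CORES in 1 2 3; do
        LOG=rs-solver_${SOLVER}-${CORES}.log
        [ -f "$LOG" ] || continue
        REAL=$(grep '^real' "$LOG" | awk '{print $2}')
        printf "solver=%s cores=%s real=%ss speedup=%.2fx\n" \
            "$SOLVER" "$CORES" "$REAL" "$(echo "$BASE / $REAL" | bc -l)"
    done
done
)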

Here are the files to run this:

oral_2cmt_async.stan (15.5 KB)
oral4_stan-base.R (351 Bytes)

And here is the bash script controlling this:

#!/bin/bash

CHAIN=1

CORES=${1-1}

export STAN_NUM_THREADS=$CORES

echo "USING $CORES CPU CORES:"

#J=128       # used for the first batch of runs; overridden by the line below
J=256
NITER=25

## 0 = analytic; 1 = matrix exponential; 2 = ODE
SOLVER=${2-0}

cat <(cat oral4_stan-base.R) <(printf "J <- $J\nmethod <- 0\nsolver <- ${SOLVER}\n") > oral4_stan-$J-0.R
cat <(cat oral4_stan-base.R) <(printf "J <- $J\nmethod <- 1\nsolver <- ${SOLVER}\n") > oral4_stan-$J-1.R

MODEL=./oral_2cmt_async

## reduce_sum version
(time -p $MODEL sample num_samples=$NITER num_warmup=$NITER save_warmup=1 algorithm=hmc stepsize=0.01 adapt delta=0.9 id=$CHAIN  data file=oral4_stan-$J-1.R init=0.1 random seed=10 output refresh=10 file=samples5-$CHAIN-rect-$J-0-solver_${SOLVER}-${CORES}.csv) 2>&1 | tee rs-solver_${SOLVER}-${CORES}.log
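For completeness, the timings above came from invoking this script once per core count and solver. A driver loop along these lines would reproduce the sweep (the name run_oral4.sh is just a placeholder for wherever the script above is saved):

#!/bin/bash
## Hypothetical driver: sweep over the solvers and core counts used above.
for SOLVER in 0 1 2; do      # 0 = analytic, 1 = matrix exponential, 2 = ODE
    for CORES in 1 2 3; do   # leave the 4th core free for the OS
        ./run_oral4.sh $CORES $SOLVER
    done
done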

This looks like a good test model, thanks!

BTW, it could be that the 1-core run is faster than it should be, since processors can apply Turbo Boost when there is not too much other load on the system. Turbo Boost temporarily over-clocks the CPU when needed.

Also: Do you think it would be valuable to throw this onto a cluster and do a more systematic assessment? It’s a bit of work to code that up, but maybe it’s worthwhile.

Yeah. I think it’d be worth the time. Can you make sure before you shoot these off that the NITER=25 timings are reliable enough? Mine were starting to get noisy.

I was surprised that the base thing didn’t scale to 3x for the ODE solve even though it’s taking a lot of time (and we’d expect the overhead to be low because the analytical solution also has that overhead).

In that spirit, I guess I’d like to know what happens with:

  1. More threads running in parallel (as many cores as you have easy access to on one computer)

  2. Many single chains running in parallel (again, up to as many cores as you have on a computer). We’d expect the runtime to roughly match the single chain running alone if memory isn’t an issue

#2 should sorta act like a reference to the limits of our speedups?

Oh good point. I guess we should be comparing N threads against N parallel chains.
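As a concrete sketch of #2 (reusing the CmdStan call from the script above; N and the data file name are placeholders): pin STAN_NUM_THREADS to 1, launch N independent chains in the background, and compare the wall time against one chain using N threads.

#!/bin/bash
## Sketch only: N single-threaded chains running in parallel instead of one
## chain with N threads. N and the file names are placeholders.
export STAN_NUM_THREADS=1
N=4
for CHAIN in $(seq 1 $N); do
    ./oral_2cmt_async sample num_samples=25 num_warmup=25 \
        id=$CHAIN data file=oral4_stan-128-1.R init=0.1 random seed=10 \
        output file=samples-ref-$CHAIN.csv &
done
time -p wait   # total wall time should roughly match a single chain run alone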

Ok. I spent the day setting this up carefully so that more examples could be explored if we want to.

I tried out different grain sizes, but quickly found that grainsize=0 is best for this problem, so I did not vary it.

Here are the comparisons:

In numbers this is:

solver num_subjects grainsize cores runs mean_speedup mean_runtime
analytic 64 0 1 5 1.00 12.04
analytic 64 0 2 5 1.79 6.76
analytic 64 0 4 5 2.74 4.40
analytic 64 0 8 5 4.05 2.97
analytic 64 0 16 5 4.83 2.50
analytic 128 0 1 5 1.00 45.51
analytic 128 0 2 5 1.80 25.41
analytic 128 0 4 5 3.11 14.68
analytic 128 0 8 5 4.74 9.61
analytic 128 0 16 5 5.31 8.57
matrixExp 64 0 1 5 1.00 51.30
matrixExp 64 0 2 5 1.90 27.05
matrixExp 64 0 4 5 3.58 14.33
matrixExp 64 0 8 5 6.51 7.88
matrixExp 64 0 16 5 12.43 4.13
matrixExp 128 0 1 5 1.00 199.85
matrixExp 128 0 2 5 1.96 101.91
matrixExp 128 0 4 5 3.76 53.12
matrixExp 128 0 8 5 7.05 28.36
matrixExp 128 0 16 5 12.41 16.10

Right now a batch with 256 subjects is running - I will post that if it is interesting, but I doubt it will reveal new trends.

The J=128 problem with the matrix exponential is cool: 200s => 16s. The above is with 5 replications per case and I did run 50 warmup / 50 iterations.
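(For the record, mean_runtime in the table is just the average of the "real" wall times over the 5 replicate logs of a case; something like the one-liner below does it - the -rep<k> suffix in the file names is made up for this sketch.)

## Hypothetical: average the "real" wall times over the replicate logs of one case.
grep -h '^real' rs-solver_1-16-rep*.log | awk '{s += $2; n++} END {print s / n}'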


Yeah this is what I was more hoping for lol.

Looks good and it’s all scaling something like we’d hope.

Do we have the map_rect comparison numbers?

Do we have the ode numbers?

Give a finger and they grab the whole arm… things are still running on our cluster, as the single-core ODE runs take an enormous amount of time. Since I have to guess the total runtime in advance, I may need to restart those runs (after the jobs get killed on our cluster for exceeding their pre-specified runtime).

Anyway… in the meantime I can post what I have now. As the single-core runs were not ready for some ODE cases, I just multiplied the 2-core runtime by a factor of 2 to approximate the single-core runtime (which makes the scaling look very good for the 2- and 4-core runs) - this should already give a good idea of things.

method solver num_subjects grainsize cores runs mean_speedup mean_runtime
map_rect analytic 64 0 1 5 1.00 57.04
map_rect analytic 64 0 2 5 1.83 31.15
map_rect analytic 64 0 4 5 3.02 19.07
map_rect analytic 64 0 8 5 4.66 12.26
map_rect analytic 64 0 16 5 5.43 10.51
map_rect analytic 128 0 1 5 1.00 75.44
map_rect analytic 128 0 2 5 1.76 42.95
map_rect analytic 128 0 4 5 3.08 24.50
map_rect analytic 128 0 8 5 4.58 16.47
map_rect analytic 128 0 16 5 5.30 14.23
map_rect matrixExp 64 0 1 5 1.00 125.80
map_rect matrixExp 64 0 2 5 1.90 66.40
map_rect matrixExp 64 0 4 5 3.59 35.08
map_rect matrixExp 64 0 8 5 6.44 19.61
map_rect matrixExp 64 0 16 5 12.10 10.46
map_rect matrixExp 128 0 1 5 1.00 267.16
map_rect matrixExp 128 0 2 5 1.94 137.61
map_rect matrixExp 128 0 4 5 3.70 72.31
map_rect matrixExp 128 0 8 5 7.06 37.86
map_rect matrixExp 128 0 16 5 12.30 21.73
map_rect ODE 64 0 2 5 2.00 5999.23
map_rect ODE 64 0 4 5 3.78 3176.32
map_rect ODE 64 0 8 5 6.55 1833.21
map_rect ODE 64 0 16 5 9.53 1260.13
map_rect ODE 128 0 8 3 8.00 3241.48
map_rect ODE 128 0 16 5 11.64 2228.40
reduce_sum analytic 64 0 1 5 1.00 47.84
reduce_sum analytic 64 0 2 5 1.75 28.03
reduce_sum analytic 64 0 4 5 3.45 13.99
reduce_sum analytic 64 0 8 5 5.68 8.60
reduce_sum analytic 64 0 16 5 5.78 8.64
reduce_sum analytic 128 0 1 5 1.00 62.33
reduce_sum analytic 128 0 2 5 1.85 33.73
reduce_sum analytic 128 0 4 5 3.26 19.25
reduce_sum analytic 128 0 8 5 5.16 12.10
reduce_sum analytic 128 0 16 5 5.89 10.61
reduce_sum matrixExp 64 0 1 5 1.00 201.00
reduce_sum matrixExp 64 0 2 5 2.63 76.59
reduce_sum matrixExp 64 0 4 5 5.41 37.93
reduce_sum matrixExp 64 0 8 5 9.65 20.90
reduce_sum matrixExp 64 0 16 5 19.73 10.29
reduce_sum matrixExp 128 0 1 5 1.00 243.10
reduce_sum matrixExp 128 0 2 5 1.92 126.43
reduce_sum matrixExp 128 0 4 5 3.68 66.21
reduce_sum matrixExp 128 0 8 5 6.90 35.40
reduce_sum matrixExp 128 0 16 5 12.18 19.98
reduce_sum ODE 64 0 2 5 2.02 5937.02
reduce_sum ODE 64 0 4 5 3.87 3069.94
reduce_sum ODE 64 0 8 5 6.36 1870.26
reduce_sum ODE 64 0 16 5 8.77 1355.23
reduce_sum ODE 128 0 8 4 8.03 3037.93
reduce_sum ODE 128 0 16 5 10.95 2227.58

The ODE solver now uses a non-stiff Adams-Moulton solver.

Overall these are nice results to me. I am not sure why we get speedups above the number of cores for 64 patients with reduce_sum and the matrix exponential; maybe this is an artefact, or it may be real because things fit better into the CPU caches (I don’t mind a bias towards better speeds).

If it isn’t too much pain to get the final results and it is easy to update this post, then I will edit it later.

Yack - I had to rerun the entire thing, as the single-core ODE runs blew the resource limits I had set. Then I found out that longer runs can only run on a different CPU architecture on our cluster. Thus, the results below are from a newer and faster CPU (Intel® Xeon® CPU E5-2640 v4 @ 2.40GHz).

And the mean results (time is now in minutes):

method solver J cores runs mean_speedup mean_runtime
map_rect analytic 64 1 5 1.00 0.77
map_rect analytic 64 2 5 1.83 0.42
map_rect analytic 64 4 5 2.81 0.27
map_rect analytic 64 8 5 3.75 0.21
map_rect analytic 64 16 5 3.74 0.21
map_rect analytic 128 1 5 1.00 1.04
map_rect analytic 128 2 5 1.69 0.62
map_rect analytic 128 4 5 2.96 0.35
map_rect analytic 128 8 5 3.92 0.27
map_rect analytic 128 16 5 3.70 0.28
map_rect matrixExp 64 1 5 1.00 1.88
map_rect matrixExp 64 2 5 1.90 0.99
map_rect matrixExp 64 4 5 3.53 0.54
map_rect matrixExp 64 8 5 6.50 0.29
map_rect matrixExp 64 16 5 11.51 0.16
map_rect matrixExp 128 1 5 1.00 4.03
map_rect matrixExp 128 2 5 1.97 2.04
map_rect matrixExp 128 4 5 3.87 1.04
map_rect matrixExp 128 8 5 6.97 0.58
map_rect matrixExp 128 16 5 12.03 0.34
map_rect ODE 64 1 5 1.00 154.12
map_rect ODE 64 2 5 1.86 82.88
map_rect ODE 64 4 5 3.39 45.49
map_rect ODE 64 8 5 5.67 27.21
map_rect ODE 64 16 5 6.90 22.36
map_rect ODE 128 1 5 1.00 286.29
map_rect ODE 128 2 5 1.90 150.61
map_rect ODE 128 4 5 3.65 78.59
map_rect ODE 128 8 5 6.11 46.88
map_rect ODE 128 16 5 7.21 39.71
reduce_sum analytic 64 1 5 1.00 0.65
reduce_sum analytic 64 2 5 1.83 0.36
reduce_sum analytic 64 4 5 2.99 0.22
reduce_sum analytic 64 8 5 4.09 0.16
reduce_sum analytic 64 16 5 3.99 0.16
reduce_sum analytic 128 1 5 1.00 0.87
reduce_sum analytic 128 2 5 1.91 0.46
reduce_sum analytic 128 4 5 3.25 0.27
reduce_sum analytic 128 8 5 4.03 0.22
reduce_sum analytic 128 16 5 4.19 0.21
reduce_sum matrixExp 64 1 5 1.00 2.85
reduce_sum matrixExp 64 2 5 2.58 1.11
reduce_sum matrixExp 64 4 5 6.06 0.47
reduce_sum matrixExp 64 8 5 8.62 0.33
reduce_sum matrixExp 64 16 5 18.81 0.15
reduce_sum matrixExp 128 1 5 1.00 3.65
reduce_sum matrixExp 128 2 5 1.95 1.87
reduce_sum matrixExp 128 4 5 3.78 0.97
reduce_sum matrixExp 128 8 5 6.55 0.56
reduce_sum matrixExp 128 16 5 11.88 0.31
reduce_sum ODE 64 1 5 1.00 155.73
reduce_sum ODE 64 2 5 2.18 72.22
reduce_sum ODE 64 4 5 3.43 45.63
reduce_sum ODE 64 8 5 5.80 26.92
reduce_sum ODE 64 16 5 6.95 22.43
reduce_sum ODE 128 1 5 1.00 252.76
reduce_sum ODE 128 2 5 1.76 144.26
reduce_sum ODE 128 4 5 3.61 71.26
reduce_sum ODE 128 8 5 4.91 51.70
reduce_sum ODE 128 16 5 6.34 40.73

What is really cool to see is that running times of more than 4 hours go down to well below one hour.

It is a bit surprising to see the ODE runs hit a ceiling in terms of speedup, but looking at the decrease in wall time, things look sensible to me.

I definitely want this in Stan soon!


Can you do the plots with map_rect/reduce_sum as colors and 64/128 as x-facets? This is really nice info, thanks for putting this together.

It’s surprising to see any differences between map_rect and reduce_sum… the model is written such that for map_rect each function call just returns a vector of length 1, which is the log-likelihood.

Still, there are some minor differences if I plot it like this:


Yeah I’m happy from these plots that it’s running at least as efficiently as map_rect, which is good to know.

Maybe it’s the extra stuff @stevebronder put in to cut the overhead in the smaller cases.

Would it be difficult for you to add these examples to the cmdstan performance testing suite? It would be cool to have them easy to spin up, but I also don’t want you to waste cycles if it would be annoying.

That can be done. The model is very flexible in terms of what is being calculated, and this can be controlled by changing the data given to the model.
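Concretely, following the cat/printf pattern from the bash script earlier in the thread, a data file selecting the calculation can be built like this (method=1 is the reduce_sum variant per the script's comment; that method=0 selects the map_rect path is my assumption):

#!/bin/bash
## Sketch only: build a data file that selects the calculation, following the
## pattern from the earlier bash script; no recompilation of the model needed.
J=128        # number of subjects
METHOD=1     # 1 = reduce_sum (per the script's comment); 0 presumably map_rect
SOLVER=2     # 0 = analytic, 1 = matrix exponential, 2 = ODE
cat oral4_stan-base.R \
    <(printf "J <- $J\nmethod <- $METHOD\nsolver <- $SOLVER\n") \
    > oral4_stan-$J-$METHOD.R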

Is there some guidance/doc on how to add a model?

Nope! Just sorta go for it, I think. You can make a folder at the top level and it’s probably fine.

@bbbales2

Do you remember which features of VTune you used to investigate this?

-Andre

Profiling this code, all the “check_nan” error-checking stuff really kills performance… I know this isn’t relevant to this thread…