Hello all. After implementing within-chain parallelization with reduce_sum in cmdstanpy (with your help), I reached a dead end. My model is pretty complicated (attached) and I fit a lot of data, and even when I run it on many cores (currently 25) it takes an unreasonably long time to fit (it has been stuck in warmup for two days already). I tried different ways to parallelize and improve the process, but to no avail, and I am out of options.
Any suggestion on how to improve it or make it faster will be highly appreciated.
I am using cmdstanpy with CmdStan 2.24, CentOS 7, g++ 9.3.0.
Yes it is :) I have been building it over the last few months. I ran it on a small N of 8-10 on a similar number of cores and it finished within 20 hours. N=20 on 20 cores scaled up to 40 hours. And these are just small portions of my data. Single-core processing of N=15 took ~70 hours for the fastest chain (6 days for the last chain). So I think something can be done better with the parallelization, but I am really not sure what. Vectorizing the loops within reduce_sum didn't give much speed gain.
Usually vectorization is super important. If that does not help, then things are odd to me.
Anyway, you should try to:
slice the parameters used per group; so if your model is a hierarchical one, then slice all the parameters needed per group
parameters in the shared-arguments list are very expensive, since they are copied for every call of the partial-sum function (hence, slice the parameters as appropriate for your model)
having data in the shared arguments is cheap, because it's handled by reference (no copy)
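To make the cost difference concrete, here is a minimal sketch of a partial-sum signature (all names are made up; the CmdStan 2.24 array syntax is used). The first argument is the sliced one, so only the slice is copied per call; data in the shared arguments is passed by reference, while parameters in the shared arguments are copied on every call:

```stan
// Illustrative signature only -- theta_slice, y, and mu_pop are placeholders.
real partial_sum(vector[] theta_slice,  // sliced subject-level parameters: copied per slice only
                 int start, int end,
                 int[,] y,              // data in shared args: passed by reference, no copy
                 vector mu_pop) {       // parameter in shared args: copied on EVERY call
  // ... accumulate the log-likelihood for subjects start..end here
  return 0;  // placeholder body
}
```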
Other than that, make sure the model runs efficiently without reduce_sum first, then move to reduce_sum… I hope that is what happened in your case.
Based on the tests I ran, the serial version would take more than a month for the full dataset. I am not sure what counts as efficient here, but this is what I've got after optimizing each part of my model separately. That's where I started to implement multithreading.
Now I see that you do have a for loop calling reduce_sum multiple times… that's not efficient. As a workaround you could try nesting reduce_sum calls as a start, but it would be much better to have only a single layer of reduce_sum.
Here is the annotated model. Hope it is clearer. The general design is that participants (N) perform 7 different tasks (each with a well defined computational model) during multiple weeks (W).
Yes, this was one of my latest tries, and it was also not successful. Initially I parallelized over N by using reduce_sum_static with grainsize=1, so that each participant would get its own core. But it was also very slow.
I would expect that each subject (out of N) is independent of all the others, right? It's just that, per subject, things are correlated over time, the weeks. Is that how your model works?
Yes, exactly. Except for dynamic variables (e.g., X) that are shared across subjects. My N is ~100 and W is ~10. You wrote earlier that using a nested version of reduce_sum is an option. What did you mean by that?
Then you should really rewrite the model such that you form partial sums over subjects. So how about this plan:
rewrite the model so that it is of the form for( i in 1:N ) target += subject_log_lik(... subject i ...);
then you write a function forming partial sums over multiple subjects
finally you use reduce_sum
As a last step you can try to slice all those parameters per subject. So the first argument, which gets sliced, becomes an array of dimension a[N,…], and this contains, for a given subject, all the subject-specific parameters needed to calculate its log-lik.
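The plan above could look roughly like this in Stan (a sketch only, using CmdStan 2.24 array syntax; subject_log_lik, theta, y, beta, and grainsize are illustrative names, and the bernoulli likelihood is just a stand-in for the seven task models):

```stan
functions {
  // per-subject log-likelihood over all weeks
  // (illustrative body: replace with your actual task models)
  real subject_log_lik(vector theta_i, int[] y_i, vector beta) {
    return bernoulli_logit_lpmf(y_i | theta_i[1] + beta[1]);
  }

  // partial sum over a contiguous slice of subjects
  real partial_sum(vector[] theta_slice, int start, int end,
                   int[,] y, vector beta) {
    real lp = 0;
    for (i in 1:(end - start + 1))
      lp += subject_log_lik(theta_slice[i], y[start + i - 1], beta);
    return lp;
  }
}
model {
  // theta[N] holds each subject's parameters, so it is the sliced
  // argument (copied slice-by-slice); y (data) and beta are shared args
  target += reduce_sum(partial_sum, theta, grainsize, y, beta);
}
```

The key design point is that everything subject-specific travels in the sliced first argument, so each partial-sum call copies only its own slice rather than the full parameter set.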
Sorry for being so slow… You mean I should put the loop over W inside this
for( i in 1:N ) target += subject_log_lik(... subject i ...); ?
If yes, I am not sure I can do that. Some of the parameters are subject-specific, but some parameters are shared across subjects and are updated dynamically as W progresses.