Hello all. After implementing within-chain parallelization with reduce_sum in cmdstanpy (with your help), I reached a dead end. My model is pretty complicated (attached) and I fit a lot of data, and even when I run it on many cores (currently 25) it takes an unreasonably long time to fit (it has been stuck in warmup for two days already). I tried different ways to parallelize and improve the process, but to no avail, and I am out of options.
Any suggestion on how to improve it or make it faster will be highly appreciated.
I am using cmdstanpy with CmdStan 2.24, CentOS 7, g++ 9.3.0.
Yes it is :) I have been building it over the last few months. I ran it on a small N of 8-10 on a similar number of cores and it finished within 20 hours. N=20 on 20 cores scaled up to 40 hours. And these are just small portions of my data. Single-core processing of N=15 took ~70 hours for the fastest chain (6 days for the last chain). So I think something can be done better with the parallelization, but I am really not sure what. Vectorizing the loops within reduce_sum didn't give much speed gain.
Usually vectorization is super important. If that does not help, then things are odd to me.
Anyway, you should try to:
slice the parameters used per group; so if your model is a hierarchical one, then slice all the parameters needed per group
parameters in the shared-arguments list are very expensive, since they are copied for every call of the partial-sum function (hence, slice the parameters as appropriate for your model)
having data in the shared arguments is cheap, because it's handled by reference (no copy)
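To make the cost difference concrete, here is a minimal sketch of a partial-sum signature (all names are made up; the CmdStan 2.24 array syntax is used). The first argument is the sliced one, so only the slice is copied per call; data in the shared arguments is passed by reference, while parameters in the shared arguments are copied on every call:

```stan
// Illustrative signature only -- theta_slice, y, and mu_pop are placeholders.
real partial_sum(vector[] theta_slice,  // sliced subject-level parameters: copied per slice only
                 int start, int end,
                 int[,] y,              // data in shared args: passed by reference, no copy
                 vector mu_pop) {       // parameter in shared args: copied on EVERY call
  // ... accumulate the log-likelihood for subjects start..end here
  return 0;  // placeholder body
}
```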
Other than that, make sure the model runs efficiently without reduce_sum first, then move to reduce_sum… I hope that is what happened in your case.
Based on the tests I ran, the serial version would take more than a month for the full dataset. I am not sure what counts as efficient here, but this is what I've got after optimizing each part of my model separately. That's where I started to implement multithreading.
Now I see that you do have a for loop calling reduce_sum multiple times… that's not efficient. As a workaround you could try nesting reduce_sum calls as a start, but it would be much better to have only a single layer of reduce_sum.
Here is the annotated model. Hope it is clearer. The general design is that participants (N) perform 7 different tasks (each with a well defined computational model) during multiple weeks (W).
Yes, this was one of my latest tries, and it was also not successful. Initially I parallelized over N by using reduce_sum_static with grainsize=1, so that each participant would get its own core. But it was also very slow.
I would expect that each subject (out of N) is independent of all the others, right? It's just that, per subject, things are correlated over time, the weeks. Is that how your model works?
Yes, exactly. Except for dynamic variables (e.g., X) that are shared across subjects. My N is ~100 and W is ~10. You wrote earlier that using a nested version of reduce_sum is an option. What did you mean by that?
Then you should really rewrite the model such that you form partial sums over subjects. So how about this plan:
rewrite the model so that it is of the form for( i in 1:N ) target += subject_log_lik(... subject i ...);
then you write a function forming partial sums over multiple subjects
finally you use reduce_sum
As a last step you can try to slice all those parameters per subject. So the first argument, which gets sliced, becomes an array of dimension a[N,…], and this contains, for a given subject, all the subject-specific parameters needed to calculate its log-lik.
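The plan above could look roughly like this in Stan (a sketch only, using CmdStan 2.24 array syntax; subject_log_lik, theta, y, beta, and grainsize are illustrative names, and the bernoulli likelihood is just a stand-in for the seven task models):

```stan
functions {
  // per-subject log-likelihood over all weeks
  // (illustrative body: replace with your actual task models)
  real subject_log_lik(vector theta_i, int[] y_i, vector beta) {
    return bernoulli_logit_lpmf(y_i | theta_i[1] + beta[1]);
  }

  // partial sum over a contiguous slice of subjects
  real partial_sum(vector[] theta_slice, int start, int end,
                   int[,] y, vector beta) {
    real lp = 0;
    for (i in 1:(end - start + 1))
      lp += subject_log_lik(theta_slice[i], y[start + i - 1], beta);
    return lp;
  }
}
model {
  // theta[N] holds each subject's parameters, so it is the sliced
  // argument (copied slice-by-slice); y (data) and beta are shared args
  target += reduce_sum(partial_sum, theta, grainsize, y, beta);
}
```

The key design point is that everything subject-specific travels in the sliced first argument, so each partial-sum call copies only its own slice rather than the full parameter set.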
Sorry for being so slow… You mean I should put the loop over W inside this
for( i in 1:N ) target += subject_log_lik(... subject i ...); ?
If yes, I am not sure I can do that. Some of the parameters are subject-specific, but some parameters are shared across subjects and are updated dynamically as W progresses.