Model with reduce_sum takes too long

Hello all. After implementing the within-chain parallelization with reduce_sum in cmdstanpy (with your help), I reached a dead end. My model is pretty complicated (attached) and I fit a lot of data, and even if I run it on many cores (now 25) it takes unreasonably long time to fit (it’s stuck in warmup for two days already). I tried different ways to parallelize and improve the process, but to no avail and I am out of options.
Any suggestion how to improve it, to make it faster or anything will be highly appreciated.
I am using cmdstanpy with cmdstan 2.24, centOS7, g++ 9.3.0.

Thank you!

model_parallel.txt (18.7 KB)

UPDATE - how big is my model? I have ~25k parameters and a bit more than 2 million observations.

That’s huge! Did you measure your speed ups on smaller problem sizes first? Do you have a sense for how long this will take with a single core?

Yes it is :) I was building it during the last few months. I was running it on small N of 8-10 on a similar number of cores and it finished within 20 hours. N=20 on 20 cores scaled up to 40 hours. And these are just small portions of my data. Single core processing of N=15 took ~70 hours for the fastest chain (6 days for the last chain). So I think something can be done better with the parallelization, but I am really not sure what. Vectorizing the loops within reduce_sum didn’t give much speed gain.

Usually vectorization is super important. If that does not help, then things are odd to me.

Anyway, you should try to

  • slice parameters used per group; so if your model is a hierachical one, then slice all the parameters needed per group
  • parameters in the shared arguments list is very expensive, since it’s copied for every call of the partial sum function (hence, slice the parameters as appropriate for your model).
  • having data in the share arguments is cheap, because its handled by reference (no copy)

Other than that make sure the model runs efficient without reduce_sum first, then go to reduce_sum… I hope that is what happened in your case.

You mean I should pass all the arrays I want to slice before I pass grainsize in the partial_sum?

Based on the tests I run, the serial version would take more than a month for the full dataset. I am not sure what counts as efficient here, but this is what I;ve got after optimizing each part of my model separately. That’s where I started to implement multithreading.

I guess not always.

This version with a for loop:

for(i in 1:N) {
   y[i] ~ normal(mu[i], sigma);

is massively slower as opposed to

  y ~ normal(mu, sigma);

Now, building up a large mu vector using a for loop is ok, but definitely not calling out by element to the lpdfs.

I vectorized these loops where possible in the partial_sum function, but didn’t see significant speedup. This was odd to me, but I left it vectorized.

you mean something like this?

for (w in 1:W) {
    target += reduce_sum(partial_sum, parameters1[,w], parameters2[,w], grainsize, data1[,w], data2[,w])	

I wish this was possible, but it goes against partial_sum function signature.

No. Never call many reduce sum like this if you can avoid it.

Can u annotate your model to see where the 25k params and all that data is? Somehow easier to read model would be nicer.

Now I see that you do have a for loop calling multiple reduce sum…that’s not efficient. As a workaround you could try to nest reduce sum calls as a start, but it would be a lot better to have one single layer of reduce sum only.

1 Like

Here is the annotated model. Hope it is clearer. The general design is that participants (N) perform 7 different tasks (each with a well defined computational model) during multiple weeks (W).

model_parallel.txt (19.4 KB)

Yes, this was one of my latest tries that also was not successful. Initially I parallelized over N by using reduce_sum_static with grainsize=1 - that is each participant will get a core. But it also was very slow.

Now I see that you meant the loop over W . This is unavoidable bc I am modeling a dynamical process. So there is no solution to it?

How big is N and W?

I would expect that each subject (out of N) is independent of all the others, right? It‘s just that per subject things are correlated over time, the weeks. Is that how your model works?

yes, exactly. Except for dynamic variables (eg., X) that are shared across subjects. My N is ~100 and W is ~10. You wrote earlier to use a nested version of reduce_sum is an option. What did you mean by it?

Then you should really rewrite the model such that you form partial sums over subjects. So how about this plan:

  • rewrite the model so that it is of the form
    for( i in 1:N ) target += subject_log_lik(... subject i ...);

  • then you write a function forming partial sums over multiple patients

  • finally you use use reduce_sum

As a last step you can try to slice all those parameters per subject. So the first argument which gets sliced becomes an array of dimension a[N,…] and this contains for a given subject all your parameters needed which are specific to the subject to calculate his log-lik.


Sorry for being so slow… You mean I should put the loop over W inside this

for( i in 1:N ) target += subject_log_lik(... subject i ...); ?

If yes, I am not sure I can do that. Some of the parameters are subject-specific, but some parameters are shared across subjects and are updated dynamically with the progression of W.