Hello again. My struggle with within-chain parallelization using reduce_sum is unfortunately ramping up. I moved to CmdStanR (after CmdStanPy did not work with multithreading), but when I run this model using reduce_sum I have two problems:
(1) the model with reduce_sum takes twice as long as the model without it;
(2) even though I configured the model to use two threads and set threads_per_chain = 2, I see 3 running threads. Maybe these problems are related.
I am really stuck…
Hi @nerpa. It will help if we can read your partial_sum function (or, ideally, a minimal working example), along with the R code that calls the program, including what you set as the grainsize. Further, it is generally best to a) place as much of the program as possible inside the partial_sum function(s) and b) move statements like
X[n,w] ~ normal(mu_prior_x,sigma_v);
outside loops in favor of summing over a vector and incrementing target. Also, what hardware are you trying to use for threading?
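As a rough sketch of the vectorization idea (the declarations of X, mu_prior_x, and sigma_v are assumptions here, since the full model isn't shown):

```stan
// instead of a scalar statement inside the double loop:
// for (n in 1:N)
//   for (w in 1:W)
//     X[n, w] ~ normal(mu_prior_x, sigma_v);

// vectorize over one dimension at a time ...
for (n in 1:N)
  X[n] ~ normal(mu_prior_x, sigma_v);  // X[n] as a row of a matrix

// ... or, if X is a matrix with an i.i.d. prior, in one statement:
to_vector(X) ~ normal(mu_prior_x, sigma_v);
```

The vectorized forms contribute the same log density but build far fewer autodiff nodes than the element-wise loop.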
grainsize is hardcoded at this level to be 13 (it’s in the model part I attached) since the overall number of trials is 26 and I am testing with 2 threads.
UPDATE - I use a MacBook with 16 GB RAM and a 2.9 GHz Intel Core i7.
I wish I could move things like X[n,w] ~ normal(mu_prior_x,sigma_v); out of the loop, but I am modeling a process evolving in time (w), so I am not sure that is possible.
Sorry, I missed that you defined it in the model block; I was expecting that to be data. You can move
int grainsize = 13;
to transformed data, and it will be evaluated only once when the model first runs, instead of on every iteration within both loops, for example.
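For example (only the grainsize declaration moves; the rest of the model stays as is):

```stan
transformed data {
  // evaluated once, when the data is read in -- not on every iteration
  int grainsize = 13;
}
```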
By trials, do you mean you only have 26 rows of data? The function bernoulli_logit_lpmf is already highly optimized, and reduce_sum has some overhead for parameters, so on balance it may or may not be faster. Moreover, if your model spends most of its time in code outside that function, you may not get much speedup anyway: the code outside reduce_sum becomes the limiting factor, on top of the overhead of the function call. You can search Discourse for previous discussions on this.
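For reference, this is the generic shape of a reduce_sum call over a bernoulli_logit likelihood; the names (y, eta) and the data layout are assumptions for illustration, not your model:

```stan
functions {
  real partial_sum(array[] int y_slice, int start, int end, vector eta) {
    // log-likelihood contribution of observations start..end
    return bernoulli_logit_lpmf(y_slice | eta[start:end]);
  }
}
model {
  // y is sliced into chunks of roughly `grainsize` elements,
  // each evaluated on an available thread
  target += reduce_sum(partial_sum, y, grainsize, eta);
}
```

With only 26 rows, each chunk is tiny, so the per-call overhead can easily exceed the work saved by parallelism.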
Yes, I will move int grainsize = 13; to transformed data, thanks.
My test data has 26 rows, but the actual data is much larger. I also have more models that I fit in every N-W loop with different parameters, and I hoped reduce_sum would speed things up.
By slicing, you mean that I could slice X instead of looping over it?
That’s helpful and encouraging! I will try with larger data sets.
What else could I slice here? At this point, my full model fits for a week on a subset of subjects (n=15), so every suggestion on how to speed things up will be highly appreciated.
I have one more question: does reduce_sum really wait until all slices are processed before moving forward? It is still unclear why I saw 3 threads running instead of 2.
It’s not just slicing. This is getting off-topic, but I would focus on building the model piece by piece, applying the various efficiency concepts in the user guide, and then, if you can, posting a separate question with the full model for efficiency advice. It is hard to diagnose efficiency issues without seeing how the whole model is coded and fits together; even the direction of indexing in arrays versus matrices can give speedups.
Hard to say from the information here … but the way you specified the call, it should launch 4 threads in total: two chains with two threads each.
Thanks so much for keeping up with this! Do you have a rough estimate of when it might work? (I really don’t like R so far :)
By the way, I read the note you filed, and multithreading doesn’t work in CmdStanPy even when I set `os.environ['STAN_NUM_THREADS']`. Hope this helps!
Chains are run via Python’s subprocess module, and setting the environment variable alone isn’t enough: according to Stack Overflow, it needs to be specified explicitly in the call to Popen: https://stackoverflow.com/a/20669704
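A minimal sketch of what that looks like (the child command here is just a placeholder that echoes the variable back; a real call would launch the compiled CmdStan executable):

```python
import os
import subprocess
import sys

# Build the child environment explicitly, per the linked answer,
# instead of relying on the inherited os.environ alone.
env = os.environ.copy()
env["STAN_NUM_THREADS"] = "2"

# Placeholder child process: print the variable so we can see it arrived.
proc = subprocess.Popen(
    [sys.executable, "-c", "import os; print(os.environ['STAN_NUM_THREADS'])"],
    env=env,
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = proc.communicate()
print(out.strip())  # the child process sees the value "2"
```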