Stan significantly slower after incorporating multithreading?

After converting my model block to include target += reduce_sum() calls and compiling the model in cmdstanr with cpp_options = list(stan_threads = TRUE) (and setting grainsize=1 in the model block), fitting takes significantly longer. Has anyone had this experience? Without going into the details of the model, are there common problems I may be overlooking?
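For reference, since the model details aren't shown, here is a minimal sketch of the pattern described — the likelihood, data `y`, and parameters `mu` and `sigma` are hypothetical stand-ins, and `grainsize` is passed as data so it can be changed without recompiling:

```stan
functions {
  // Partial sum over a slice of the data; reduce_sum supplies start/end.
  real partial_sum_lpdf(array[] real y_slice, int start, int end,
                        real mu, real sigma) {
    return normal_lpdf(y_slice | mu, sigma);
  }
}
data {
  int<lower=1> N;
  array[N] real y;
  int<lower=1> grainsize;  // 1 = let the scheduler choose slice sizes
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 1);
  sigma ~ exponential(1);
  target += reduce_sum(partial_sum_lpdf, y, grainsize, mu, sigma);
}
```

compiled in cmdstanr with `cmdstan_model("model.stan", cpp_options = list(stan_threads = TRUE))`.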

Thank you all for your time.

EDIT: this is covered in a number of standard discussions of within-chain parallelization. Models with large numbers of parameters and simple likelihoods are not conducive to threading. This is the main reason why I’m not seeing the benefit. Thanks everyone.

That's 1 element per thread, so if you are only doing something like an inexpensive lpdf inside the reduce_sum call, that is going to create a lot of overhead per thread spin-up. Have you tried setting the grainsize much higher? Note that you do need a good number of operations per group to see a benefit from parallelising the lpdf.

I was under the impression that grainsize = 1 means the grainsize is chosen automatically (see here). I added some print statements in the partial-sum function and found the groups contained around 3 or 4 elements each. But your point about operations per group is well taken. In fact, I found this discussion, which covers exactly what you mentioned.

Maybe have a look here:

https://cran.r-project.org/web/packages/brms/vignettes/brms_threading.html


Thank you for the link. I still don’t have intuition about exactly when the overhead outweighs parallelization. The bottom line is that I need to iterate on my model if I want to experience the benefits of threading.

Just compare the model's performance to the case of no threading at all: replace the reduce_sum call with a direct call to the partial log-likelihood function, evaluating the full data at once. If you beat that runtime with multiple cores in use, then you are good.
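To make that concrete, here's a sketch of the two model-block variants, assuming a partial-sum function named `partial_sum_lpdf` over data `y` of length `N` (hypothetical names, not from the original post):

```stan
// Threaded version:
target += reduce_sum(partial_sum_lpdf, y, grainsize, mu, sigma);

// Serial baseline: the same partial-sum function over the full range.
target += partial_sum_lpdf(y | 1, N, mu, sigma);
```

If a run with several threads per chain doesn't beat the serial baseline's wall time, the threading overhead is dominating.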

I see many people getting carried away with optimizing the grainsize. That's not a wise use of resources. Just get the grainsize roughly right, so that you see a speedup. All you need is to get to a point where you can work efficiently — whether by brute force (within-chain threading), subsetting your data cleverly, tuning the model, or modeling things better. I'd warn most people against over-engineering the grainsize. Do not forget the other factors. Have you already profiled your likelihood evaluation, so that you know what you are paying your performance for?
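Stan's built-in profiling (available since Stan 2.26) is one way to do that: wrap the sections you care about in `profile` blocks and read the timings with `fit$profiles()` in cmdstanr. A sketch, reusing the same hypothetical likelihood names:

```stan
model {
  profile("priors") {
    mu ~ normal(0, 1);
    sigma ~ exponential(1);
  }
  profile("likelihood") {
    target += reduce_sum(partial_sum_lpdf, y, grainsize, mu, sigma);
  }
}
```

If "likelihood" accounts for only a small share of the total time, threading it can't buy you much no matter what grainsize you pick.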


I don't think there is an exact answer that doesn't require a lot of work for every particular function and compute environment. For a start, you'd need to benchmark how much it costs to set up the multi-threaded environment: copying any data the individual threads need, launching threads, shutting them down, and collecting their results.

I'd follow @wds15's answer above: test it against no threading, and try a handful of grainsizes.