Stan significantly slower after incorporating multithreading?

After converting my model block to include target += reduce_sum() calls and compiling the model in cmdstanr with cpp_options = list(stan_threads = TRUE) (and setting grainsize=1 in the model block), fitting takes significantly longer. Has anyone had this experience? Without going into the details of the model, are there common problems I may be overlooking?

Thank you all for your time.

EDIT: this is covered in a number of standard discussions of within-chain parallelization. Models with large numbers of parameters and simple likelihoods are not conducive to threading. This is the main reason why I’m not seeing the benefit. Thanks everyone.

That's 1 element per thread, so if you are only doing something like an inexpensive lpdf inside the reduce_sum call, that will add a lot of overhead per thread spin-up. Have you tried setting the grainsize much higher? Note that you need a good number of operations per group to see a benefit from parallelising the lpdf.

I was under the impression that grainsize = 1 indicated that the grainsize is chosen automatically (see here). I added some print statements in the partial sum function and found the groups were sized around 3 or 4. But your point about operations per group is well taken. In fact, I found this discussion which covers exactly what you mentioned.

Maybe have a look here:

Thank you for the link. I still don’t have intuition about exactly when the overhead outweighs parallelization. The bottom line is that I need to iterate on my model if I want to experience the benefits of threading.
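For some rough intuition on when overhead outweighs parallelization, here is a minimal toy cost model in Python. It is only a sketch under stated assumptions, not a description of how Stan or the TBB scheduler actually partitions work: the function names, the fixed per-chunk overhead, and all the numbers are hypothetical, and `chunk_size` stands for whatever number of terms each partial sum really ends up with.

```python
import math


def serial_cost(n_terms, cost_per_term):
    """Cost of evaluating every term in one go (arbitrary time units)."""
    return n_terms * cost_per_term


def parallel_cost(n_terms, cost_per_term, overhead_per_chunk, chunk_size, n_threads):
    """Toy wall-time estimate for a chunked, threaded sum.

    Assumption: chunks run in waves of at most n_threads at a time, and
    every chunk pays a fixed overhead (scheduling, copying, collecting).
    """
    n_chunks = math.ceil(n_terms / chunk_size)
    waves = math.ceil(n_chunks / n_threads)
    return waves * (overhead_per_chunk + chunk_size * cost_per_term)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 terms, 4 threads, overhead of 50 units per chunk.
    cheap, expensive = 1, 100  # cost per likelihood term
    for label, cost in [("cheap term", cheap), ("expensive term", expensive)]:
        base = serial_cost(1000, cost)
        for chunk in (4, 250):
            par = parallel_cost(1000, cost, 50, chunk, 4)
            print(f"{label}, chunk={chunk}: serial={base}, parallel={par}")
```

In this toy setup, a cheap term with chunks of 4 comes out slower than the serial evaluation, while either larger chunks or a more expensive term brings the threaded version ahead. The real break-even point depends on the model and the machine, so it still has to be measured.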

Just compare the model's performance to the case of no threading at all: replace the reduce_sum call with a call to the partial log-likelihood function that evaluates the full thing at once. If you beat this runtime with multiple cores in use, then you are good.

I see many people getting carried away with optimizing the grainsize. That is not a wise way to spend resources. Just get the grainsize roughly right so that you see a speedup. All you need is to get to a point where you can work efficiently, whether by brute force (within-chain threading), subsetting your data cleverly, tuning the model, or modeling things better. I warn most people against over-engineering “grainsize”. Do not forget the other factors. Have you already profiled your likelihood evaluation so that you know where your runtime is going?


I don't think there is an exact answer that doesn’t require a lot of work for every particular function and compute environment. For a start, you need to benchmark how much it costs to set up the multi-threaded environment: copying any data needed by individual threads, launching threads, shutting down threads, and collecting the threads' results.

I’d follow @wds15’s answer ^ to test it vs. no threading and try a handful of grain sizes.
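To make the "test vs. no threading and try a handful of grain sizes" advice concrete, here is a small Python sketch of the bookkeeping. It is only an illustration: both helper names are hypothetical, the start-near-terms-per-thread-and-halve rule is an assumed heuristic rather than anything prescribed in this thread, and the timings in the usage lines are placeholders to be replaced with your own measured fit times.

```python
def candidate_grainsizes(n_terms, n_threads, n_candidates=4):
    """Return a short list of grain sizes to try.

    Assumed heuristic: start near n_terms / n_threads and halve repeatedly.
    """
    sizes = []
    size = max(1, n_terms // n_threads)
    while len(sizes) < n_candidates:
        if size not in sizes:
            sizes.append(size)
        if size == 1:
            break
        size //= 2
    return sizes


def beats_baseline(timings, baseline_seconds):
    """Return (grainsize, seconds) pairs faster than the unthreaded run, fastest first."""
    faster = [(g, t) for g, t in timings.items() if t < baseline_seconds]
    return sorted(faster, key=lambda pair: pair[1])


if __name__ == "__main__":
    print(candidate_grainsizes(n_terms=1000, n_threads=4))
    # Placeholder timings in seconds; substitute measured wall times per grainsize.
    measured = {250: 12.0, 125: 9.5, 62: 15.0}
    print(beats_baseline(measured, baseline_seconds=11.0))
```

If the second list comes back empty, no grainsize you tried beat the unthreaded baseline, which points at the model rather than the grainsize.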