Have you been using some of the latest features of Stan?

We rely heavily on reduce_sum. Before reduce_sum, our standard models took about 4-5 days to fit in Stan, which was too long for our production modeling cycle, so we kept using custom Gibbs samplers instead (1-2 days). Now, with reduce_sum running 8 threads per chain, we see about a 5x speedup over the same models without reduce_sum. Our sampling is now slightly faster than our Gibbs samplers, and Stan converges much better. We get slightly faster estimation with 12 threads, and with 16-20 threads for larger problems, but rarely much gain beyond that. And of course, with too many threads things eventually get slower. We have also tuned the grainsize by hand, which runs faster than Stan's automatic default. For our standard problems, each thread processes about 4 data chunks per iteration (8 threads = 32 data chunks).
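The chunk arithmetic above can be sketched in a few lines of Python. This is only an illustration of the "about 4 chunks per thread" rule of thumb, not the actual tuning procedure used here: the helper names `grainsize_for` and `n_chunks` and the item count of 20,000 are hypothetical.

```python
import math


def grainsize_for(n_items: int, threads: int, chunks_per_thread: int = 4) -> int:
    """Slice size that splits n_items into about threads * chunks_per_thread chunks."""
    if n_items < 1 or threads < 1 or chunks_per_thread < 1:
        raise ValueError("all arguments must be positive")
    target_chunks = threads * chunks_per_thread
    # Round up so that no more than target_chunks slices are needed.
    return max(1, math.ceil(n_items / target_chunks))


def n_chunks(n_items: int, grainsize: int) -> int:
    """Number of slices produced when n_items is cut into pieces of size grainsize."""
    return math.ceil(n_items / grainsize)


n_items = 20_000  # hypothetical number of units being sliced over
g = grainsize_for(n_items, threads=8)
print(g, n_chunks(n_items, g))  # prints: 625 32
```

The resulting value would be passed to the model as the grainsize argument of reduce_sum. Stan's documentation describes grainsize as a suggested slice size, so the actual number of chunks per iteration may differ from this estimate.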

All this is based on desktops with an AMD Threadripper 3970X (under WSL) and AWS compute-optimized instances (c5.4xlarge), all running cmdstanr. I can’t thank the Stan developers enough for reduce_sum. It took Stan from a nice idea that wasn't ready for production to a practical tool.
https://discourse.mc-stan.org/t/cmdstan-2-23-release-candidate-is-available/14301
