Hardware Advice

I am buying a new computer, with Stan as a primary use case. I was hoping to get the community’s latest advice on which specs to prioritize.

I know in a prior thread, the consensus was: Get ~4 CPU cores so each MCMC chain has its own physical core; after that additional cores hit diminishing returns because of repeated warmup work. Also, CPU cache matters a lot.

In September 2020, is that still the right approach?

With improving MPI, map_rect, etc., does it now make sense to get more cores?

Additionally, I’ve read some speculation that running a massive number of chains in parallel might start to make sense given modern hardware. I’m curious whether anyone has practical experience with such approaches on, say, a 64-core Threadripper?

“The availability of massively multi-chain MCMC provides new opportunities. For example, running many parallel chains may enable adaptive MCMC techniques to achieve faster convergence and lower bias, or allow for low-variance estimates with relatively short chains.”

Does Stan make use of modern AVX SIMD for vectorized code? If I get a CPU with AVX 512 would Stan leverage that? AVX 256? Is there any special incantation I need to utter at install or compile time to make Stan use these? I found a prior discussion in these forums but it seemed inconclusive: Stan SIMD & Performance

Is Stan GPU support at the point where it’s practical and delivering performance improvement in everyday use? Or is it still more alpha / experimental? What GPUs tend to work best? Is vRAM more important, or number of CUDA Cores? (I assume Stan is not going to leverage the low precision throughput modern GPUs seem to be prioritizing as they cater to deep learning use cases…)

I’m doing a lot of hierarchical modeling. Data is small-to-medium (a few thousand observations at most) but the models are complicated and I’m constantly fighting bad posterior curvature even after trying all the algorithmic reparameterization tricks. So I keep getting really long NUTS trajectories (I usually have to lift max tree depth above default settings). I’m thinking that as long as Stan is using AVX 256 SIMD, I won’t get much benefit from GPUs, but it might be worthwhile if I got like 12 CPU cores and could leverage map_rect for a ~3x speedup vs 4 cores?
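
To make that ~3x guess concrete, here is a rough Amdahl's-law sketch in Python. It assumes 4 chains share the cores evenly, and the parallel fractions it loops over are made-up illustration values, not measurements from any Stan model. The function names are hypothetical too.

```python
def threads_per_chain(cores, chains=4):
    """Cores left for within-chain parallelism when each chain gets an equal share."""
    return max(1, cores // chains)


def amdahl_speedup(parallel_fraction, threads):
    """Ideal within-chain speedup under Amdahl's law.

    parallel_fraction: share of per-iteration work that can be split across
    threads (an assumed number; it depends entirely on the model).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    t = threads_per_chain(12)  # 12 cores, 4 chains
    for p in (0.5, 0.9, 1.0):  # assumed parallel fractions
        print(f"p={p:.1f}, threads={t}: speedup {amdahl_speedup(p, t):.2f}x")
```

So 3x is the ceiling with 12 cores and 4 chains, reached only if all of the per-iteration work parallelizes. If 90% of it does, the same formula gives 2.5x.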



I got a very nice speedup (factor > 3 in sampling time) using multithreading with reduce_sum() on a specific model that was well suited to that approach. That was on a self-built (OK, assembled) PC with a 16-core CPU (Intel i9-7960X) and mostly gamer-oriented parts from Newegg. For my purposes, running 4 chains with 4 threads per chain made the best use of my hardware resources. I didn’t see any further improvement using more threads per chain, even though that chip supports “hyperthreading.” A writeup with some benchmarks is here: https://extragalactic.blog/2020/06/04/multithreading-comes-to-stan/.
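
For anyone sizing a machine the same way, this small Python sketch lists the chains x threads-per-chain splits that use every physical core exactly once. The helper name is hypothetical, and it deliberately ignores hyperthreads, since they did not help in my runs.

```python
def thread_layouts(physical_cores):
    """Return (chains, threads_per_chain) pairs whose product equals physical_cores."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be >= 1")
    layouts = []
    for chains in range(1, physical_cores + 1):
        if physical_cores % chains == 0:
            layouts.append((chains, physical_cores // chains))
    return layouts


if __name__ == "__main__":
    for chains, threads in thread_layouts(16):
        print(f"{chains} chains x {threads} threads per chain")
```

On a 16-core chip the 4 x 4 split is one of the candidates. Which split is actually fastest depends on the model, so it is worth timing a couple of them.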


Awesome, thank you very much for the pointer to your very helpful blog post.

I see you went with 64 GB of RAM on your 16-core machine. Did you find that RAM requirements increased as you used reduce_sum to parallelize within-chain computation, so you needed the extra headroom? Or does simply splitting up the computation across more CPU cores not require any additional RAM?

I saw you mentioned the AMD option… if you were buying again today do you think you’d stick with Intel or go to AMD?

Thanks again.

No, that whole PC build was an exercise in intentional overkill. I don’t think RAM usage increased significantly with the multithreaded code vs. the same model and data run in rstan without multithreading.

AMD has been leading the way in increasing core counts in their consumer CPUs the last few years, so yeah they’re worth considering. I’d probably check out some enthusiast websites for multithreading benchmark results before deciding though. Good luck.

Unfortunately, this is speculation driven largely by a desire to rationalize algorithms matched to existing hardware, rather than hardware matched to the algorithms appropriate for accurate statistical computation.

In my experience, beyond 4ish chains the best opportunity for speedup is through the parallelization of the target density calculation offered in functions like reduce_sum, as others have noted. The number of useful cores then depends on the conditional structure of your Stan program, and how much of the computation can be compartmentalized.
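
As a rough picture of what "compartmentalized" means here, the Python sketch below splits a sum of conditionally independent log-density terms into slices, evaluates each slice separately, and adds up the partial sums. It is only an analogue of the idea behind reduce_sum, not Stan's implementation. The normal likelihood, the function names, and the worker count are assumptions for illustration, and Python threads will not show a real speedup on this kind of work, so it demonstrates the structure only.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def normal_lpdf(y, mu, sigma):
    """Log density of a single normal observation."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z


def partial_sum(y_slice, mu, sigma):
    """Log-likelihood contribution of one slice of the data."""
    return sum(normal_lpdf(y, mu, sigma) for y in y_slice)


def sliced_log_lik(y, mu, sigma, grainsize, workers=4):
    """Sum the log likelihood over slices of at most `grainsize` observations."""
    if grainsize < 1:
        raise ValueError("grainsize must be >= 1")
    slices = [y[i:i + grainsize] for i in range(0, len(y), grainsize)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda s: partial_sum(s, mu, sigma), slices)
        return sum(parts)


if __name__ == "__main__":
    data = [0.1 * i for i in range(100)]
    print(sliced_log_lik(data, mu=0.5, sigma=2.0, grainsize=7))
    print(partial_sum(data, mu=0.5, sigma=2.0))  # same value, computed serially
```

The split works because each term depends only on its own observation and the shared parameters. Parts of a model that cannot be written as such a sum stay serial, which is what limits the number of useful cores.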

Regarding GPUs, it’s my understanding that the latest generation of AMD’s consumer-level graphics cards (Radeon RX 5000 series with Navi architecture) doesn’t have very good OpenCL support.

Where have you heard that?

The 5000 series has OpenCL 2.0 support.

I was basing that on reports like this and this GitHub issue. It appears that some updates within the last few weeks may have addressed the issue, though.

The first one seemed like a driver bug, but that was fixed.

The second one is related to ROCm. But you do not need ROCm to run OpenCL on AMD GPUs. The proprietary AMDGPU-PRO driver is still the preferred way and, at least based on the latest Phoronix benchmarks, faster than ROCm.

I believe the community-contributed TensorFlow additions for AMD GPUs are based on ROCm, which is why that may be a bigger deal there. But that does not matter for Stan.

Thanks for the pointers. I’m definitely going NVIDIA. Even if Stan supports AMD perfectly, the rest of the data science / machine learning ecosystem is very much locked in to CUDA, and it would be a shame to lose access to all that software.

Thanks for the clarification. I don’t do much with GPU computation right now, but wanted to make sure I kept my options open when shopping for a new graphics card last month.