I am buying a new computer, with Stan as a primary use case. I was hoping to get the community’s latest advice on which specs to prioritize.
In a prior thread, the consensus was: get ~4 CPU cores so each MCMC chain has its own physical core; beyond that, additional cores hit diminishing returns because every extra chain has to repeat the warmup work. Also, CPU cache matters a lot.
In September 2020, is that still the right approach?
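To make sure I understand the diminishing-returns argument, here is a rough back-of-the-envelope sketch. It assumes one chain per core, Stan's default of 1000 warmup iterations per chain, a fixed target of 4000 post-warmup draws in total, and equal cost per iteration. The function name is my own, and none of this is a benchmark.

```python
def iterations_per_chain(n_chains, warmup=1000, total_draws=4000):
    """Iterations each chain runs when all chains run in parallel.

    Assumption: every chain pays the full warmup cost, and the
    post-warmup draws are split evenly across chains.
    """
    draws_per_chain = -(-total_draws // n_chains)  # ceiling division
    return warmup + draws_per_chain


baseline = iterations_per_chain(4)
for n in (4, 8, 16, 64):
    t = iterations_per_chain(n)
    # Wall time is proportional to iterations per chain under these assumptions.
    print(f"{n:>2} chains: {t} iterations each, {baseline / t:.2f}x vs 4 chains")
```

Under these assumptions, even 64 chains come out at less than a 2x speedup over 4 chains, because the warmup never shrinks. That is the ceiling I am asking about.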
With improving MPI, map_rect, etc., does it now make sense to get more cores?
Additionally, I’ve read some speculation that running a massive number of chains in parallel might start to make sense given modern hardware. I’m curious whether anyone has practical experience with such approaches on, say, a 64-core Threadripper.
“The availability of massively multi-chain MCMC provides new opportunities. For example, running many parallel chains may enable adaptive MCMC techniques to achieve faster convergence and lower bias, or allow for low-variance estimates with relatively short chains.”
Does Stan make use of modern AVX SIMD instructions for vectorized code? If I get a CPU with AVX-512, would Stan leverage that? What about 256-bit AVX? Is there any special incantation I need to utter at install or compile time to make Stan use these? I found a prior discussion in these forums, but it seemed inconclusive: Stan SIMD & Performance
Is Stan’s GPU support at the point where it’s practical and delivers performance improvements in everyday use? Or is it still more alpha / experimental? Which GPUs tend to work best? Is VRAM more important, or the number of CUDA cores? (I assume Stan is not going to leverage the low-precision throughput that modern GPUs seem to be prioritizing as they cater to deep learning use cases…)
I’m doing a lot of hierarchical modeling. Data is small-to-medium (a few thousand observations at most) but the models are complicated and I’m constantly fighting bad posterior curvature even after trying all the algorithmic reparameterization tricks. So I keep getting really long NUTS trajectories (I usually have to lift max tree depth above default settings). I’m thinking that as long as Stan is using AVX 256 SIMD, I won’t get much benefit from GPUs, but it might be worthwhile if I got like 12 CPU cores and could leverage map_rect for a ~3x speedup vs 4 cores?
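To sanity-check my own ~3x guess: going from 4 to 12 cores with 4 chains gives each chain 3 workers, so 3x is only the upper bound when the whole gradient evaluation parallelizes. Here is a minimal sketch using Amdahl's law. The parallel fractions are hypothetical numbers I picked for illustration, and it ignores any communication overhead from map_rect.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup when only part of the work is parallelizable."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


# 12 cores / 4 chains = 3 workers per chain (assumed even split).
workers_per_chain = 12 // 4
for p in (0.5, 0.9, 1.0):  # hypothetical parallelizable fractions
    s = amdahl_speedup(p, workers_per_chain)
    print(f"parallel fraction {p:.1f}: {s:.2f}x per chain")
```

So if, say, 90% of each iteration sits inside the map_rect call, the best case would be about 2.5x rather than 3x. Does that match what people see in practice?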
Thanks!