Hardware Advice

I am buying a new computer, with Stan as a primary use case. I was hoping to get the community’s latest advice on which specs to prioritize.

I know that in a prior thread, the consensus was: get ~4 CPU cores so each MCMC chain has its own physical core; beyond that, additional cores hit diminishing returns because each extra chain still repeats the warmup work. Also, CPU cache matters a lot.

In September 2020, is that still the right approach?

With improvements to MPI, map_rect, etc., does it now make sense to get more cores?

Additionally, I’ve read some speculation that running a massive number of chains in parallel might start to make sense given modern hardware. I’m curious if anyone has practical experience with such approaches on, say, a 64-core Threadripper?

“The availability of massively multi-chain MCMC provides new opportunities. For example, running many parallel chains may enable adaptive MCMC techniques to achieve faster convergence and lower bias, or allow for low-variance estimates with relatively short chains.”

Does Stan make use of modern AVX SIMD for vectorized code? If I get a CPU with AVX-512, would Stan leverage that? 256-bit AVX (AVX2)? Is there any special incantation I need to utter at install or compile time to make Stan use these? I found a prior discussion in these forums but it seemed inconclusive: Stan SIMD & Performance

Is Stan GPU support at the point where it’s practical and delivers performance improvements in everyday use? Or is it still more alpha/experimental? What GPUs tend to work best? Is VRAM more important, or the number of CUDA cores? (I assume Stan is not going to leverage the low-precision throughput modern GPUs seem to be prioritizing as they cater to deep learning use cases…)

I’m doing a lot of hierarchical modeling. Data is small-to-medium (a few thousand observations at most), but the models are complicated and I’m constantly fighting bad posterior curvature even after trying all the algorithmic reparameterization tricks. So I keep getting really long NUTS trajectories (I usually have to lift max tree depth above the default setting; see the snippet below). I’m thinking that as long as Stan is using 256-bit AVX SIMD, I won’t get much benefit from GPUs, but it might be worthwhile if I got, say, 12 CPU cores and could leverage map_rect for a ~3x speedup vs. 4 cores?
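For concreteness, the kind of thing I end up doing on nearly every fit looks like this (rstan; `mod` and `stan_data` are just placeholder names for my compiled model and data):

```r
library(rstan)

# default max_treedepth is 10; I routinely have to raise it
fit <- sampling(mod, data = stan_data, chains = 4,
                control = list(max_treedepth = 15))
```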

Thanks!


I got a very nice speedup (factor > 3 in sampling time) using multithreading with reduce_sum() on a specific model that was well suited to that approach. That was on a self-built (OK, assembled) PC with a 16-core CPU (Intel i9-7960X) and mostly gamer-oriented parts from Newegg. For my purposes, running 4 chains with 4 threads per chain made the best use of my hardware resources. I didn’t see any further improvement using more threads per chain, even though that chip supports “hyperthreading.” A writeup with some benchmarks is here: https://extragalactic.blog/2020/06/04/multithreading-comes-to-stan/.
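For anyone who hasn’t tried it, a minimal sketch of the reduce_sum pattern looks like the following. (This is not the model from my writeup; the normal likelihood and variable names are just placeholders.)

```stan
functions {
  // log-likelihood contribution of one slice of the data;
  // start and end index the slice's position in the full data
  real partial_sum(real[] y_slice, int start, int end,
                   vector mu, real sigma) {
    return normal_lpdf(y_slice | mu[start:end], sigma);
  }
}
data {
  int<lower=1> N;
  real y[N];
}
parameters {
  vector[N] mu;
  real<lower=0> sigma;
}
model {
  int grainsize = 1;  // 1 = let the scheduler choose slice sizes
  mu ~ std_normal();
  sigma ~ exponential(1);
  target += reduce_sum(partial_sum, y, grainsize, mu, sigma);
}
```

You also need to build with threading enabled (e.g. STAN_THREADS=true in CmdStan’s make/local) and tell your interface how many threads per chain to use; otherwise reduce_sum just runs serially.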


Awesome, thank you very much for the pointer to your very helpful blog post.

I see you went with 64 GB of RAM on your 16-core machine. Did you find that RAM requirements increased as you used reduce_sum to parallelize within-chain computation, so you needed the extra headroom? Or does simply splitting up the computation across more CPU cores not require any additional RAM?

I saw you mentioned the AMD option… if you were buying again today, do you think you’d stick with Intel or go to AMD?

Thanks again.

No, that whole PC build was an exercise in intentional overkill. I don’t think RAM usage increased significantly with the multithreaded code vs. the same model and data run in rstan without multithreading.

AMD has been leading the way in increasing core counts in their consumer CPUs the last few years, so yeah they’re worth considering. I’d probably check out some enthusiast websites for multithreading benchmark results before deciding though. Good luck.

Unfortunately this is speculation largely driven to rationalize algorithms matched to existing hardware rather than hardware matched to the algorithms appropriate to accurate statistical computation.

In my experience, beyond 4ish chains the best opportunity for speedup is parallelizing the target density calculation with functions like reduce_sum, as others have noted. The number of useful cores then depends on the conditional structure of your Stan program and how much of the computation can be compartmentalized.

Regarding GPUs, it’s my understanding that the latest generation of AMD’s consumer-level graphics cards (Radeon RX 5000 series with Navi architecture) doesn’t have very good OpenCL support.

Where have you heard that?

The 5000 series has OpenCL 2.0 support.

I was basing that on reports like this and this GitHub issue. It appears that some updates within the last few weeks may have addressed the issue, though.

The first one seemed like a driver bug. But that was fixed.

The second one is related to ROCm. But you do not need ROCm to run OpenCL on AMD GPUs. The proprietary AMDGPU-PRO driver is still the preferred way and, at least based on the latest Phoronix benchmarks, faster than ROCm.

I believe the community-contributed TensorFlow additions for AMD GPUs are based on ROCm, which is why ROCm support may be a bigger deal there. But that does not matter for Stan.

Thanks for the pointers. I’m definitely going NVIDIA. Even if Stan supports AMD perfectly, the rest of the data science / machine learning ecosystem is very much locked in to CUDA, and it would be a shame to lose access to all that software.

Thanks for the clarification. I don’t do much with GPU computation right now, but wanted to make sure I kept my options open when shopping for a new graphics card last month.

Better safe than sorry. I similarly bought 128 GB of RAM as overkill for a build with a 10980XE, but for the past few weeks I have been struggling with running out of RAM when fitting QSP models. I think there are software issues involved, but I have seriously been considering buying the maximum supported 256 GB (fast 32 GB DIMMs are unfortunately very expensive, though).
It’s overkill until you suddenly need it.

Because the Stan SIMD performance thread was linked, I’ll add:
As of that time, Stan didn’t really take advantage of SIMD, and it almost certainly won’t use AVX-512.

More cores will probably serve you better, since they can speed up the likes of reduce_sum.

Do note that the newest version of Eigen does have better support for AVX-512, and I’m guessing the compiler can generate it when appropriate, right? Though I don’t think it’s by itself a reason to buy a computer.

Thanks guys. For my purposes (I think a low ability to utilize reduce_sum-style strategies), it sounds like I should generally optimize for single-core (or “quad-core”) performance rather than optimizing total potential throughput with a very large number of cores. That would seem to suggest Intel over AMD, not just because of AVX-512, but also because of generally faster single- (quad-?) core turbo GHz. Although those AMDs have gobs of cache, which I also know is important.

Keep in mind that AMD is going to release Zen3 soon (there is a conference on October 8th).

The new chips are expected to bring a nice increase in IPC and clock speed, so it might be worth the wait.

It isn’t enabled by default. You’d have to add -DEIGEN_VECTORIZE_AVX512 to the compiler flags.
If you’re primarily bottlenecked by large linear algebra routines, then this may help.
I haven’t checked, but IIRC Eigen should be able to use SIMD because Stan packs the arrays so the data is contiguous before calling Eigen.

The actual computation would have to consume a substantial chunk of your runtime (and you’d have to be sure to add the aforementioned flag) to see overall speed-ups from it.
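For CmdStan, that would look something like the following in make/local. (An untested sketch: -march=native assumes you compile on the machine you’ll run on, and whether Eigen’s AVX-512 kernels kick in depends on your Eigen version.)

```
# CmdStan make/local -- untested sketch
# let the compiler emit AVX-512 instructions for the host CPU
CXXFLAGS += -march=native
# and ask Eigen to use its AVX-512 kernels
CXXFLAGS += -DEIGEN_VECTORIZE_AVX512
```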

Regarding AVX-512, note that only certain HEDT and Xeon chips actually have two FMA units per core.
Ice Lake laptop chips only have one, and I believe that’s true for Tiger Lake CPUs as well (but I’m not sure).
In that case, the actual FMA (fused multiply-add) throughput wouldn’t be better, so GEMM (matrix multiplication) itself should actually be similar between CPUs with two 256-bit FMA units and one 512-bit unit.
[On the plus side, Ice Lake seems to have mostly fixed the infamous issue of AVX-512 downclocking.]
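To put rough numbers on it: two 256-bit units retire 2 × 4 = 8 double-precision FMAs per cycle, while one 512-bit unit retires 1 × 8 = 8, i.e. 16 FLOPs per cycle per core either way; only the chips with two 512-bit units double that to 32.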

Like daniel_h said, I’d definitely wait until at least this Thursday (October 8th) and see AMD’s announcement.


How come? Is your problem not amenable to it?

In case you are a brms user… there is good news: reduce_sum is available as an experimental feature in the forthcoming version of brms. It’s already merged into the master branch as of now.
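Since it’s experimental, the interface may still change, but as of now it looks roughly like this (the formula and data are placeholders):

```r
# requires the development (master) version of brms and the CmdStan backend
library(brms)

fit <- brm(
  y ~ x + (1 | group),    # placeholder formula
  data = mydata,          # placeholder data
  backend = "cmdstanr",
  threads = threading(4)  # 4 threads per chain via reduce_sum
)
```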

Small data, but complicated models with lots of bad posterior curvature, so I need lots of leapfrog steps per accepted sample. Which (if I understand correctly) is hard to parallelize. (And yes, I’ve tried recentering.)

Bad curvature and the need for lots of leapfrog steps should not matter here for how well something can take advantage of reduce_sum. What can be a problem is small data, but I have seen cases with very little data that still gained some speed (a mixture logistic regression model). The critical thing is whether your likelihood is memory-bound or not. Memory-bound likelihoods include bernoulli_logit or normal, where there is just not a lot to calculate in comparison to what needs to be moved around in memory. Not memory-bound are the Poisson or negative_binomial and the like (log-gamma functions).


What helps with memory-bound models? More L3 cache? Faster RAM (assuming the CPU bus is not a bottleneck)? Putting the model on a GPU, which has its own fast memory?