Best CPU for RStan/brms

Longshot408 · March 21, 2023, 12:05am

Hi all, just a quick question about recommendations for CPUs. I’m looking into building some new computers for my research lab and I’m having trouble finding benchmarks for CPUs to help me decide. So far I’ve seen the Chromium Code Compile benchmark in Gamers Nexus reviews (e.g., AMD Ryzen 9 7900X3D CPU Review & Benchmarks: Spoiled by the 5800X3D - YouTube), and the Microsoft Visual Studio C++ benchmark on TechPowerUp (AMD Ryzen 9 7950X3D Review - Best of Both Worlds - Software & Game Development | TechPowerUp). Are either of these representative of R generally, or running brms models specifically?

Obviously I can’t go wrong by going as expensive as possible with the Ryzen 7950X3D. But I don’t feel like spending $800 on 16 cores when I’m only running 4-5 chains, and half as many cores will do fine. Wondering what the best balance is between price and performance.

franzsf · March 21, 2023, 5:30pm

There are some good posts (perhaps dated) here:

I think the consensus revolved around (1) high cache (i.e., L3) on the CPU, (2) moar cores, (3) RAM.

I’ve been impressed with the speedups from within-chain parallelization in brms, so even if you’re running 4 chains, additional cores help.
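
For anyone who hasn't tried it, here is a minimal sketch of what within-chain parallelization looks like in brms, assuming the cmdstanr backend is installed. The model and the `epilepsy` data (shipped with brms) are only placeholders, and 2 threads per chain is an arbitrary choice, not a recommendation:

```r
library(brms)

# Placeholder model on the epilepsy data that ships with brms.
fit <- brm(
  count ~ zAge + zBase * Trt + (1 | patient),
  data = epilepsy,
  family = poisson(),
  chains = 4,
  cores = 4,                 # one process per chain
  backend = "cmdstanr",
  threads = threading(2)     # 2 threads within each chain, so up to 8 threads busy
)
```

With 4 chains and 2 threads each, this can keep 8 cores busy. Whether that pays off depends on the model and the amount of data.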

Longshot408 · March 21, 2023, 7:06pm

Thank you!!

The high cache thing is particularly interesting. After seeing that cache made no difference at all in the Chromium code compile benchmarks I thought it wouldn’t matter here either. I guess I should go for AMD’s latest round of X3D chips then.

ma-riviere · March 26, 2023, 11:37am

I’d be very interested to see a benchmark comparison between the 7950X and the 7950X3D. The 3D cache feels like it could substantially improve Stan’s sampling speed.

And also one vs the 13900K, since it has the same number of threads but more physical cores.

Longshot408 · March 26, 2023, 12:34pm

Same. I’d also like to know how much of a role RAM speed and latency play; in one Hardware Unboxed video I saw recently, having faster RAM (6000 MHz DDR5) gave between modest and substantial performance uplifts in gaming, depending on the game, for Ryzen 7000 chips.

Shame there’s no real benchmarks I can find anywhere. The suggestion just seems to be “get the fastest, most expensive thing with all the cores, most cache, and highest boost clocks.” Would love to know where the best performance per dollar ends up though…having a hard time justifying $800 for 7950X3D over the cheaper 7900X3D with no data.

Bob_Carpenter · December 10, 2024, 8:01pm

This requires enough memory bandwidth, so the bus speed and caching are critical. This is why the ARM chips are so good for this kind of thing—faster and wider memory/CPU connection.

This depends on how the data’s organized. If you have a model that has 500MB of data and you hit it randomly in the model, that’s going to be a lot of memory pressure due to cache misses. On the other hand, if you have 500MB of data and access it strictly sequentially, it won’t induce a lot of cache misses, but might be a problem with too much parallelism just due to data quantity and bus contention.

That’s not a suggestion! On the other hand, cache and CPU tend to grow together on chips and it’s hard to get one without the other. What you’ll find is that if you have 16 CPUs on a traditional front-side bus memory architecture, you’ll be bottlenecked in memory—the 16 CPUs will spend all their time waiting for the memory to take turns merging into the cache, just like a traffic jam merging onto an expressway.

I have a 4-year-old iMac with 8 physical 3.2 GHz Xeon X cores and 64 GB of 2666 MHz DDR4. This is still relatively fast memory, but it bottlenecks at about 4 chains of Stan. That is, running 8 chains in parallel takes almost as long as running 4 chains until they stop and then running 4 more. This is all because of memory contention.
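
One way to put a number on that comparison is a small helper like the one below. This is only a sketch: the function name `batch_ratio` and the timings are hypothetical, not measurements from any machine in this thread.

```r
# Compare running all chains at once against running them in two batches.
# t_all:   wall time for 8 chains in parallel (seconds)
# t_batch: wall time for one batch of 4 chains in parallel (seconds)
# A ratio near 1 means the extra 4 cores bought nothing;
# a ratio near 0.5 means near-ideal scaling.
batch_ratio <- function(t_all, t_batch) {
  t_all / (2 * t_batch)
}

# Hypothetical timings in seconds.
batch_ratio(t_all = 190, t_batch = 100)
```

With these made-up numbers the ratio is 0.95, i.e. 8 parallel chains would save only about 5% over two batches of 4, which is the kind of memory-bound behaviour described above.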

ssp3nc3r · December 11, 2024, 4:27am

I think the “suggestion” is also assuming that because someone asked the question they are not satisfied with wall time in fitting their models. On the other hand, many models can fit fast and fine with older compute. So I’d buy based on actual need. If you’re just learning and exploring this can be great. It’s more about once you aren’t satisfied with wall time, understanding the hardware aspects that Bob summarised well.

George_GL · December 30, 2024, 1:01pm

Hello everyone,

I have four PCs at home from various generations, so I decided to run a task on each and measure the execution time. While this is by no means a definitive test, it should give a general impression of what these CPUs can do with brms.

The M1 Air in my test has 16 GB of RAM, the 16-core Ryzen system has 128 GB, and the M4Pro has 24 GB. The data and syntax for running the analysis on your own machine are attached. The data and model are essentially a modified version of those published here.

The "single" run just sets chains = 1, cores = 1, so the difference from the multi-chain run is just the overhead lost in computing the chains side by side and putting it all together.
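
For readers who want to try this kind of comparison without the attachments, a rough sketch follows. It uses a placeholder model on the `epilepsy` data that ships with brms, and the helper `time_fit` is hypothetical; the attached benchmark.R may be set up differently.

```r
library(brms)

# Placeholder model; returns elapsed wall time in seconds for one fit.
time_fit <- function(chains, cores) {
  system.time(
    brm(count ~ zAge + zBase * Trt + (1 | patient),
        data = epilepsy, family = poisson(),
        chains = chains, cores = cores, refresh = 0)
  )[["elapsed"]]
}

t_single <- time_fit(chains = 1, cores = 1)
t_multi  <- time_fit(chains = 4, cores = 4)
t_multi - t_single   # extra wall time from running 4 chains side by side
```

Each call also compiles the model, so compile time is included in both numbers and only roughly cancels in the difference.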

I am particularly impressed with the M4Pro from Apple, as it bests my 12 core monster with 128 GB RAM, all in a portable format.

benchmark.csv (1.2 MB)
benchmark.R (1.3 KB)

Happy New Year!
G.

andre.pfeuffer · December 31, 2024, 6:36am

What are your makefile settings? Have you used some compiler optimization?
On my 5800H Notebook with 28W power consumption limit and makefile settings
CFLAGS+=-fPIC -O3 -mtune=native -march=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
it runs in 84s, which I hadn’t expected to be that fast.

George_GL · January 1, 2025, 9:51pm

Ah no, just vanilla stuff. The latest versions of the packages and that is all. I am not familiar with such mods like makefile settings. Should I?

andre.pfeuffer · January 2, 2025, 6:17am

Speed up is significant as shown here:

It’s also important to know the compiler version, operating system and system.
Adding CFLAGS+=-fPIC -O3 -mtune=native -march=native to the file make/local
and forcing a rebuild of Stan with rebuild_cmdstan() (in CmdStanR), or similar in other environments, should be sufficient.
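
From within R, the same steps can be sketched with CmdStanR roughly as follows. This assumes CmdStanR and CmdStan are already installed. Note that the sketch sets CXXFLAGS, since Stan models are compiled as C++, whereas the line above sets CFLAGS; treat the choice of variable as an assumption and check it against your own setup.

```r
library(cmdstanr)

# Append optimisation flags to CmdStan's make/local.
# CXXFLAGS is an assumption here; the post above uses CFLAGS.
cmdstan_make_local(
  cpp_options = list("CXXFLAGS += -O3 -march=native -mtune=native"),
  append = TRUE
)

# Rebuild CmdStan so the new flags take effect.
rebuild_cmdstan()
```

Models compiled before the change need to be recompiled to pick up the new flags.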

Bob_Carpenter · January 5, 2025, 7:09pm

It’s so nice to see a concrete report. Thanks so much for reporting this @George_GL.

avehtari · February 12, 2025, 4:54pm

I’ll add the link to the blog post collecting the information about the different options for improved speed: Options for improving Stan sampling speed – The Stan Blog

jonah · February 12, 2025, 7:34pm

@avehtari what do you think about linking to your blog post from the CmdStanR documentation?

avehtari · February 12, 2025, 8:03pm

Sure. It’s on mc-stan.org, so it should be stable. Would be good to check that all the advice is still valid.
