Obviously I can’t go wrong by going as expensive as possible with the Ryzen 7950X3D. But I don’t feel like spending $800 on 16 cores when I’m only running 4-5 chains, and half as many cores will do fine. Wondering what the best balance is between price and performance.
The high cache thing is particularly interesting. After seeing that cache made no difference at all in the Chromium code compile benchmarks, I thought it wouldn’t matter here either. I guess I should go for AMD’s latest round of X3D chips then.
I’d be very interested to see a benchmark comparison between the 7950X and the 7950X3D. The 3D cache feels like it could substantially improve Stan’s sampling speed.
And also one vs the 13900K, since it has the same number of threads but more physical cores.
Same. I’d also like to know how much of a role RAM speed and latency play; in one Hardware Unboxed video I saw recently, faster RAM (6000 MHz DDR5) gave anywhere from modest to substantial performance uplifts in gaming for Ryzen 7000 chips, depending on the game.
Shame there are no real benchmarks I can find anywhere. The suggestion just seems to be “get the fastest, most expensive thing with all the cores, most cache, and highest boost clocks.” Would love to know where the best performance per dollar ends up, though…having a hard time justifying $800 for the 7950X3D over the cheaper 7900X3D with no data.
This requires enough memory bandwidth, so the bus speed and caching are critical. This is why the ARM chips are so good for this kind of thing—faster and wider memory/CPU connection.
This depends on how the data’s organized. If you have a model that has 500MB of data and you hit it randomly in the model, that’s going to be a lot of memory pressure due to cache misses. On the other hand, if you have 500MB of data and access it strictly sequentially, it won’t induce a lot of cache misses, but might be a problem with too much parallelism just due to data quantity and bus contention.
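To make the access-pattern point concrete, here’s a minimal sketch (not Stan itself, just numpy, with the ~500MB size picked to match the example above) that sums the same array once sequentially and once through a random permutation. The random gather defeats the prefetcher and wastes most of each cache line, so it’s typically several times slower on identical data:

```python
import time
import numpy as np

n = 64_000_000                    # ~512 MB of float64, roughly the size above
data = np.random.rand(n)
perm = np.random.permutation(n)   # random visit order over the same elements

t0 = time.perf_counter()
seq_sum = data.sum()              # sequential pass: streams cache lines in order
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
rand_sum = data[perm].sum()       # gathered pass: near-random access pattern
t_rand = time.perf_counter() - t0

print(f"sequential: {t_seq:.2f} s   random-order: {t_rand:.2f} s")
```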
That’s not a suggestion! On the other hand, cache and CPU tend to grow together on chips and it’s hard to get one without the other. What you’ll find is that if you have 16 CPUs on a traditional front-side bus memory architecture, you’ll be bottlenecked in memory—the 16 CPUs will spend all their time waiting for the memory to take turns merging into the cache, just like a traffic jam merging onto an expressway.
I have a 4-year-old iMac with 8 physical 3.2 GHz Xeon cores and 64 GB of 2666 MHz DDR4. This is still relatively fast memory, but it bottlenecks at about 4 chains of Stan. That is, running 8 chains in parallel takes almost as long as running 4 chains to completion and then running 4 more. This is all because of memory contention.
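If anyone wants to reproduce that kind of comparison on their own machine, here’s a hedged sketch with cmdstanpy (model.stan and data.json are placeholders for whatever model you’re fitting): time 8 chains run at once versus two batches of 4. If the two wall times come out close, the extra chains are mostly stalled on memory rather than doing useful work:

```python
import time
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")   # placeholder model file

def timed(chains, parallel_chains):
    """Wall time for one sampling run with the given degree of parallelism."""
    t0 = time.perf_counter()
    model.sample(data="data.json", chains=chains,
                 parallel_chains=parallel_chains, seed=1234)
    return time.perf_counter() - t0

t_all_at_once = timed(chains=8, parallel_chains=8)        # 8 chains in parallel
t_two_batches = (timed(chains=4, parallel_chains=4)
                 + timed(chains=4, parallel_chains=4))    # 4 chains, then 4 more

print(f"8 parallel: {t_all_at_once:.0f} s   4 + 4: {t_two_batches:.0f} s")
```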
I think the “suggestion” also assumes that, because someone asked the question, they aren’t satisfied with the wall time for fitting their models. On the other hand, many models fit fast and fine on older compute, so I’d buy based on actual need. If you’re just learning and exploring, older hardware can be great. Once you aren’t satisfied with wall time, that’s when the hardware aspects Bob summarised come into play.