I have a gaming system at home I use maybe a couple of hours a week; would it be of any use to the Stan devs to have remote access to a system with an NVIDIA 2080?
I’ll let the devs working on the GPU stuff answer that but, either way, thank you for the offer!
@Stevo15025 et al., would this be helpful to have access to?
V jelly of the 2080 ;-)
It could be nice for the benchmark papers. @rok_cesnovar would we want that?
How hard would it be to set up? We have a bunch of AWS money, so if it’s hard I’d rather not take up your time setting it up.
@mike-lawrence Thank you very much for the offer!
We might ping you from time to time when we polish new features to run a performance script, so we can get performance numbers for a wider range of architectures. We currently don’t have access to a 20XX GPU, so that would help.
Just for the record: we usually do at least some performance tests on the NVIDIA GTX 1060, GTX 1070, Titan X, Tesla V100 (AWS), and AMD R9 Fury. Just got my hands on the AMD Radeon VII, will run performance tests today.
Cool, works for me
I crunch digits on an 8GB RTX 2070, so it’s not appreciably different from the RTX 2080 in terms of architecture (both Turing), but I’m also interested in running benchmarks if it’s helpful.
> Just got my hands on the AMD Radeon VII, will run performance tests today.
Very interested in seeing the results for that one
Still trying to figure out who to contact at AMD or Intel about an APU.
The first tests show that, compared to the NVIDIA Titan XP, the Radeon VII is roughly twice as fast on the cholesky decomposition primitive (50x vs. an i7 @ 3.6GHz), the mdivide_left_tri primitive (~100x vs. an i7 @ 3.6GHz), and the bernoulli GLM (10-14x vs. an i7 @ 3.6GHz). All that at roughly half the price.
Keep in mind that these were just some simple tests at arbitrary problem sizes, but it looks great, and it’s definitely a bargain if you are interested in running Stan on the GPU. This also confirms that going with OpenCL was the right choice.
That’s really good news. I take it that’s just plugging in the new GPU without changing code.
Is this a new Radeon product? Any idea if benchmarks other than ours show it outperforming the TitanXP? Or if the difference could be in the OpenCL implementations?
This one seems to rank them about the same. (It was just the first link I found searching.)
A lot of those benchmarks use floats, so it’s a little hard to compare for our use case; the article below seems to cover half floats, floats, and doubles. I copy-pasted the fp64 graphs below. They don’t test against a V100. I’m sure the V100 would beat it, but a V100 is $8K vs. a max of ~$800 for the VII.
Yes, apart from installing the drivers (switching from the Nvidia drivers to AMD’s), it was plug and MCMC.
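For anyone following along, enabling the GPU routines in CmdStan is just a couple of lines in `make/local`; the platform/device IDs below are placeholders and depend on your system (check `clinfo`):

```make
# make/local -- enable Stan's OpenCL (GPU) backend
STAN_OPENCL=true
# Which OpenCL platform/device to use; 0/0 is common on a
# single-GPU system, but verify with clinfo for your machine.
OPENCL_PLATFORM_ID=0
OPENCL_DEVICE_ID=0
```

After that, rebuild the model (a `make clean-all` first helps) so the OpenCL-backed routines get compiled in.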
Yes, the Radeon VII was released in February 2019, though its GPU architecture family (Vega/GCN) has been around for two years. The high-end compute-only GPUs in the same architecture family are the Radeon Instinct MI50 and MI60 (launched in November 2018), both in the $2000+ range.
AMD is also holding a launch of their new GPU architecture on the 7th of July.
@Stevo15025 already answered this one. A good number to get a feel for is the theoretical double precision performance listed here. For instance, the Radeon VII has a theoretical fp64 performance of 3.36 TFLOPS, while the Titan XP is listed as having 379.7 GFLOPS of fp64 performance. The Nvidia V100 has ~7 TFLOPS, the AMD Instinct MI50 6.7 TFLOPS.
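For a rough sense of scale, those theoretical fp64 numbers can be turned into performance-per-dollar figures. The prices below are approximate street prices I’m assuming for illustration, not official figures:

```python
# Rough perf-per-dollar comparison from the theoretical fp64 numbers
# quoted above. Prices (USD) are approximate street prices -- assumptions.
gpus = {
    # name: (fp64 GFLOPS, approx. price in USD)
    "Radeon VII":    (3360.0,  700),
    "Titan XP":      (379.7,  1200),
    "Tesla V100":    (7000.0, 8000),
    "Instinct MI50": (6700.0, 3500),
}

for name, (gflops, price) in gpus.items():
    print(f"{name:>14}: {gflops:7.1f} GFLOPS, {gflops / price:5.2f} GFLOPS/$")
```

By this (very crude) metric the Radeon VII is roughly 8-9x the Titan XP in raw fp64 throughput and far ahead of everything else per dollar.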
The details of the two new AMD GPUs launching on 7/7 are also out: the RX 5700 (~$380) will supposedly have an fp64 performance of 468.0 GFLOPS, and the XT variant ($450) 560 GFLOPS. Both will also support PCIe 4.0, which should mean less of a penalty for data transfers (once PCIe 4.0 is more widely supported).
The difference in FLOPS is not everything, of course. The size of the GPU’s global memory and its speed are also factors in compute performance. Keep in mind that with our current approach the bottlenecks are actually the CPU <-> GPU data transfers, which are limited by the speed of the PCIe bus. Hence a 10x difference in FLOPS is not going to result in a 10x speedup. Not to mention that primitives like mdivide_left_tri are not your typical embarrassingly parallel GPU problems where FLOPS are everything.
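A back-of-the-envelope sketch of why transfers dominate at moderate sizes, assuming an idealized PCIe 3.0 x16 bus (~16 GB/s) and the Radeon VII’s theoretical 3.36 fp64 TFLOPS (both peak figures, so real timings will be worse):

```python
# Time to ship an N x N double matrix over PCIe versus the time a
# Cholesky factorization (~N^3/3 flops) takes at peak fp64 rate.
# Both bandwidth and FLOPS are theoretical peaks -- assumptions.
PCIE3_X16_BPS = 16e9   # bytes/s, theoretical PCIe 3.0 x16
FP64_FLOPS = 3.36e12   # Radeon VII theoretical fp64 rate

def transfer_s(n):
    return n * n * 8 / PCIE3_X16_BPS   # 8 bytes per double

def cholesky_s(n):
    return (n ** 3) / 3 / FP64_FLOPS

for n in (1_000, 5_000, 20_000):
    t, c = transfer_s(n), cholesky_s(n)
    print(f"N={n:>6}: transfer {t * 1e3:8.3f} ms, compute {c * 1e3:8.3f} ms")
```

With these numbers the transfer is the larger cost up to roughly N = 5000, which is why raw FLOPS alone don’t predict the speedups we measure.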
From the benchmarks we have and based on the theoretical performance, I would currently recommend the Radeon VII to anyone wanting to run Stan on the GPU. And for anyone on a smaller budget, I would hold off for the 7/7 launch, as $400 GPUs with Titan XP fp64 performance numbers seem like a huge bargain. If you also want to run games or do fp32 or fp16 deep learning, that’s a more difficult question.
I am in no way trying to bash NVIDIA; they have been very kind to us and they certainly have their strengths. It’s just that AMD is currently more focused on fp64 than NVIDIA, and fp64 performance is what Stan needs.
Thanks, that makes sense given how applications are written and GPUs are optimized.
This is really great news that they’re concentrating on double-precision.
I think we should evaluate where that’s the case. I’m guessing that we could probably get away with single-precision arithmetic in GLMs like logistic regression.