I have a gaming system at home I use maybe a couple of hours a week; would it be of any use to the Stan devs to have remote access to a system with an NVIDIA 2080?
I’ll let the devs working on the GPU stuff answer that but, either way, thank you for the offer!
@Stevo15025 et al., would this be helpful to have access to?
V jelly of the 2080 ;-)
It could be nice for the benchmark papers. @rok_cesnovar would we want that?
How hard would it be to set up? We have a bunch of AWS money, so if it’s hard I’d rather not take up your time setting it up.
@mike-lawrence Thank you very much for the offer!
We might ping you from time to time when we polish new features to run a performance script, so we can get performance numbers for a wider range of architectures. We currently don’t have access to a 20XX GPU, so that would help.
Just for the record: we usually do at least some performance tests on the NVIDIA GTX 1060, GTX 1070, Titan X, Tesla V100 (AWS), and AMD R9 Fury. Just got my hands on the AMD Radeon VII, will run performance tests today.
Cool, works for me
I crunch digits on an 8GB RTX 2070, so it’s not appreciably different from the RTX 2080 in terms of architecture (both Turing), but I’m also interested in running benchmarks if it’s helpful.
> Just got my hands on the AMD Radeon VII, will run performance tests today.
Very interested in seeing the results for that one
Still trying to figure out who to contact at AMD or Intel about an APU.
The first tests show that, compared to the NVIDIA Titan XP, the Radeon VII is roughly twice as fast on the cholesky decomposition primitive (50x vs. an i7 @ 3.6GHz), the mdivide_left_tri primitive (~100x vs. an i7 @ 3.6GHz), and the bernoulli GLM (10-14x vs. an i7 @ 3.6GHz). All that at roughly half the price.
Keep in mind that these were just some simple tests at arbitrary problem sizes, but it looks great, and it’s definitely a bargain if you are interested in running Stan on the GPU. This also confirms that going with OpenCL was the right choice.
That’s really good news. I take it that’s just plugging in the new GPU without changing code.
Is this a new Radeon product? Any idea if benchmarks other than ours show it outperforming the TitanXP? Or if the difference could be in the OpenCL implementations?
This one seems to rank them about the same. (It was just the first link I found searching.)
A lot of those benchmarks use floats, so it’s a little hard to compare for our use case; the article below seems to cover half floats, floats, and doubles. I copy-pasted the fp64 graphs below. They don’t test against a V100. I’m sure the V100 would beat it, but a V100 is $8K vs. a max of ~$800 for the VII.
Yes, apart from installing the drivers (switching from the Nvidia drivers to AMD’s), it was plug and MCMC.
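For anyone following along, enabling the GPU routines in CmdStan is just a couple of lines in `make/local`; the platform/device IDs below are placeholders and depend on your system (check `clinfo`):

```make
# make/local -- enable Stan's OpenCL (GPU) backend
STAN_OPENCL=true
# Which OpenCL platform/device to use; 0/0 is common on a
# single-GPU system, but verify with clinfo for your machine.
OPENCL_PLATFORM_ID=0
OPENCL_DEVICE_ID=0
```

After that, rebuild the model (a `make clean-all` first helps) so the OpenCL-backed routines get compiled in.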
Yes, the Radeon VII was released in February 2019, though its GPU architecture family (Vega/GCN) has been around for two years. The high-end compute-only GPUs in the same architecture family are the Radeon Instinct MI50 and MI60 (launched in November 2018), both in the $2000+ range.
AMD is also holding a launch of their new GPU architecture on the 7th of July.
@Stevo15025 already answered this one. A good number to get a feel for is the theoretical double precision performance listed here. For instance, the Radeon VII has a theoretical fp64 performance of 3.36 TFLOPS, while the Titan XP is listed as having 379.7 GFLOPS of fp64 performance. The Nvidia V100 has ~7 TFLOPS, the AMD Instinct MI50 6.7 TFLOPS.
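For a rough sense of scale, those theoretical fp64 numbers can be turned into performance-per-dollar figures. The prices below are approximate street prices I’m assuming for illustration, not official figures:

```python
# Rough perf-per-dollar comparison from the theoretical fp64 numbers
# quoted above. Prices (USD) are approximate street prices -- assumptions.
gpus = {
    # name: (fp64 GFLOPS, approx. price in USD)
    "Radeon VII":    (3360.0,  700),
    "Titan XP":      (379.7,  1200),
    "Tesla V100":    (7000.0, 8000),
    "Instinct MI50": (6700.0, 3500),
}

for name, (gflops, price) in gpus.items():
    print(f"{name:>14}: {gflops:7.1f} GFLOPS, {gflops / price:5.2f} GFLOPS/$")
```

By this (very crude) metric the Radeon VII is roughly 8-9x the Titan XP in raw fp64 throughput and far ahead of everything else per dollar.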
The details of the two new AMD GPUs launching on 7/7 are also out: the RX 5700 (~$380) will supposedly have an fp64 performance of 468.0 GFLOPS, and the XT variant ($450) 560 GFLOPS. Both will also support PCIe 4.0, which should mean less of a penalty for data transfers (once PCIe 4.0 is more widely supported).
The difference in FLOPS is not everything, of course. The size of the GPU’s global memory and its speed are also factors in compute performance. Keep in mind that with our current approach the bottlenecks are actually the CPU <-> GPU data transfers, which are limited by the speed of the PCIe bus. Hence a 10x difference in FLOPS is not going to result in a 10x speedup. Not to mention that primitives like mdivide_left_tri are not your typical embarrassingly parallel GPU problems where FLOPS are everything.
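A back-of-the-envelope sketch of why transfers dominate at moderate sizes, assuming an idealized PCIe 3.0 x16 bus (~16 GB/s) and the Radeon VII’s theoretical 3.36 fp64 TFLOPS (both peak figures, so real timings will be worse):

```python
# Time to ship an N x N double matrix over PCIe versus the time a
# Cholesky factorization (~N^3/3 flops) takes at peak fp64 rate.
# Both bandwidth and FLOPS are theoretical peaks -- assumptions.
PCIE3_X16_BPS = 16e9   # bytes/s, theoretical PCIe 3.0 x16
FP64_FLOPS = 3.36e12   # Radeon VII theoretical fp64 rate

def transfer_s(n):
    return n * n * 8 / PCIE3_X16_BPS   # 8 bytes per double

def cholesky_s(n):
    return (n ** 3) / 3 / FP64_FLOPS

for n in (1_000, 5_000, 20_000):
    t, c = transfer_s(n), cholesky_s(n)
    print(f"N={n:>6}: transfer {t * 1e3:8.3f} ms, compute {c * 1e3:8.3f} ms")
```

With these numbers the transfer is the larger cost up to roughly N = 5000, which is why raw FLOPS alone don’t predict the speedups we measure.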
From the benchmarks we have and based on the theoretical performance, I would currently recommend the Radeon VII to anyone wanting to run Stan on the GPU. And for anyone on a smaller budget, I would hold off for the 7/7 launch, as $400 GPUs with Titan XP fp64 performance numbers seem like a huge bargain. If you also want to run games or do fp32 or fp16 deep learning, that’s a more difficult question.
I am in no way trying to bash NVIDIA; they have been very kind to us and they certainly have their strengths. It’s just that AMD is currently more focused on fp64 than NVIDIA, and fp64 performance is what Stan needs.
Thanks, that makes sense given how applications are written and GPUs are optimized.
This is really great news that they’re concentrating on double-precision.
I think we should evaluate where that’s the case. I’m guessing that we could probably get away with single-precision arithmetic in GLMs like logistic regression.