Stan on the GPU


#1

Stan on the GPU using OpenCL/ViennaCL

We started discussing this topic in an e-mail thread with Bob and Steve and decided to move the discussion to the forums.

Quick intro:

Erik Štrumbelj and I, from the University of Ljubljana, received a grant for a 3-year project in which we are trying to speed up algorithms for Bayesian inference on the GPU. In the first part of the project we created an R package for the execution of Bayesian lasso and multinomial logistic regression on the GPU. We gained speedups of 100- to 150-fold compared to the sequential implementations in C/C++. In the next 2 years we plan to focus on using GPUs to speed up the Stan C++ code. Our plan is to use OpenCL in order to be able to run the software on GPUs from all major vendors (NVIDIA, ATI/AMD, Intel).

Final goal: make Stan run 100 times faster for models/data of mid-to-high complexity/size on mid-to-high-range GPUs, and, if possible, not run slower on smaller problems.

We all seem to agree that the best approach is to extend the Stan compiler so that it emits GPU code - either OpenCL directly or via one of its libraries (ViennaCL, …). We generally like ViennaCL, and in the short term it seems the easiest way to get some speedups. We do, however, have some concerns we should discuss:

  • How quickly can new additions be integrated? We will probably have a lot of non-standard matrix/vector operations that need to be parallelized and executed on the GPU (for instance, adding to the diagonal, etc.).
  • How optimized is ViennaCL for dense matrix operations? We have created our own Cholesky decomposition that still has some room for improvement, as it was done quickly. The code is currently almost twice as fast as ViennaCL's Cholesky (~80-fold faster than Eigen's LLT and ~40-fold faster than its LU on an AMD HD7570, a relatively old mid-to-high-range GPU). I have posted our code/results on Dropbox: https://www.dropbox.com/s/foyzn6llhpifwvf/eigen_vs_vienna_dense_lu.tar.gz?dl=0. To compare, I used the code Steve posted a week ago. The results are in the HD7570.txt and K20m.txt files (I am a new user, so I could not attach files).
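For readers who want a concrete picture of what such a kernel computes, here is the textbook unblocked Cholesky-Crout algorithm in plain C++. This is our own serial reference sketch for illustration, not the OpenCL code from the Dropbox link; a GPU version parallelizes the inner column updates.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Unblocked Cholesky-Crout: factor a symmetric positive-definite
// matrix A (row-major, n x n) into L * L^T and return the lower
// triangle L. The updates of column j below row j are independent
// of one another, which is what a GPU kernel exploits.
std::vector<double> cholesky(const std::vector<double>& A, int n) {
  std::vector<double> L(n * n, 0.0);
  for (int j = 0; j < n; ++j) {
    double d = A[j * n + j];
    for (int k = 0; k < j; ++k) d -= L[j * n + k] * L[j * n + k];
    L[j * n + j] = std::sqrt(d);
    for (int i = j + 1; i < n; ++i) {  // each i is independent: parallelizable
      double s = A[i * n + j];
      for (int k = 0; k < j; ++k) s -= L[i * n + k] * L[j * n + k];
      L[i * n + j] = s / L[j * n + j];
    }
  }
  return L;
}
```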

The other question we should maybe discuss is how much of the computation to move to the GPU.

Parallelize only parts of the code (matrix/vector operations, embarrassingly parallel for loops…)

  • least amount of work
  • a lot of data transfers from and to the GPU

Move all posterior/gradient/transformation/generated quantities calculations to the GPU?

  • only parameters need to be transferred, data can be transferred to GPU only once
  • we’d need a GPU alternative for everything that Stan does

A hybrid of the two above?

  • It would not be costly to detect during compilation which data can be permanently moved to the GPU and which partially computed quantities can remain there. Maybe this can be done very effectively.

Move everything to GPU (including NUTS sampler)?

  • No data transfers.
  • Data transfer benefits probably won’t justify the added complexity of parallelizing the sampler.
  • Would have to do it again for each sampler.

The third question we had is whether we are restricted to double-precision arithmetic. We could, for instance, copy a double array to a float array, compute with floats on the GPU, and then copy the results back to a double array. What is the largest error that is still acceptable? If this were possible, it would bring another 2-fold speedup on most GPUs (those that are not compute-only cards).
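As a toy illustration of that double→float→double round trip (our own sketch; the function names are made up, and the "GPU computation" is just a dot product on the host):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Copy double data to float, accumulate in single precision, and
// return the result as a double -- the scheme described above.
double dot_via_float(const std::vector<double>& x,
                     const std::vector<double>& y) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < x.size(); ++i)
    acc += static_cast<float>(x[i]) * static_cast<float>(y[i]);
  return static_cast<double>(acc);  // back to double for the caller
}

// Reference computation entirely in double precision.
double dot_double(const std::vector<double>& x,
                  const std::vector<double>& y) {
  double acc = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) acc += x[i] * y[i];
  return acc;
}
```

On examples like this, the single-precision result typically agrees with the double-precision one to only about 6-7 significant digits, which is exactly the kind of error budget the question is about.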


#2

This sounds super-cool. I’m ccing Aki and Ben in case they have thoughts on this. I have no technical knowledge of any of this stuff, but as a user I’m interested in 100x speedups!
Andrew


#3
  • How quickly can new additions be integrated? We will probably have a lot of non-standard matrix/vector operations that need to be parallelized and executed on the GPU (for instance, adding to the diagonal, etc.).

I believe a lot of this is in ViennaCL. We could go through the ViennaCL docs and see what things Eigen can do that ViennaCL cannot:

http://viennacl.sourceforge.net/doc/

How optimized is ViennaCL for dense matrix operations?

They don’t explicitly list matrix operations, but you can see some benchmarks here:

http://viennacl.sourceforge.net/viennacl-benchmarks.html

Though I would suppose that for simple things it is reasonably optimized; we can create our own benchmarks as well.

We have created our own Cholesky decomposition that still has some room for improvement as it was done quickly.

I just looked at it and it is very cool!

The code is currently almost twice as fast as the ViennaCL for Cholesky.

This is awesome! Though, looking at the outputs, it seems like your version loses a bit of precision. It’s also odd that the precision is exact for the largest matrix but off for two of the smaller matrices. Given that your team whipped that version up pretty quickly, would you be able to make a version that maintains precision equal to the Eigen version?

A hybrid of the two above?

  • It would not be costly to detect during compilation which data can be permanently moved to the GPU and which partially computed quantities can remain there. Maybe this can be done very effectively.

Maybe an alternative stan-math-gpu library? That would be a lot of work, but it seems like the norm is to create a separate package for the GPU version of a library. This is a design choice that I can’t really comment on with any level of expertise.

Overall I think we could approach this in stages: first doing (1), then (2), and then testing whether (3) would be useful.

The third question we had is whether we are restricted to double-precision arithmetic.

@betanalpha has told me that having double precision is pretty important. I think the goal should be to maintain high precision, then later we can start testing whether we can use floats.

Your Cholesky decomposition is very impressive! Have you thought about speaking to the ViennaCL team about adding it to their library? They mention in this issue on their GitHub that they would be very welcoming of this sort of thing. Anybody please correct me if I’m wrong, but while Stan uses a lot of linear algebra, I would not necessarily call it a linear algebra library. I think a big win-win for everyone would be to have your routines integrated into ViennaCL. Then Stan can still pass linear algebra to another library, and your code gets more exposure, since other people will be able to use it.


#4

PRECISION

It would be a great project (lots of interest, easy route to publication) to investigate how much precision we could lose in gradient and density calculations for the leapfrog algorithm without causing divergences. The 1e-7 vs. 1e-14 in the final draws isn’t a big deal; it’s all about keeping the Hamiltonian simulation on track.

The precision tolerance will vary by model, data set, and parameterization of the model (if you count the latter as something separate). The thing to do would be to start with a case study.
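A toy version of such a case study can be run without GPUs at all: integrate the Hamiltonian of a 1-D standard normal with leapfrog in both precisions and compare the energy error. This is our own illustrative sketch, not Stan's integrator:

```cpp
#include <cassert>
#include <cmath>

// Leapfrog for a 1-D standard normal target (potential U(q) = q^2/2,
// so grad U = q), templated on the floating-point type so the same
// code runs in double or float. Returns |H(end) - H(start)|, the
// energy error that determines whether the simulation stays on track.
template <typename T>
T leapfrog_energy_error(T q, T p, T eps, int steps) {
  T h0 = (q * q + p * p) / 2;  // Hamiltonian at the start
  for (int i = 0; i < steps; ++i) {
    p -= eps / 2 * q;  // half step in momentum
    q += eps * p;      // full step in position
    p -= eps / 2 * q;  // half step in momentum
  }
  T h1 = (q * q + p * p) / 2;
  return std::fabs(h1 - h0);
}
```

Running the same trajectory as `leapfrog_energy_error<double>(1.0, 1.0, 0.01, 1000)` and `leapfrog_energy_error<float>(1.0f, 1.0f, 0.01f, 1000)` gives a first feel for how much energy error single precision adds on top of the discretization error; a real case study would do this per model and step size.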

RELEASES

I have no experience with GPU releases. Moving all the code to GPU leads to maintaining parallel versions of Stan, which is more than we can commit to given the current staffing and developer expertise.

DEPENDENCIES

I like the idea of basing what we do on someone else’s library. It drastically cuts down on our maintenance burden and we get automatic updates if it’s being maintained. I have no opinion about ViennaCL other than that it has an acceptable license, so there’s nothing stopping us from using it in core Stan C++.


#5

We were able to do just that. Below are the results for a somewhat more advanced implementation. The speedup compared to ViennaCL is around 4- to 5-fold for large matrices, 100-160-fold compared to Eigen's LU, and 50-80-fold compared to Eigen's LLT. The code is here.

----------------------------------------------
## Benchmark :: Eigen vs. ViennaCL Performance
----------------------------------------------

Device: Tahiti on the platform AMD Accelerated Parallel Processing
Precision: 2.22045e-16

Times in seconds:

Matrix       Partial Pivot LU (Eigen)   LLT (Eigen)   LU (ViennaCL)   BayesCL
16x16        0.000151                   8e-05         0.000465        0.001324
100x100      0.004632                   0.002384      0.281849        0.0068
400x400      0.185878                   0.090528      0.227896        0.027893
900x900      1.9566                     0.930773      0.256151        0.070389
1600x1600    10.4645                    5.08227       0.783947        0.152112
2025x2025    21.1169                    10.3092       1.04972         0.208007
3600x3600    117.924                    57.5248       3.21366         0.733203

For every size, the Eigen LLT lower diagonal matched both the ViennaCL and the BayesCL results.

OK. We should then start with double precision and try single precision if we get positive results in a case study like the one Bob is mentioning. Enabling single precision is not that big a task after all, and we also know what speedups to expect from it.

Maybe an alternative stan-math-gpu library?

I agree. So we would basically create copies of files like this /prim file and this /rev file and replace the code with GPU code where we can?
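To make the question concrete, here is a hypothetical sketch of what such a copied file could contain: the same function name and semantics as the CPU version, overloaded on a GPU matrix type. Everything here (`gpu_matrix`, `add_diag`) is made up for illustration; it is not Stan's or ViennaCL's API, and a real backend would launch an OpenCL kernel instead of looping on the host.

```cpp
#include <cassert>
#include <vector>

// Stand-in for a device-resident matrix. In a real stan-math-gpu
// library, host_copy would be an OpenCL buffer handle instead.
struct gpu_matrix {
  std::vector<double> host_copy;
  int rows, cols;
};

// CPU prim-style version: add a scalar to the diagonal of an n x n
// row-major matrix (one of the "non-standard" operations mentioned
// earlier in the thread).
std::vector<double> add_diag(std::vector<double> m, int n, double c) {
  for (int i = 0; i < n; ++i) m[i * n + i] += c;
  return m;
}

// GPU overload with identical semantics; the loop body is exactly
// what an OpenCL kernel would run, one work-item per diagonal entry.
gpu_matrix add_diag(gpu_matrix m, double c) {
  for (int i = 0; i < m.rows; ++i) m.host_copy[i * m.cols + i] += c;
  return m;
}
```

The appeal of this pattern is that the Stan compiler (or the user) can switch backends by changing the matrix type, while the generated model code calls `add_diag` either way.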

We have looked into combining ViennaCL with custom OpenCL code, and it looks like this is seamless; we can mix both if needed. Our suggestion would be to start by implementing ViennaCL in the biggest bottleneck and move on from there (as you said). We know what the bottlenecks are for our models/datasets, but you definitely have a broader view of this.


#6

Do you happen to have Intel MKL and Xeon Phis available? It would be very interesting to see this comparison. This combination can already be used to speed up Stan without any additional change to the code, since Eigen can be configured to use MKL, which can offload work to the Phis.

I know that this has a hefty price tag on it, but I wanted to point out this possibility, as it already works with the code we have.
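For anyone wanting to try this route, Eigen's MKL backend is enabled with a single define before including any Eigen header (you also need to link against MKL, e.g. `-lmkl_rt`; exact link flags depend on your MKL installation):

```cpp
// Route Eigen's dense kernels through Intel MKL.
// Must appear before the first Eigen include.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>
```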

In fact, I may even have such a system available to me at some point and I am quite looking forward to some benchmarks on it.

Sebastian


#7

@wds15, thanks for pointing this out. We do have the hardware and we’ll try to provide a benchmark as soon as we find time.


#8

We are proceeding with our project of parallelizing Stan on the GPU, but are still in the early phase of trying to figure out the best approach.

USE EXTERNAL (for example, ViennaCL) OR BUILD CUSTOM LIBRARY

At this point it seems that using an external library would not be that much of a benefit.

First, it should not be underestimated how much would need to be added in order to achieve substantial speedups. Matrix multiplication/inversion/decomposition might be the obvious bottlenecks, but in order to achieve 100-fold speedups, things like parallel draws from distributions also have to be parallelized. And then there is automatic differentiation: parallelizing Cholesky is not the same as parallelizing the derivative of Cholesky, etc.

And second, the iterative nature of MCMC offers some unique opportunities (tuning GPU performance across iterations, smartly leaving data on the GPU to avoid unnecessary transfers, …), but these would be difficult to exploit with existing libraries.

ALTERNATIVE SAMPLERS

Things would be much simpler to parallelize if we didn’t have to deal with gradients. Has there been any success implementing something like the affine-invariant ensemble sampler of Goodman & Weare (2010) in Stan, or something similar? Could someone with more experience comment on whether this would be of the same order of efficiency as HMC?
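For concreteness, the core of Goodman & Weare's sampler is the gradient-free "stretch move" below. This is our own minimal serial sketch for a 1-D standard normal target, not an existing Stan feature; the appeal for GPUs is that each move needs only log-density evaluations, and in the parallel variant half the ensemble can be updated simultaneously.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// One sweep of Goodman & Weare (2010) stretch moves over an ensemble
// of 1-D walkers targeting log p(x) = -x^2/2. For each walker k, pick
// a complementary walker j, draw z ~ g(z) propto 1/sqrt(z) on [1/a, a],
// propose y = x_j + z * (x_k - x_j), and accept with probability
// min(1, z^(d-1) * p(y)/p(x_k)); here the dimension d = 1.
void stretch_move_sweep(std::vector<double>& walkers, std::mt19937& rng,
                        double a = 2.0) {
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  const int n = static_cast<int>(walkers.size());
  auto log_p = [](double x) { return -0.5 * x * x; };
  for (int k = 0; k < n; ++k) {
    int j = static_cast<int>(unif(rng) * (n - 1));  // complementary walker
    if (j >= k) ++j;                                // ensure j != k
    double z = unif(rng) * (std::sqrt(a) - std::sqrt(1 / a)) + std::sqrt(1 / a);
    z = z * z;  // inverse-CDF sample of g(z) on [1/a, a]
    double y = walkers[j] + z * (walkers[k] - walkers[j]);
    if (std::log(unif(rng)) < log_p(y) - log_p(walkers[k]))
      walkers[k] = y;  // accept; z^(d-1) = 1 in one dimension
  }
}
```

Whether this matches HMC's efficiency is exactly the open question in the post above; in high dimensions ensemble methods often mix much more slowly, which is worth keeping in mind before betting on them.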

SINGLE/DOUBLE PRECISION

@Stevo15025 already hinted that double precision is very important. Could someone confirm this and explain why it is the case?


#9

You won’t get around the gradients… that is what makes HMC so efficient.

Your point about transferring data to the GPU and then reusing it over and over is a good one, but it will require some smart tricks or brute-force work.


#10

As to your big question, the title of my post might give it away:

This “100-fold speedup” thing is slippery. 100-fold speedups are achievable already using @wds15’s OpenMP parallelization of multiple ODE solver calls. That’s just parallelizing the likelihood in a Bayesian model.

If we need a lot of posterior draws (either because we want very high precision or because mixing is very slow), then after adaptation we can perform those draws in an embarrassingly parallel fashion.
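That embarrassingly parallel part can be sketched with plain threads. This is a toy stand-in: `draw_one` below samples from a normal distribution instead of running a full HMC transition, and the function names are ours.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <thread>
#include <vector>

// Post-adaptation draws run as independent "chains": each thread gets
// its own RNG seed and writes to its own disjoint slice of the output,
// so no locking is needed. draw_one stands in for an HMC transition.
std::vector<double> parallel_draws(int n_chains, int n_draws,
                                   unsigned base_seed) {
  std::vector<double> out(n_chains * n_draws);
  std::vector<std::thread> threads;
  for (int c = 0; c < n_chains; ++c) {
    threads.emplace_back([&out, n_draws, base_seed, c]() {
      std::mt19937 rng(base_seed + c);  // per-chain seed
      std::normal_distribution<double> draw_one(0.0, 1.0);
      for (int i = 0; i < n_draws; ++i)
        out[c * n_draws + i] = draw_one(rng);
    });
  }
  for (std::size_t t = 0; t < threads.size(); ++t) threads[t].join();
  return out;
}
```

(Compile with `-pthread` on most toolchains.) The same structure carries over to a GPU: independent chains or independent likelihood terms map onto independent work-groups.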

There are some adaptation steps that I think we can parallelize.

Then if you look at something like Riemannian HMC, there are Hessian computations which are embarrassingly parallel up to the number of parameters. These dominate RHMC’s compute time. And RHMC mixes amazingly well for hard problems, so this has some chance of making it tractable.

All the data (in the statistical sense) can sit on the GPU. But many of the operands are parameters, which vary draw to draw during MCMC.

We’re not 100% sure how much precision we need for accurate Hamiltonian simulations. We do know that it will vary by model. This could be explored without GPUs and I think it’d make a great project for someone. Matt Hoffman did a little bit in a post either here or on our old mailing list.


#11

What’s the latest on this? I’ve got an old-ish AMD GPU (2013 vintage) and an NVidia 1050Ti I can test on if there’s code somewhere?


#13

Thanks!