ViennaCL with Stan: Cholesky Benchmark

Below in the Dropbox link is a standalone benchmarking example for performing Cholesky decomposition in ViennaCL and Eigen. It compiles via g++ with the -lOpenCL compiler flag. This requires having OpenCL installed (whether through NVIDIA or AMD) as well as a GPU.

Running make will create eigen_vienna_LU_compare.o, which runs the benchmark and gives output such as the one below. These results come from an NVIDIA GTX 780. For the largest matrix example, the 3600x3600 case, ViennaCL with an OpenCL backend gives a 17.5x speedup relative to Eigen's LLT and a 35.7x speedup relative to Eigen's LU. The U of the ViennaCL and Eigen LU decompositions is transposed and then divided column-wise by the square roots of U's diagonal elements to give the L of the Cholesky. Results match across Eigen and ViennaCL up to my computer's double machine precision, 2.22045e-16.
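
For concreteness, here is a minimal, self-contained, Eigen-only sketch of that U-to-L conversion; the unpivoted Doolittle LU below is my own stand-in for the library factorizations. The algebra: for SPD A we have A = L_lu U with U = D L_lu^T and D = diag(U), so dividing column j of U^T by sqrt(U(j,j)) yields the Cholesky factor L = L_lu D^(1/2).

// Hedged sketch: recover the Cholesky factor L from an unpivoted LU of an
// SPD matrix by transposing U and rescaling its columns.
#include <Eigen/Dense>
#include <cmath>
#include <iostream>

int main() {
  const int n = 4;
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
  Eigen::MatrixXd A
      = B * B.transpose() + n * Eigen::MatrixXd::Identity(n, n);  // SPD

  // Unpivoted Doolittle LU: overwrite a copy of A with L_lu (strict lower
  // triangle, unit diagonal implied) and U (upper triangle).
  Eigen::MatrixXd LU = A;
  for (int k = 0; k < n - 1; ++k)
    for (int i = k + 1; i < n; ++i) {
      LU(i, k) /= LU(k, k);
      for (int j = k + 1; j < n; ++j)
        LU(i, j) -= LU(i, k) * LU(k, j);
    }
  Eigen::MatrixXd U = LU.triangularView<Eigen::Upper>();

  // Transpose U and divide column j by sqrt(U(j, j)) to get the Cholesky L.
  Eigen::MatrixXd L = U.transpose();
  for (int j = 0; j < n; ++j)
    L.col(j) /= std::sqrt(U(j, j));

  Eigen::MatrixXd L_ref = A.llt().matrixL();  // reference Cholesky factor
  std::cout << "max abs diff: " << (L - L_ref).cwiseAbs().maxCoeff() << "\n";
}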

In the meeting today we discussed next steps, roughly that Rob and I will talk about how to integrate this with autodiff. The code is fairly simple, so I have hopes this will actually be nice!

I would suggest at this point that someone review the code to replicate my results. If you do, please post your GPU make/model, the benchmark output, and if possible the results from running deviceQuery (which should be in the samples/1_Utilities folder if you are running nvcc).

Note: ViennaCL lists a number of GPU devices on which they have tested, which you can view here.

Benchmark Results

 -- Times for 16x16 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 0.000705
----------------------------------------------
Time for LLT (Eigen): 0.000322
----------------------------------------------
Time for LU (ViennaCL): 0.000166
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 100x100 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 0.004182
----------------------------------------------
Time for LLT (Eigen): 0.002207
----------------------------------------------
Time for LU (ViennaCL): 0.00578
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 400x400 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 0.173933
----------------------------------------------
Time for LLT (Eigen): 0.084297
----------------------------------------------
Time for LU (ViennaCL): 0.038564
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 900x900 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 1.84785
----------------------------------------------
Time for LLT (Eigen): 0.872908
----------------------------------------------
Time for LU (ViennaCL): 0.179966
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 1600x1600 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 9.91686
----------------------------------------------
Time for LLT (Eigen): 4.83424
----------------------------------------------
Time for LU (ViennaCL): 0.576174
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 3600x3600 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 109.988
----------------------------------------------
Time for LLT (Eigen): 53.9652
----------------------------------------------
Time for LU (ViennaCL): 3.07835
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------

deviceQuery

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 780"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 3018 MBytes (3164209152 bytes)
  (12) Multiprocessors, (192) CUDA Cores/MP:     2304 CUDA Cores
  GPU Max Clock rate:                            902 MHz (0.90 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 780
Result = PASS

Also, looking at the Stan math library, I believe here is where this would go.

Q1: What is the easiest way to integrate this into the current framework?
Q2: How should users be exposed to this?

For (1), we need to check at compile time whether the user has an OpenCL-capable GPU and compile the ViennaCL code with the option to use the GPU.

For (2), I think it would just be a flag at runtime, something like ‘gpu’, that would access the ViennaCL version.

This is great! I’m super excited about getting this in. The file you’re pointing to is in the primitive math library. Where we really need the speedup is in the reverse-mode automatic differentiation library, so same path, with prim replaced by rev.

You should try to coordinate with Rob Trangucci, who’s been taking the lead on speeding up matrix computations, and/or with Michael Betancourt and Ben Goodrich, who understand the computational and derivative issues. I can help with code organization, and Sean, who’s on vacation for a while, should be able to help with the builds.

Right now, the bottleneck in Stan is computing the log density function and its gradients. For gradients, we need the first derivative of every output w.r.t. every input coded efficiently into an expression graph. The nice thing is that the derivatives are just some more matrix multiplies, and speeding those up will also be a huge win, as that’s where 99% of our computation time goes.

Thanks Bob, I am also very excited!

@rtrangucci do you have time to meet up this week?

It may be a naive hope, but maybe we can just change things around line 136 and place my ViennaCL code there.

Bob, do you mean @seantalts? I’m a total babe in the woods when it comes to how we should compile this and would love to talk with someone about it. Though he is on vacation, and I don’t want to interrupt that.

Depending on whether the user has a GPU, we want this define to execute:

#ifndef VIENNACL_WITH_OPENCL
  #define VIENNACL_WITH_OPENCL
#endif

which just tells ViennaCL, “This is going to the GPU.”

And we also only want to go to the GPU version if the user asks for it.
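
As a minimal sketch of that gating (the names below are hypothetical, nothing in stan-math yet):

// Hedged sketch: take the ViennaCL/OpenCL path only when the build defined
// VIENNACL_WITH_OPENCL *and* the user opted in at runtime.
#ifdef VIENNACL_WITH_OPENCL
const bool gpu_compiled_in = true;
#else
const bool gpu_compiled_in = false;
#endif

// 'user_asked_for_gpu' would come from a runtime flag such as the 'gpu'
// option discussed above.
inline bool use_gpu(bool user_asked_for_gpu) {
  return gpu_compiled_in && user_asked_for_gpu;
}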

I am also very interested in hearing the thoughts of @rok_cesnovar and his team.

I am just in the process of opening a new topic on the forum to discuss this. I have run your code and our implementation of the Cholesky on two GPUs (HD7570 & K20m). Will run on two more tomorrow. Hopefully this will be of some help.

@stevebronder

Yeah, I have time this week. Can we meet after Michael’s talk on Thursday?

@rtrangucci sure!

@stevebronder, any more progress? I’m excited too.

Apologies, I became busy with life things before starting my job next week. I met up with Rob, and he showed me what we need to change within the current Cholesky code and how. Going to fork stan-math on GitHub and make an example that should be done tomorrow.

Made a fork of the stan-math library on GitHub. It’s a bit pseudocode-ish, as I’m not sure how we compile it. I think we need to check whether the user has an OpenCL-capable GPU so we can compile with -lOpenCL and define the ViennaCL and OpenCL connection.

Big thanks to @rtrangucci for writing most of the code for this, which I have attempted to fit inside the normal framework.

Would this be easier for others to edit if it were a branch on stan-math? I think I would need contributor rights.

Changes:

  1. Adding the ViennaCL library in lib
  2. Adding viennacl to the libraries file in make
  3. A viennacl.hpp file that is similar to Eigen.hpp in prim (a sketch follows this list)
  4. Adding a cholesky_gpu class in cholesky_decompose.hpp
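
For change 3, here is a sketch of what viennacl.hpp might contain, mirroring how Eigen.hpp centralizes includes; the header-guard name and the exact include set are my guesses:

#ifndef STAN_MATH_PRIM_MAT_FUN_VIENNACL_HPP
#define STAN_MATH_PRIM_MAT_FUN_VIENNACL_HPP

// Tell ViennaCL to use the OpenCL backend (the define discussed above).
#ifndef VIENNACL_WITH_OPENCL
#define VIENNACL_WITH_OPENCL
#endif

#include <viennacl/matrix.hpp>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/lu.hpp>
#include <viennacl/linalg/direct_solve.hpp>

#endif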

Below is the link to cholesky_decompose.hpp. I have most of it sorted out, but need to double-check two things:

  1. whether upper_tag() and lower_tag() in ViennaCL’s solve() make it use just the upper or lower triangle (see the sketch below), and
  2. how to calculate the adjoint in ViennaCL.
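
On (1), here is how I read ViennaCL’s documented tag-based solve(); a hedged sketch, assuming VIENNACL_WITH_EIGEN for the viennacl::copy interop with Eigen types:

#define VIENNACL_WITH_OPENCL
#define VIENNACL_WITH_EIGEN  // enables viennacl::copy for Eigen matrices

#include <Eigen/Dense>
#include <viennacl/matrix.hpp>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/direct_solve.hpp>

int main() {
  const int n = 3;
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
  Eigen::MatrixXd A
      = B * B.transpose() + n * Eigen::MatrixXd::Identity(n, n);  // SPD
  Eigen::VectorXd b = Eigen::VectorXd::Random(n);

  viennacl::matrix<double> vcl_A(n, n);
  viennacl::vector<double> vcl_b(n);
  viennacl::copy(A, vcl_A);
  viennacl::copy(b.data(), b.data() + n, vcl_b.begin());

  // The tag selects which triangle of vcl_A the solver reads; the rest of
  // the matrix is simply ignored (unit_lower_tag()/unit_upper_tag() would
  // additionally assume an implicit unit diagonal).
  viennacl::vector<double> x_low =
      viennacl::linalg::solve(vcl_A, vcl_b, viennacl::linalg::lower_tag());
  viennacl::vector<double> x_up =
      viennacl::linalg::solve(vcl_A, vcl_b, viennacl::linalg::upper_tag());
}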

Math is (almost) a header-only library, and it can still be compiled header-only. What are you looking to compile? If you have a test of some sort, it’s easy to compile that. If you’re looking to compile a Stan program, you’ll want to do it from an interface – CmdStan will be easiest.

Yes. Most definitely. We can give that to you. @seantalts, want to set Steve up (GitHub: Stevo15025)?

Overall: this looks awesome. How can the rest of us be helpful to you?

What are you looking to compile?

Sorry for the confusion. To clarify,

  1. The program currently moves to the Cholesky GPU code if the number of rows exceeds 500, though that’s silly. I would like a flag the user can specify in CmdStan (-gpu=TRUE) that would tell the program to use the GPU version of the code.
  2. I want to see whether or not I can write a stan program and have it run given the changes I made
  3. There should be a test that runs the CPU and GPU versions of the Cholesky and checks whether their results are equal (see the test sketch after this list).
  4. For ViennaCL to compile and use the OpenCL backend, it needs #define VIENNACL_WITH_OPENCL as well as the -lOpenCL option for g++. Otherwise it will compile the CPU version of ViennaCL.
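
For (3), something like this googletest skeleton is what I have in mind; both function names here are hypothetical stand-ins for whatever we end up calling the CPU and GPU versions:

#include <gtest/gtest.h>
#include <Eigen/Dense>

// Hypothetical declarations for the two versions under discussion.
Eigen::MatrixXd cholesky_decompose(const Eigen::MatrixXd& m);
Eigen::MatrixXd cholesky_decompose_gpu(const Eigen::MatrixXd& m);

TEST(MathMatrix, choleskyCpuGpuMatch) {
  const int n = 400;
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
  Eigen::MatrixXd A
      = B * B.transpose() + n * Eigen::MatrixXd::Identity(n, n);  // SPD

  Eigen::MatrixXd L_cpu = cholesky_decompose(A);
  Eigen::MatrixXd L_gpu = cholesky_decompose_gpu(A);

  // Check element-wise agreement up to a tolerance near double precision.
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i)
      EXPECT_NEAR(L_cpu(i, j), L_gpu(i, j), 1e-12);
}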

I’m confused about the best solution for (3). The nicest thing would be that, if the user sets the -gpu flag to TRUE when compiling the Stan program (a sketch of this flow follows the list):

  1. The stan-math library is compiled with the -lOpenCL flag
  2. The if statement to include #define VIENNACL_WITH_OPENCL passes
  3. The Cholesky code’s if statement sends the program to the GPU version of the code
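Put together, the flow I’m imagining looks roughly like this. Again a sketch: the function names are hypothetical, and the viennacl::copy calls for Eigen types assume the build also defined VIENNACL_WITH_EIGEN.

#include <Eigen/Dense>
#include <cmath>

#ifdef VIENNACL_WITH_OPENCL  // step 2: the define passed, so pull in ViennaCL
#include <viennacl/matrix.hpp>
#include <viennacl/linalg/lu.hpp>

inline Eigen::MatrixXd cholesky_decompose_gpu(const Eigen::MatrixXd& m) {
  // LU on the device, then the U -> L rescaling described earlier.
  viennacl::matrix<double> vcl_m(m.rows(), m.cols());
  viennacl::copy(m, vcl_m);
  viennacl::linalg::lu_factorize(vcl_m);
  Eigen::MatrixXd lu(m.rows(), m.cols());
  viennacl::copy(vcl_m, lu);
  Eigen::MatrixXd U = lu.triangularView<Eigen::Upper>();
  Eigen::MatrixXd L = U.transpose();
  for (int j = 0; j < L.cols(); ++j)
    L.col(j) /= std::sqrt(U(j, j));
  return L;
}
#endif

inline Eigen::MatrixXd cholesky_decompose(const Eigen::MatrixXd& m) {
#ifdef VIENNACL_WITH_OPENCL
  return cholesky_decompose_gpu(m);  // step 3: dispatch to the GPU version
#else
  return Eigen::MatrixXd(m.llt().matrixL());  // default: Eigen's LLT on CPU
#endif
}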

Once I get contributor rights (also for @rok_cesnovar), I will make a branch and a pull request, which should make this easier to discuss and help with.

Overall: this looks awesome. How can the rest of us be helpful to you?

Thanks! I need to figure out how to compile a simple Stan program using the GPU backend, so that’s mostly on me. If anyone can clarify or add to my above thoughts, it would be appreciated.

Steve’s been sent an invite to join the organization as a member and become a part of the “math team.”

We might want to have a larger discussion sometime about the general collaboration model; for example we support Jenkins testing pull requests from forks now :D I’ve been reading the book Bob posted, notably http://producingoss.com/en/committers.html and http://producingoss.com/en/consensus-democracy.html#electorate


Received and accepted! I pinky promise not to approve or touch anything that I should not.

Tomorrow I will move everything over to a branch and make a pull request for discussion.

Discussion can be moved to https://github.com/stan-dev/math/pull/529

Good idea. Let’s try to develop a more official policy on adding people to the dev team, allowing people to approve pull requests, getting admin privileges, etc.


@stevebronder: first off, congratulations! I’m replying here to consolidate the discussions over pull requests on CmdStan#549 and Math#529.

Since we’re much further along, I think it’s a good time to think about how to get things into the codebase for users to use. I’ve thought about it a bit, but I have a bunch of questions, and I think I have a different position from what I was thinking last week.

Math

This is for users of the math library in C++.

  • Do you expect users to have the ability to call both a CPU and GPU version of cholesky_decompose within a single C++ program?
    At first, I was thinking that it should either be one or the other depending on settings determined at compile time, but the more I think about it, the more I think we should have two separate functions that could be called. I didn’t look very carefully at the benchmarks, but I could imagine that the tradeoffs for size vs speed are dependent on the actual hardware and someone looking to use the math library may want to control that themselves. Another use case may be that cholesky_decompose is called on matrices of different sizes within a single function call, so it may be optimal to use CPU for one and GPU for another. Does that make sense? If so, I’ll put comments on the pull request.

  • If we go with two separate functions, what should be the behavior of code compiled with access only to CPU?
    I see two sensible options, but there may be a better way to implement this. If the GPU function is called, but the compiler options don’t specify the GPU:

    1. Refuse to compile. Hopefully have a reasonable compiler error message.
    2. Compile the CPU version in place of the GPU version. At compile time, provide compiler warnings that indicate the GPU function is being compiled with a CPU version of the function. At runtime, provide warning messages that indicate that the GPU call is running with the CPU version.
      Thoughts? I think my preference would be to compile the CPU version in place for convenience. That would allow code to be compilable across different platforms, but perhaps it’s better to fail early.
  • Do we want to provide a function that automatically dispatches to CPU vs GPU depending on compiler options?
    If so, should it be the same function name or a different function name? We should handle the makefile options differently to make it work, but I can help with that. We can spec out how we want users to compile these things and then make it happen.

Stan language

These are possible additions we could consider to accommodate the GPU. They pretty much mirror the comments above.

  • Do we have two functions, say something like cholesky_decompose() (for CPU) and cholesky_decompose_gpu()?
  • Do we have one function, cholesky_decompose(), that automatically calls the GPU version dependent on the make / C++ compiler options?
  • If we have one function, should we consider dispatching to CPU vs GPU function call conditioned on size?
  • If we have two functions, do we want models specified with the GPU to be compiled with the CPU version if the GPU is not available? With runtime error messages, of course.

I think I’d prefer two functions, with the code compilable under CPU-only builds. I’m thinking about how we would still want a user to be able to pass a Stan program to the list to get help. We may want to compile the program directly and locally on a machine that may not have a GPU available.

CmdStan

I think once we figure out what we want for Math and Stan, we can figure out how to make it work from the CmdStan interface.

RStan / PyStan

These might be trickier, but I think once we have something working for CmdStan, we should be able to figure it out.


I very much agree with this. In the benchmarks, the GPU really only started getting better once we reached the 400x400 case and above.

My personal preference is 1. While convenient, GPUs are still pretty new tech, and I think it would be better to be strict about this sort of thing.

FTR, the reason I was checking on size in the testing was out of convenience.

Can we use the RViennaCL package?


I think there’s an implicit question here: Do we want to move forward with getting this GPU Cholesky decomposition into Stan now given that @rok_cesnovar is also working on similar things elsewhere? I’m going to assume yes for now; I think this is reasonable because it will immediately give results to users while they’d otherwise wait for a potentially more comprehensive solution, and it will allow us to iron out build, tooling, and API issues along the way.

Re: how users should use Stan w/ GPUs: Here I would stray from @syclik’s position and say we should keep things simple for the user and for us for the first iteration of the API. I think having a single flag that activates all GPU code we have and expects OpenCL to be installed correctly (and has a nice error if not) is the best route.

This would imply that there are no new additions to the Stan language and that users can’t fine-tune when and where a GPU is used within a single model. I think this position is supported by the benchmarks above, where the GPU is faster in every situation except the 100x100 case (where it is about twice as slow, roughly 3 ms slower than the LLT). I think two functions would increase complexity for users and developers in exchange for likely very little value-add; it seems unlikely that users have a model where they need GPUs but where the difference on a 100x100 Cholesky decompose makes a big difference to their end-to-end times. We could eventually look into adaptive thresholds with the same API, and this neatly simplifies the other issues Dan mentioned above (about compiling models pasted to the list with just CPUs, figuring out when to dispatch to CPU vs GPU with graceful fallback, fewer error conditions to keep track of).

I’d like to help @stevebronder get this in (or possibly just take over the remaining work, which I think just centers around the above issue).