ViennaCL with Stan: Cholesky Benchmark

That seems fine for CmdStan, which is built mostly with makefiles, but I'm not sure how RStan / PyStan would work with that. I suppose they also use makefiles under the hood somewhere, right? If that all works nicely, then it does seem a bit better to do it that way, in case adding the include files and link library incurs overhead.

@bgoodri Does RStan have a makefile somewhere that's used to compile models? Trying to scope out Daniel's idea of a makefile variable instead of -D STAN_GPU for RStan (to handle adding the compiler flags that link OpenCL (-lOpenCL) and include ViennaCL).

It has a Makevars file, which is typically at ~/.R/Makevars.

Is there some implicit or templated makefile that’s used when Stan models are actually compiled? I didn’t see one in the rstan github but I’m not sure if it’s generated somehow, or what…

R CMD SHLIB Modules.cpp executes something like

  make -f 'Makevars' -f '/usr/lib/R/etc/Makeconf' -f '/usr/share/R/share/make/' -f '/home/goodrich/.R/Makevars' CXX='$(CXX14) $(CXX14STD)' CXXFLAGS='$(CXX14FLAGS)' CXXPICFLAGS='$(CXX14PICFLAGS)' SHLIB_LDFLAGS='$(SHLIB_CXX14LDFLAGS)' SHLIB_LD='$(SHLIB_CXX14LD)' SHLIB='' OBJECTS='Modules.o'

Gotcha. So it looks to me like we could use what Steve currently has:

ifneq (,$(findstring STAN_GPU, $(CFLAGS)))
    CFLAGS += -lOpenCL
endif

…to achieve something similar to what Daniel posted above about a makefile variable, but just having a single makefile variable that is then used to add both -D STAN_GPU and -lOpenCL (or -framework OpenCL on OS X) looks a little simpler and therefore preferable.
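For illustration, that single-variable approach might look something like this (the variable name, flag spellings, and OS_TYPE check are placeholders, not the actual implementation):

```make
# Hypothetical sketch: the user sets STAN_GPU=true to turn on GPU support.
ifdef STAN_GPU
  CXXFLAGS += -DSTAN_GPU          # compile-time switch for the GPU code paths
  ifeq "$(OS_TYPE)" "mac"
    LDFLAGS += -framework OpenCL  # OS X links OpenCL as a framework
  else
    LDFLAGS += -lOpenCL           # elsewhere, link the OpenCL library
  endif
endif
```

Putting the link flags in LDFLAGS rather than CFLAGS also keeps them out of compile-only invocations, where they have no effect.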

@stevebronder what do you think about that? Or anyone else want to weigh in? I’m happy to let you choose since you’re implementing (thanks!).

I also have a proposal for the testing situation. It has a couple of components:

  • a new directory for tests that require a GPU, probably test/gpu
  • a new makefile target for tests under test/gpu that has the appropriate includes and libraries, mostly copying what exists for unit tests
  • wherever we want to add the appropriate includes and libraries, the ViennaCL include should be cross-platform, but for Mac we will need something like this:
ifeq "$(OS_TYPE)" "mac"
CFLAGS += -framework OpenCL
endif
  • some tests in the test/gpu directory; though I think it might be best to just include whatever unit tests already exist for e.g. cholesky_decompose in the new _test.cpp file, since the only thing that should be different is the compiler options.

I think this will get us covered, and once that’s in place I can take over getting it to work on Jenkins (making sure it has OpenCL, etc) and adding the test/gpu directory to the list of tests to run.
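A rough sketch of what such a test/gpu target could look like, mostly copying the unit-test pattern (the paths, variable names, and flags here are guesses, not the actual Stan makefile layout):

```make
# Hypothetical makefile fragment for GPU tests; all names are placeholders.
ifeq "$(OS_TYPE)" "mac"
  GPU_LDLIBS = -framework OpenCL
else
  GPU_LDLIBS = -lOpenCL
endif

# Build one test binary per test/gpu/*_test.cpp file.
test/gpu/%_test : test/gpu/%_test.cpp
	$(CXX) $(CXXFLAGS) -DSTAN_GPU -I lib/viennacl $< -o $@ $(GPU_LDLIBS)
```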

So I have finally gathered all the results. I wanted to test everything as much as I could in order not to make any quick judgments. For testing I used a GTX 1070, which is a high-end GPU, and compared it against execution times on a high-end i7 CPU.

I took the calculation out of the chain() function and studied the performance in a separate project. The code is here:

EDIT: I put the input matrices into a separate file:

Some quick thoughts, we can discuss it in more detail tomorrow when we talk.

1: Full ViennaCL implementation

Speedup over Eigen: 6-7× for M = 2500, 9-10×+ for M = 5000
Notes: a huge numerical error (1e-1 or 1e-0). However, when I checked this version against the Stan Math tests (cholesky_decompose_test), it either had a small numerical error or passed.
I am not really sure why, but I suspect it is because M in the tests is 1000.

2: ViennaCL - the inplace solves calculated with Eigen (CPU)

Speedup over Eigen: 1.5-2.5×

Notes: I tried to figure out what is causing these numerical errors. There seem to be some problems with the implementation of the inplace solve.
If I execute both solves with Eigen, the numerical error is 3.2e-14, which seems fine. If I execute the first solve on the CPU and the second on the GPU, then in 9 out of 10 runs the numerical error is 3e-14, but sometimes the error is 1e-2 or 1e-1. This sort of random behaviour suggests some kind of race condition.

3: Custom OpenCL

Speedup over Eigen: 7-8× for M = 2500, 10×+ for M = 5000
Notes: We rewrote the whole algorithm in custom OpenCL. The inplace solves are redone as an inverse and two matrix multiplications. We used some optimizations for matrix multiplication (some specific to triangular matrices, some general). The custom inverse performs well and is not a bottleneck (the implementation is a generalization of this pretty old idea:



M = 2500:

------------- Eigen implementation ---------
time: 4.80901
------------- OpenCL implementation ---------
numeric error: 6.39488e-14
time: 0.631345
------------- ViennaCL implementation - full ---------
numeric error: 0.282534
time: 0.701411
------------- ViennaCL implementation - cpu solve ---------
numeric error: 3.55271e-14
time: 3.00574
------------- ViennaCL implementation - cpu half solve ---------
numeric error: 3.55271e-14
time: 2.02017

M = 5000:

------------- Eigen implementation ---------
time: 38.6552
------------- OpenCL implementation ---------
numeric error: 7.54952e-15
time: 3.65901
------------- ViennaCL implementation - full ---------
numeric error: 0.0590687
time: 3.82686
------------- ViennaCL implementation - cpu solve ---------
numeric error: 3.21965e-15
time: 23.1509
------------- ViennaCL implementation - cpu half solve ---------
numeric error: 3.55271e-15
time: 14.1589

Neat! I’ll have to look over this a bit more


Is this the test here that you are referencing? Oddly, putting in M = 2500 works fine, but M = 3000 throws this error:

unknown file: Failure
C++ exception with description "cholesky_decompose: A is not symmetric. A[9,10] = inf, but A[10,9] = inf" thrown in the test body.
[  FAILED  ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients (31496 ms)

Glad you caught this!

It’s late here, but I’ll look this over more before our chat tomorrow.

Had a nice discussion over Google Hangouts with @rok_cesnovar today. Wanted to leave notes here for those interested and to make sure Rok, his team, and I are on the same page.

Short Term:

The numerical instability of the full ViennaCL implementation seems to come from the GPU solve function in the derivative. To resolve this, we are going to use the solver Rok and his team implemented; their solution is faster and numerically stable. We will take their algorithm and wrap it up in ViennaCL like we do with the upper_tri function.
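For context on why a solve appears in the derivative at all: the reverse-mode adjoint of the Cholesky factorization $A = LL^\top$ is commonly written (following Smith 1995 / Murray 2016; treat the exact form here as my paraphrase, not the code's implementation) as

```latex
\bar{A} \;=\; L^{-\top}\,\Phi\!\left(L^{\top}\bar{L}\right)L^{-1},
\qquad
\Phi(X) \;=\; \text{lower triangle of } X \text{ with its diagonal halved.}
```

The two triangular solves against $L$ are exactly where the instability showed up, and since $L^{-\top}\,X\,L^{-1} = (L^{-1})^{\top} X\,L^{-1}$, forming the lower-triangular inverse $L^{-1}$ once and doing two matrix multiplications (as in the custom OpenCL version above) is an algebraically equivalent reformulation that maps better onto the GPU.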

Long Term:

Rok and his colleague Eric want to make a separate GPU library that Stan will call to perform GPU calculations. So instead of calling ViennaCL, or using it as a wrapper for our own algorithms, we can have a separate library that is tailored to the algorithms Stan needs. I agree with this path, and given the amount of care I see from both teams, I think migrating from ViennaCL to their own package should not be difficult in the future.

Another idea Eric brought up involves tuning the GPU ‘parameters’ (i.e., block size and the like) during the warm-up phase. The speed of GPU-based calculations can be quite dependent on tuning the block size for the specific GPU hardware and task. Having something like 1000 iterations during warm-up gives us the opportunity to measure the best block size.

Rok does that cover what we talked about?


I think this covers it, yes. Hope my English was not too bad :)

We discussed giving me rights to the GitHub repository, but for now we will finish the short-term work without them. I will post the code on the GitHub pull request thread, and when you have the time, you can commit and push the changes.

When the short-term work is done, we can then start the discussion on your wishes for the library interface and the roadmap for what should be implemented first. We are excited!


The overall plan sounds good assuming the new code is licensed under BSD or something similar rather than under GPL.

As far as tuning during warmup goes, the issue is going to be one of system architecture. Right now, the Stan models are stateless and the only state maintained iteration to iteration is the pseudorandom number generator.

I’m no longer a GitHub owner, but @seantalts should be and he can set you up with permissions.

The code will be licensed under BSD.

Fantastic! In case you haven't seen, we're also working on MPI at the same time so that we can parallelize large likelihoods.

With both GPUs and distributed multi-core, we should be in business for seriously speeding up big models. Especially since the speedups are pretty much independent of one another.


Neat! In the future we can think about MPI + multiple GPUs too. :-P

I have managed to insert our inverse into the ViennaCL implementation; the numerical error is no longer a problem, nor random as it was before. The execution time of this implementation is 40-60% slower than the custom OpenCL implementation, but the derivative in the chain() function is still 4-5 times faster than Eigen. On the GTX 1070, that is.

The code is attached in the file. chain_vcl_ocl.txt (14.1 KB)

As we discussed with Steve, the next step is to replace the existing chain function with this one and push it to the gpu_cholesky branch. Once we have that, it would be great if you guys could confirm that this implementation of the Cholesky decomposition is the way you envision things (in terms of where the GPU code lives, the _gpu class, etc.), just so we have a reference implementation for all future implementations.

Just a heads up @stevebronder if you have already used the code I posted. I updated the file, as I found a bug in the previous version.

No worries! My GPU at home is currently being used for another project, but I'll put the new version with your solver up tonight.

@rok_cesnovar I just saw this when looking for performance numbers for my roadmap talk. I’ve been told inverse is really unstable. Is this still the plan?

Yes, this is how it is implemented for now.

Just to clarify, this is meant for the lower triangular matrix inverse.

This is used inside the cholesky_decompose derivative and was also used to implement mdivide_left_tri. It passes all the tests, and the GP models we tested it on (for example, showed no issues.

For the regular inverse and solves, we are still looking at options in terms of performance and stability.

Thanks for the clarification. That makes a lot more sense. (I’m a rank amateur in terms of matrix arithmetic implementations and not much better at matrix algebra without computers.)