How to do caching on the GPU?

Mostly writing here to make my thoughts more solid.

We want a cache on the GPU so that data is only transferred once. I was thinking something like this:

  1. Add a flag opencl_copy_flag (?) to Eigen data matrices so we know whether or not a matrix has already been moved over to the GPU.
  2. We can either:
    a. Have a dictionary in the opencl_context that holds an identifier for the matrix, a pointer to the data buffer on the GPU, and a counter for how many times the matrix has been used, sort of like @lru_cache in Python. We would also need to keep track of the amount of space left in the cache: when a matrix is moved over, we check whether there is room for it, and if not we evict the least-used item. (A rough sketch of this is below.)
    b. Attach the buffer pointer to the Eigen matrix itself; then, in the copy function, if opencl_copy_flag is true, return the pointer instead of doing the copy.

I have to think about (b) more, but I'm pretty sure we could manage the cache space the same way as in (a).
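Very roughly, (a) could look something like this. All of the names here are made up for illustration, and the eviction policy is the naive least-used one:

#include <CL/cl.hpp>

#include <cstddef>
#include <map>

// Hypothetical cache entry: the device buffer plus a use counter, so the
// least-used entry can be evicted when the cache budget runs out.
struct cache_entry {
  cl::Buffer buffer;   // device-side copy of the matrix
  std::size_t bytes;   // size of the buffer
  std::size_t uses;    // how many times this entry has been hit
};

class matrix_cache {
  std::map<const void*, cache_entry> entries_;  // keyed on host data pointer
  std::size_t capacity_;                        // total cache budget in bytes
  std::size_t used_ = 0;

 public:
  explicit matrix_cache(std::size_t capacity) : capacity_(capacity) {}

  // Return the cached buffer if present (bumping its use count), else null.
  cl::Buffer* find(const void* key) {
    auto it = entries_.find(key);
    if (it == entries_.end())
      return nullptr;
    ++it->second.uses;
    return &it->second.buffer;
  }

  // Insert a new buffer, evicting the least-used entries until it fits.
  void insert(const void* key, const cl::Buffer& buf, std::size_t bytes) {
    while (used_ + bytes > capacity_ && !entries_.empty()) {
      auto victim = entries_.begin();
      for (auto it = entries_.begin(); it != entries_.end(); ++it)
        if (it->second.uses < victim->second.uses)
          victim = it;
      used_ -= victim->second.bytes;
      entries_.erase(victim);
    }
    entries_[key] = cache_entry{buf, bytes, 0};
    used_ += bytes;
  }
};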

One thing I'm uncertain about is how to handle in-place operations we perform on the data on the GPU. I'm thinking either:

  1. Turn off the flag, so we'll have to copy the matrix over again (a tiny sketch of this follows below)
  2. Keep a separate copy of the original matrix in the cache; this would make the original transfer a bit more expensive
  3. If we only do in-place operations on data where the other arguments are also data, then I think we could just skip those operations if we have already done them once?

(3) is sort of half-baked; I thought of it as I was biking to work this morning, but it sort of makes sense? Maybe there is some way to mix (3) and (2).
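A tiny sketch of what (1) could look like, with made-up names: any in-place GPU operation clears the flag, so the next use re-transfers the original host data.

#include <CL/cl.hpp>

// Hypothetical handle pairing a device buffer with the copy flag.
struct cached_gpu_matrix {
  cl::Buffer buffer;
  bool opencl_copy_flag = false;  // true while buffer matches the host data

  void invalidate() { opencl_copy_flag = false; }
};

void scale_in_place(cached_gpu_matrix& m, double alpha) {
  // ... enqueue the in-place scaling kernel on m.buffer ...
  m.invalidate();  // device copy no longer matches the host original,
                   // so the next transfer does a fresh copy
}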

Looking at an example like our standard GP code, where cov_exp_quad feeds into cholesky_decompose, I think we want something like:

  1. Pass x1 over once, do the cov_exp_quad on the GPU, storing K's buffer in the cache
  2. If nothing happens to K between cov_exp_quad and cholesky_decompose, then get K out of the cache instead of transferring it again.

I don't know how we would do (2) cheaply; maybe something in the parser?


To update this: I started messing around with a cache on the GPU on the branch below.

Here are some timings for our normal copy, in nanoseconds, for a 5K by 5K matrix:

Run   First Copy (ns)   Second Copy (ns)
  1        35608776          35344843
  2        35417013          35573540
  3        35392648          35526611
  4        35381343          35520795
  5        35407417          35473417
  6        35357669          35480738
  7        35290712          35566842
  8        35414041          35616623
  9        35378006          35621645
 10        35444495          35635690
 11        35391877          35687782

And below are the timings for the cached copy:

Run   First Copy (ns)   Second Copy (ns)
  1        36949949            290583
  2        36757017            335493
  3        37698100            280350
  4        36817978            281890
  5        36694587            285277
  6        36739636            279628
  7        37004972            280978
  8        36699635            300999
  9        36713995            279391
 10        37126617            285708
 11        36822464            289154

So it seems like filling the cache on the GPU costs about 1-3 extra milliseconds on the first transfer, but saves us about 35 milliseconds on the second one. So that's nice!

The implementation is not great at the moment, mostly because I have to carry a logical indicating whether the Eigen matrix's OpenCL buffer is already filled, instead of directly checking the buffer. My google-fu is failing me terribly right now and I can't figure out how to check whether a cl::Buffer is valid.
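One thing that might work, assuming the cl.hpp wrapper default-constructs cl::Buffer with a null cl_mem handle (which I believe it does), is checking the raw handle:

#include <CL/cl.hpp>

// A default-constructed cl::Buffer wraps a NULL cl_mem handle, so the handle
// itself can serve as an "is the buffer filled?" check. Note this only says
// the buffer was created, not that its contents are up to date.
inline bool buffer_is_set(const cl::Buffer& buf) {
  return buf() != NULL;  // operator() exposes the underlying cl_mem
}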


You might want to control the assembly for that on the CPU by using a likely or unlikely wrapper as a compiler hint. It's defined in

#include <stan/math/prim/scal/meta/likely.hpp>
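For example, if cache hits are expected to be the common case, the copy routine might hint the branch like this (a sketch; the function and its arguments are made up):

#include <stan/math/prim/scal/meta/likely.hpp>
#include <CL/cl.hpp>

// Hint to the compiler that the cache-hit branch is the hot path, so it
// keeps the fallback copy path out of line.
cl::Buffer copy_to_gpu(bool cache_hit, const cl::Buffer& cached) {
  if (likely(cache_hit))
    return cached;  // expected case: data is already on the device
  cl::Buffer fresh;
  // ... do the full host-to-device transfer into fresh ...
  return fresh;
}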

Bob, this might be the right thread to ask about your reservations about putting a cl::Buffer pointer on all Eigen matrices of doubles. I had previously asked around and done a few experiments (with poor coverage) that led me to think we never generate Stan code from the language that modifies an already existing Eigen matrix of doubles in the model block. @syclik made the point that arithmetic expressions like A + B generate new Eigen matrices of doubles (true), but they compile down to add(A, B), and all such functions return a new Eigen matrix. So if we add a cl::Buffer member to all Eigen matrices, it will be initialized to NULL even for arithmetic expressions like that. Can you think of other ways we might construct an Eigen matrix of doubles that changes values in the model block?
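For what it's worth, one concrete mechanism for attaching the buffer would be Eigen's plugin hooks, which inject extra members into matrix types. A rough sketch with made-up member names (a MatrixBase plugin would also bloat expression types, so whether that or PlainObjectBase is the right hook would need checking):

// In a config header, before the first Eigen include (and after CL/cl.hpp,
// so the plugin can see cl::Buffer):
//
//   #define EIGEN_MATRIXBASE_PLUGIN "opencl_buffer_plugin.h"
//
// Contents of opencl_buffer_plugin.h -- the two injected members:

mutable cl::Buffer opencl_buffer_;       // null handle until first transfer
mutable bool opencl_copy_flag_ = false;  // true once the device copy is live

// The copy function could then short-circuit, as in option (b) above:

#include <CL/cl.hpp>
#include <Eigen/Dense>

template <typename Derived>
const cl::Buffer& to_gpu(const Eigen::MatrixBase<Derived>& m) {
  if (m.opencl_copy_flag_)
    return m.opencl_buffer_;  // reuse: data already lives on the device
  // ... create the buffer and enqueue the host-to-device write here ...
  m.opencl_copy_flag_ = true;
  return m.opencl_buffer_;
}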

I'm not sure what you mean by "an already existing" matrix. You can't modify matrices declared in other blocks from within the model block. But within the model block, it's no holds barred. Our standard GP code does stuff like this:

model {
  matrix[K, K] Sigma;
  for (k in 1:K) for (k2 in 1:K) Sigma[k, k2] = ...;
  ...
}

But we also might want to do things like:

matrix[M, N] A;
A *= -0.5;

Yep, but within the model block right now we don't create new Eigen matrices of doubles other than as rvalues, right? Meaning we will do stuff like normal(beta * weights'), and beta * weights' will create two new Eigen matrices of doubles, but they will not be saved or modified after creation. As far as I can tell.

[edit] In both of your examples, I believe Sigma and A are both going to be matrices of var, right?

For now, yes. But they should be double and would be double if we had better type inference. It’s only for programming convenience that they’re all lifted to var (or whatever the template type is).

Also, when we evaluate the generated quantities block each iteration, it’s all done with double values if that matters.

That’s right—if there are intermediate rvalues, they will never get modified.