How to do caching on the GPU?

Mostly writing here to make my thoughts more solid.

We want a cache on the GPU so that data is only transferred once. I was thinking something like this:

  1. Add a flag opencl_copy_flag (?) to Eigen data matrices so we know whether or not a matrix has already been moved over to the GPU.
  2. We can either:
    a. Have a dictionary in the opencl_context that holds an identifier for the matrix, a pointer to the data buffer on the GPU, and a counter for how many times the matrix has been used, sort of like @lru_cache in Python. We would also need to keep track of the amount of space left in the cache: when a matrix is moved over, we check whether there is room for it, and if not we evict the least-used item. (A rough sketch of this is below.)
    b. Attach the buffer pointer to the Eigen matrix itself; then, in the copy function, if opencl_copy_flag is true, return the pointer instead of doing the copy.

I have to think about (b) more, but I'm pretty sure we could manage the cache space the same way as in (a).
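Very roughly, (a) could look something like this. All of the names here are made up for illustration, and the eviction policy is the naive least-used one:

#include <CL/cl.hpp>

#include <cstddef>
#include <map>

// Hypothetical cache entry: the device buffer plus a use counter, so the
// least-used entry can be evicted when the cache budget runs out.
struct cache_entry {
  cl::Buffer buffer;   // device-side copy of the matrix
  std::size_t bytes;   // size of the buffer
  std::size_t uses;    // how many times this entry has been hit
};

class matrix_cache {
  std::map<const void*, cache_entry> entries_;  // keyed on host data pointer
  std::size_t capacity_;                        // total cache budget in bytes
  std::size_t used_ = 0;

 public:
  explicit matrix_cache(std::size_t capacity) : capacity_(capacity) {}

  // Return the cached buffer if present (bumping its use count), else null.
  cl::Buffer* find(const void* key) {
    auto it = entries_.find(key);
    if (it == entries_.end())
      return nullptr;
    ++it->second.uses;
    return &it->second.buffer;
  }

  // Insert a new buffer, evicting the least-used entries until it fits.
  void insert(const void* key, const cl::Buffer& buf, std::size_t bytes) {
    while (used_ + bytes > capacity_ && !entries_.empty()) {
      auto victim = entries_.begin();
      for (auto it = entries_.begin(); it != entries_.end(); ++it)
        if (it->second.uses < victim->second.uses)
          victim = it;
      used_ -= victim->second.bytes;
      entries_.erase(victim);
    }
    entries_[key] = cache_entry{buf, bytes, 0};
    used_ += bytes;
  }
};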

One thing I'm uncertain about is how to handle in-place operations we perform on the data on the GPU. I'm thinking either:

  1. Turn off the flag, so we'll have to copy the matrix over again (a tiny sketch of this follows below)
  2. Keep a separate copy of the original matrix in the cache; this would make the original transfer a bit more expensive
  3. If we only do in-place operations on data where the other arguments are also data, then I think we could just skip those operations if we have already done them once?

(3) is sort of half-baked; I thought of it as I was biking to work this morning, but it sort of makes sense? Maybe there is some way to mix (3) and (2).
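A tiny sketch of what (1) could look like, with made-up names: any in-place GPU operation clears the flag, so the next use re-transfers the original host data.

#include <CL/cl.hpp>

// Hypothetical handle pairing a device buffer with the copy flag.
struct cached_gpu_matrix {
  cl::Buffer buffer;
  bool opencl_copy_flag = false;  // true while buffer matches the host data

  void invalidate() { opencl_copy_flag = false; }
};

void scale_in_place(cached_gpu_matrix& m, double alpha) {
  // ... enqueue the in-place scaling kernel on m.buffer ...
  m.invalidate();  // device copy no longer matches the host original,
                   // so the next transfer does a fresh copy
}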

Looking at an example like our standard GP code, where cov_exp_quad feeds into cholesky_decompose, I think we want something like:

  1. Pass x1 over once, do the cov_exp_quad on the GPU, storing K's buffer in the cache
  2. If nothing happens to K between cov_exp_quad and cholesky_decompose, then get K out of the cache instead of transferring it again.

I don't know how we would do (2) cheaply; maybe something in the parser?


To update this: I started messing around with a cache on the GPU on the branch below.

Here are some timings for our normal copy, in nanoseconds, for a 5K by 5K matrix:

Run   First Copy (ns)   Second Copy (ns)
  1        35608776          35344843
  2        35417013          35573540
  3        35392648          35526611
  4        35381343          35520795
  5        35407417          35473417
  6        35357669          35480738
  7        35290712          35566842
  8        35414041          35616623
  9        35378006          35621645
 10        35444495          35635690
 11        35391877          35687782

And below are the timings for the cached copy:

Run   First Copy (ns)   Second Copy (ns)
  1        36949949            290583
  2        36757017            335493
  3        37698100            280350
  4        36817978            281890
  5        36694587            285277
  6        36739636            279628
  7        37004972            280978
  8        36699635            300999
  9        36713995            279391
 10        37126617            285708
 11        36822464            289154

So it seems like filling the cache on the GPU costs about 1-3 extra milliseconds on the first transfer, but saves us about 35 milliseconds on the second one. So that's nice!

The implementation is not great at the moment, mostly because I have to carry a logical indicating whether the Eigen matrix's OpenCL buffer is already filled, instead of directly checking the buffer. My google-fu is failing me terribly right now and I can't figure out how to check whether a cl::Buffer is valid.
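One thing that might work, assuming the cl.hpp wrapper default-constructs cl::Buffer with a null cl_mem handle (which I believe it does), is checking the raw handle:

#include <CL/cl.hpp>

// A default-constructed cl::Buffer wraps a NULL cl_mem handle, so the handle
// itself can serve as an "is the buffer filled?" check. Note this only says
// the buffer was created, not that its contents are up to date.
inline bool buffer_is_set(const cl::Buffer& buf) {
  return buf() != NULL;  // operator() exposes the underlying cl_mem
}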


You might want to control the assembly for that on the CPU by using a likely or unlikely wrapper as a compiler hint. It's defined in

#include <stan/math/prim/scal/meta/likely.hpp>
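For example, if cache hits are expected to be the common case, the copy routine might hint the branch like this (a sketch; the function and its arguments are made up):

#include <stan/math/prim/scal/meta/likely.hpp>
#include <CL/cl.hpp>

// Hint to the compiler that the cache-hit branch is the hot path, so it
// keeps the fallback copy path out of line.
cl::Buffer copy_to_gpu(bool cache_hit, const cl::Buffer& cached) {
  if (likely(cache_hit))
    return cached;  // expected case: data is already on the device
  cl::Buffer fresh;
  // ... do the full host-to-device transfer into fresh ...
  return fresh;
}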

Bob, this might be the right thread to ask about your reservations about putting a cl::Buffer pointer on all Eigen matrices of doubles. I had previously asked around and done a few experiments (with poor coverage) that led me to think we never generate Stan code from the language that modifies an already existing Eigen matrix of doubles in the model block. @syclik made the point that arithmetic expressions like A + B generate new Eigen matrices of doubles (true), but they compile down to add(A, B), and all such functions return a new Eigen matrix. So if we add a cl::Buffer member to all Eigen matrices, it will be initialized to NULL even for arithmetic expressions like that. Can you think of other ways we might construct an Eigen matrix of doubles that changes values in the model block?
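For what it's worth, one concrete mechanism for attaching the buffer would be Eigen's plugin hooks, which inject extra members into matrix types. A rough sketch with made-up member names (a MatrixBase plugin would also bloat expression types, so whether that or PlainObjectBase is the right hook would need checking):

// In a config header, before the first Eigen include (and after CL/cl.hpp,
// so the plugin can see cl::Buffer):
//
//   #define EIGEN_MATRIXBASE_PLUGIN "opencl_buffer_plugin.h"
//
// Contents of opencl_buffer_plugin.h -- the two injected members:

mutable cl::Buffer opencl_buffer_;       // null handle until first transfer
mutable bool opencl_copy_flag_ = false;  // true once the device copy is live

// The copy function could then short-circuit, as in option (b) above:

#include <CL/cl.hpp>
#include <Eigen/Dense>

template <typename Derived>
const cl::Buffer& to_gpu(const Eigen::MatrixBase<Derived>& m) {
  if (m.opencl_copy_flag_)
    return m.opencl_buffer_;  // reuse: data already lives on the device
  // ... create the buffer and enqueue the host-to-device write here ...
  m.opencl_copy_flag_ = true;
  return m.opencl_buffer_;
}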

I'm not sure what you mean by "an already existing" matrix. You can't modify matrices declared in other blocks from within the model block. But within the model block, it's no holds barred. Our standard GP code does stuff like this:

model {
  matrix[K, K] Sigma;
  for (k in 1:K) for (k2 in 1:K) Sigma[k, k2] = ...;
  ...
}

But we also might want to do things like:

matrix[M, N] A;
A *= -0.5;

Yep, but within the model block right now we don't create new Eigen matrices of doubles other than as rvalues, right? Meaning we will do stuff like normal(beta * weights'), and beta * weights' will create two new Eigen matrices of doubles, but they will not be saved or modified after creation. As far as I can tell.

[edit] In both of your examples, I believe Sigma and A are both going to be matrices of var, right?

For now, yes. But they should be double and would be double if we had better type inference. It’s only for programming convenience that they’re all lifted to var (or whatever the template type is).

Also, when we evaluate the generated quantities block each iteration, it’s all done with double values if that matters.

That’s right—if there are intermediate rvalues, they will never get modified.