Stanc3 Math lib opencl integration

We started talking about this on an unrelated PR on github and I wanted to bring it to discourse.

We’re in the planning phases of trying to get the stanc3 compiler to emit models that can take better advantage of the GPU functionality @rok_cesnovar @stevebronder @tadej et al have been building into Stan Math. The most obviously annoying issue is that right now we are forced to copy data matrices to and from the GPU every leapfrog step even though it is impossible for them to change. Rok mentioned this is important to them,

Can we help in any way with the compiler support? This is probably the no.1 priority for our team, the other one being the matrix_cl. Will this be a flag on the Eigen matrix that will mean this Matrix’s contents are immutable?

I’m thinking the only thing we’d need from the Math library would be the GPU function overloads and a list of them that we can use during compilation. There are a few ideas for how we could use this, perhaps the simplest would be that whenever we see a call to one of these functions with a data or parameter argument, e.g. y ~ bernoulli_logit_glm(x, alpha, beta) where x and y are data, we can create a new variable x_opencl__ = to_opencl(x); (and same for y) in the transformed data block and then replace the call with y_opencl__ ~ bernoulli_logit_glm(x_opencl__, alpha, beta)). I think @Matthijs might have originally come up with this a few months ago. This seems super easy to do and I think it’s fairly general and solves our lowest hanging fruit… definitely open to other ideas!

Would it help if we were present for the meeting you mentioned on discourse?

I think the most helpful thing would be a regularized list of function signatures for GPU functions noting which arguments are overloaded for matrix_cl. That meeting is a little more about how to abstract this idea along with a few other ideas, but in retrospect I don’t think we need to solve that abstraction first.

There’s a few details that need to be filled in:

  1. Are current overloads all-or-nothing? Meaning if x and y are matrix_cl but alpha and beta aren’t in the above example, what happens? I think if so we can just generate to_opencl assignments for everything and then have dead code elimination remove any redundant ones.
  2. Do any of the functions change their arguments?
  3. For functions that return a matrix, is it always a matrix_cl?

I think this plan actually sounds pretty feasible and robust even though it’s extremely simple. I should be able to spend hopefully 2 days a week on this - it’s a really killer feature for the new compiler and really showcases our GPU support so I’m really excited to work on it. If any of y’all are excited to dive into the new compiler, I’d be happy to pair with you on it! My friend has been bugging me to start a twitch stream of me programming; this might be a good first go at it :P

1 Like

Thanks Sean this would be awesome!

I can make a PR that exposes the functions we have that can use the gpu in reverse mode

They are but I think it would be pretty easy for to make ones that take mixed inputs

`cholesky_decompose does, but again think it would be easy to write one that does not. We could also have an inplace method.

Everything in the opencl folder returns a matrix_cl. Would it be easier for them to return an eigen matrix or matrix_cl?

Only it’s in the form of a call in show so I can crank call about how my code always definitely compiles on the first try no matter what. Or a pre taped call in show

Steve has already anwsered some of this in the meantime.

At first glance this should be it yes. We can also try things out playing with the generated C++ with no changes to the compiler so that is great.

@tadej already tried something along these lines for the generated quantities blocks in Gaussian processes and it proved really efficient (adding to_matrix_cl and from_matrix_cl to and before the block in the generated C++ code). It cut down 15-20% of exec. time even though generated quantities is not the bottleneck there.

Will get that ready together with STeve. Just post it here or add via PR to the function signatures file somehow?

All other functions except for the normal_id GLM are all-or-nothing. And they also return matrix_cl.
The only merged GLM is normal_id and the OpenCL calls there are currently inside the prim/mat/prob function.
For this approach to work with GLMs, we will have to change the interface to the GPU GLM function so that at least x and y are matrix_cls.
My 2 cents would be to change it so that x and y are matrix_cls (we have data on the GPU that way) and the GLMs still return the result of operand_and_partials.build(). Does that complicate things? My naive thought is that it doesn’t if we can add the information that the output is back on the CPU somehow? So no from_matrix_cl is needed after it.

For the first showcase of this feature I would target /prim functions and GLMs. Of those, cholesky_decompose is the only function where the arugment is also [out].

Yeah, excited for this also. If we can break the ice on this, entire uninterrupted gradients on the GPU becomes feasible without any weird trickery which is really awesome. Really appreciate your work on this.

I have been following Steve’s work intermitently and will take you up on this offer!

That would definitely make me join Twitch. Though I still think a morning radio show with Steve called “aST and the Refactor” might be more succesful :)

The short term plan on our end is therefore:

  • make a list of functions that compiler would try to use OpenCL with
  • change the interface of any functions not compatible with this approach (cholesky_decompose for example, I think there are not many more)
  • add kernels for simple primitive functions that we didn’t target before and expose them (rep_vector, diag_matrix,…). These are pretty simple, but we haven’t done them before because it made no sense to do them without compiler support.
  • change the interface to the GPU GLMs
  • make a working example by modifying the .hpp with the modified interface
  • continue merging the rest of the GLMs

We should also continue the work on matrixl_cl<var> to enable similar approach for the current GPU supported /rev functions. Not sure how doable this is in this timeframe.

Sean do we need to also have an implementation of matrix_cl<var>? I was working on that for a bit then Rok was working on it. Rok if you made any progress on that and need help lmk! The matrix itself is composed, but I got snagged in passing the results back to a pointer of vars.

I would say we aim to support the overloaded primitive functions and the GLMs which was what we were aiming for anyways. If we add some var stuff that is great but I would not push it too much.

I made some progress but as usual I start jumping from one thing to another. I am now hyped for this solution and will make a push to have something ready this week.

1 Like

We need to figure out how to add it; it probably won’t go in signatures just yet but rather live somewhere in the backend since this is Stan Math specific “instruction selection” sorta.

Gotcha, so it sounds we’ll need to mark the return value as well as being either CPU or GPU, which is fine.

😂😂😂

Hmm, there’s a few ways we can go and I don’t have much experience remote pairing. Let’s try setting a time up in email and maybe trying to pair over twitch (or something like that that we can record).

Y’alls plan sounds good. re: matrix_cl<var> it seems like that could enable doing that “simple approach” above for parameters as well more easily.

Ok, @tadej has started tweaking the GPU GLMs to conform with this plan and we had a quick discussion on this topic and have a first question.

Is it possible for the compiler to select the CPU/GPU version also based on if some input is defined in data or parameters?

We are asking this because in GLMs if x and y are data we can do the to_matrix_cl(x) and to_matrix_cl(y) in the transformed data and then the data stays there and we can have the GLM signature normal_id_glm_lpdf(matrix_cl, matrix_cl, ...) be just an overloaded version of the existing CPU GLM signature.

But if x and y are parameters (which as far us two understand is far from the typical use-case, idk if that is correct?) we cant do that. In that case we would need to transfer the data each iteration and we need to support matrix_cl<var> in order to stay an overloaded version and things start to complicate obviously. The speedups when transferring every iteration will also obviously be much worse. At least until most of the functions have GPU support.

Does deciding based on this complicate things?

Ah, this is a good question… So in general we will have the information available that says where each function argument came from, at least whether it’s data, parameters, or a local variable. We can use that to make all kinds of rules of varying complexity. The first version could be that we only turn on the GPU versions of functions when at least one argument is data, or something like that? But it might be the kind of thing where we actually want information about each GPU function that says what the rule is for switching to the GPU version. We can even have the compiler emit runtime checks for sizes of various matrices if we want to, similar to what is inside many of the GPU functions now (and eventually we’ll add the ability to compile a Stan model for a specific set of sizes, so those can become compile-time checks). You guys will definitely need to be involved in designing that system as I don’t have a good intuition for what those rules would look like or how complicated the system for expressing them needs to be.

Do you want to sketch out a loose spec of it here? do you know the criteria for switching for each function?

We can also just start super simple with the “at least one data arg” approach and still see big speedups in many cases I think. I’m assuming we’ll add a flag to the compiler that turns on the generation of the matrix_cl objects.

Thanks! You are hyping me for real, this solves a bunch of our problems :)

One quick question before I write down the quick spec.
Would a rule “if at least 2 consecutive functions have opencl overloads use them, as long as one of them is from list X” be considered complicated or tough to implement?
List X would be a list of matrix_cl overloaded functions that are faster on the GPU standalone (at some non-exagerated input size). Those are for example multiply, cholesky, mdivide, gp_cov_exp_quad.
We just want to prevent moving to OpenCL just to do diag_matrix(rep_vector(sigma, N1)). But if it was

matrix[N1, N1] K =   cov_exp_quad(x1, alpha, rho)
                         + diag_matrix(rep_vector(square(sigma), N1));

it would be fine.

Otherwise, what I had in mind for the long term would be that before each block of consecutive opencl functions we would emit something like

run_opencl_section_x = should_we_run_section_with_opencl("section x", matrices...)
if(run_opencl_section_x){
  use opencl version
}else{
  use cpu version
}
record_section_time("section x");

This would try out both versions and decide at runtime what to do. This would be interesting for general heterogeneous HPC research. But that is for later. We would want to start simple yes.

1 Like

Interesting, of course this is a very important use case. What if we first went with individual rules per function and then add logic or rules to the system that said something like

  1. “if any of these arguments is already available on the GPU, then switch to that argument”
  2. “if this fn is part of a function call that has a GPU equivalent and the other arguments are already on the GPU, then move this one to the GPU as well”

So in your example above, we might first examine the Plus operation and see that its left child (cov_exp_quad) is worth putting on the GPU (this relies on a rule for cov_exp_quad alone that would claim it is individually worth doing on the GPU - would that exist in most cases like this?). Then we would see that the right child of Plus is also capable of being hoisted to the GPU, do that, and then that would cause Plus itself to be hoisted to GPU (assuming both cov_exp_quad and diag_matrirx return matrix_cls, which would cause the overload for + to use the GPU). (If there is a separate function for addition on the GPU we can also code generate that instead).

Very cool! Might be best to put that entire opencl section into a function then and wrap that function with a higher order function that does the benchmarking… but then we also need to save those results somewhere so we don’t do the benchmarking every iteration…

That seems fine yes.

Sorry, not sure I understand the question. Can you clarify what you mean by “would that exist”? Might be the long day idk :)

Yeah, if you have at least one argument on the GPU, there is no reason to move data back to the CPU except after the last function.

My gut feeling is also that once you have two consecutive functions that can execute on the GPU you should do it every time, no matter the input size. A multiply of [N,N] x [N,N] for instance is faster on a moderate GPU if N > 100. If any of the input arguments dont require transfer there is no doubt this is faster for any N, the same if multiply is followed by a opencl function.

We have operators for +,-,* for matrix_cl and operators for scalar*matrix_cl and matrix_cl*scalar.

We can do almost this entire function in the generated quantities block on the GPU right now (there is a PR for rep_matrix and diag_matrix will also be added this week), except for multi_normal_rng.

vector gp_pred_rng(real[] x2,
                     vector y1, real[] x1,
                     real alpha, real rho, real sigma, real delta) {
    int N1 = rows(y1);
    int N2 = size(x2);
    vector[N2] f2;
    {
      matrix[N1, N1] K =   cov_exp_quad(x1, alpha, rho)
                         + diag_matrix(rep_vector(square(sigma), N1));
      matrix[N1, N1] L_K = cholesky_decompose(K);

      vector[N1] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
      vector[N1] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';
      matrix[N1, N2] k_x1_x2 = cov_exp_quad(x1, x2, alpha, rho);
      vector[N2] f2_mu = (k_x1_x2' * K_div_y1);
      matrix[N1, N2] v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
      matrix[N2, N2] cov_f2 =   cov_exp_quad(x2, alpha, rho) - v_pred' * v_pred
                              + diag_matrix(rep_vector(delta, N2));
      f2 = multi_normal_rng(f2_mu, cov_f2);
    }
    return f2;
  }

In this case you would start with the + in the first line right? You find cov_exp_quad → use openCL. for the diag_matrix(rep_vector(square(sigma),N1))) we would have to transform to diag_matrix(rep_vector(to_matrix_cl(square(sigma)),N1))) We have the constructor for matrix_cl for scalar, can that be used, otherwise we can add a to_matrix_cl for a scalar, that is nbd.

All the remaining function then have matrix_cl overloads until you encounter multi_normal_rng which has to be changed to f2 = multi_normal_rng(from_matrix_cl(f2_mu), from_matrix_cl(cov_f2));

Am I making any sense here? :) We are operating in the double world of the Stan Math which makes life easier, but you have to start somewhere.

1 Like

Maybe a map in the opencl_context? Would that be ok?

@tadej has as PR up that changes the normal_id GPU GLM function to the new signature where x and y are matrix_cls https://github.com/stan-dev/math/pull/1358

With that branch active on Math I tried manually changing the source from the compiler. With the changes here https://www.diffchecker.com/ekiNM0rn I get the expected results (execution time on medium sized data GLM for 200 warmup and 200 sampling iterations goes down from 330s to 80s with 50 seconds spent on IO).

I did have to change the matrix_cl so that the rows and cols members are not const so I could not overwrite those values in the move constructor:

  /**
   * Move a \c matrix_cl to another
   */
  matrix_cl<T>& operator=(matrix_cl<T>&& a) {
    // Need to wait for all of matrices events before destroying old buffer
    this->wait_for_read_write_events();
    buffer_cl_ = a.buffer();
    view_ = a.view();
    write_events_ = std::move(a.write_events_);
    read_events_ = std::move(a.read_events_);
    rows_ = a.rows();
    cols_ = a.cols();
    return *this;
  }

I had to do this in order to redefine the global matrix_cl<double> y_cl; and matrix_cl<double> X_cl;.Otherwise it kept telling me that the number of rows and cols dont match. Is there a better solution to this @stevebronder ?

1 Like

Where is that move used? I’m fine with it though It’s kind of odd to me that we do a move here into an object that’s not the same size. Where is that being used in the code?

I modified the move constructor.

Oh no sorry I meant where is this particular move happening that’s causing the error?

See here: https://www.diffchecker.com/ekiNM0rn

Edit: that is the compiler output hpp file.

Ah! So yeah we could remove the const there the move assignment moves everything over so I think it’s fine. Or could it use a constructor? X_cl(X);. If that’s annoying then I’m fine with making in non-const. Let’s check the accessor for both of them is const and that should be fine