Stanc3 Math lib OpenCL integration

Would we be able to write such a rule that could act on an individual function basis such that the behavior we want emerges in most cases?

This all seems right to me. We would probably be keeping track of which vars have an _opencl__ equivalent available already. For returns from opencl functions, it seems like we might want to first assign to e.g. K_opencl__ and then also generate a K = from_matrix_cl(K_opencl__); (did I get the names right this time?) that our dead code elimination can remove if it isn’t used anywhere because we’ve updated the rest of the code to use the opencl version.

That’s not unreasonable. Once we’re closer to implementation of that stuff, we should float that idea past Bob and others first before going with it.

Yes, I believe that is doable. Once I have a list set up, I will try to write that down. There are 2 or 3 more Math PRs that need to be merged and then everything is ready on the Stan Math side. Tadej will also follow with the rest of the GLMs but we can work with the normal_id_glm.

Nice. That seems like a great and clean solution. And yes, you got it right :)


@seantalts I know you are pretty busy, so hopefully I am not bugging you too much with this.

The Math PRs that were merged now make the normal_id_glm_lpdf example with the simple changes to the compiler output (link again so you don't need to scroll) work on develop. They also make all other functions with the correct function signature work, but I would like to tackle GLMs first as these are the low-hanging fruit.

So I started looking at stanc3, I know my way around it a bit now, tried changing stuff and at least some of it worked :) These changes should probably not be a challenge at the level of adding a type, so hopefully I am able to contribute here.

So for now I would like to narrow down where these things will happen so I can narrow my focus. Looking at the Distinct Stanc Phases, this is going to happen at 7 or 7A, right? 8 is basically a printer of compilable C++ code, yes?

This is a pretty high priority for me! I have a lot on my plate and have been having trouble juggling / prioritizing but I just figured out a good ordering last night. I also have been sick the past 3 days, but today I’m starting to feel better.

First, that’s awesome! That was fast. And yes, the way we’ve been talking about it, these changes would go into the phase currently numbered at 7 (“Backend MIR->MIR transform in stan_math_backend/Transform_Mir.ml”). It may not fit into any of the passes currently in there; I think it’s fine if it’s a new pass.

So I’m thinking now we’re talking about 2 criteria and thus 2 lists of functions:

  1. functions we want to convert to opencl functions if any args are data
  2. functions we want to convert to opencl functions if any args are already available in opencl memory

Is that true and complete for the first pass?

I think we can have a single fold over expressions that will take care of any expressions in one go. But then I think we’ll need to run this iteratively over statements (to find any additional expressions we should convert) until the list of opencl vars stops growing.

Does that make sense?
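To sketch the shape of that outer iteration (not against the real MIR types — convert_statement here is a made-up placeholder for the single bottom-up fold over one statement's expressions):

module StrSet = Set.Make (String)

(* [convert_statement] is assumed to rewrite one statement and return it
   together with the possibly-grown set of variables that now have an
   opencl equivalent; we rerun it over all statements until that set
   stops growing. *)
let rec run_to_fixpoint convert_statement cl_vars stmts =
  let stmts_rev, cl_vars' =
    List.fold_left
      (fun (acc, vars) stmt ->
        let stmt', vars' = convert_statement vars stmt in
        (stmt' :: acc, vars'))
      ([], cl_vars) stmts
  in
  let stmts' = List.rev stmts_rev in
  if StrSet.equal cl_vars cl_vars' then (stmts', cl_vars')
  else run_to_fixpoint convert_statement cl_vars' stmts'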

Secondly, I’d be happy to pair with you over Twitch or something to try to give you a C++ programmer’s introduction to OCaml + compilers + our compiler, if you are interested and don’t mind us recording it. It seems like it would be fun, might attract folks to the project, and would be useful material for onboarding future folks (at least until we refactor everything, lol). If that sounds fun or useful, are you free sometime tomorrow after 6pm your time (lol), Monday after 5pm, or Tuesday after 3pm? The six-hour time difference does us no favors, sorry about that!

Yes, this is probably close. I am still thinking about this and will draft something more exhaustive tonight or tomorrow morning; I keep trashing everything I come up with once I sleep on it.

Yes, no problem, should be fun. The camera does not love me, but that will be the viewers’ problem :) I am fine with it; it would not make sense for you to make all that effort and not re-use it if possible.

I am fine with afternoons. I work from home most days except Mondays, and I usually do Stan or Stan-adjacent stuff in the early mornings and afternoons/evenings. I would maybe aim for Monday/Tuesday so I have some more time to prepare and don't embarrass myself too much. We can fine-tune scheduling via e-mail, I am pretty flexible.


Here is a draft of my idea. In the end it got a bit more complicated. This is how I imagine a MIR optimization works or could work. Please let me know if my thinking is reasonable and doable, and whether this is at least remotely how a MIR optimization would or could work.

  1. We start by adding X_opencl = to_matrix_cl(X) to the transformed data for all non-scalar data variables (X, Y, Z). If they won't be used, they get removed by dead code elimination, right?

  2. We then pass through the functions multiple times flagging them with CL if we want a matrix_cl overloaded function. We have the following lists of functions (I am bad at naming so the names are probably bad):

Selfish functions (S)

Functions that only help themselves: they consume matrix_cl arguments but return CPU values, so the next functions don't get any help from them. For now these are the GLMs. These are executed on the GPU if the arguments (either all arguments or an approved selection of arguments) are already on the GPU.

Fast functions (F)*

These are functions that will lead the move of the execution to the GPU. The matrix_cl overloaded function will be used if any of their arguments are already matrix_cl or if the result will definitely be consumed by a matrix_cl overloaded function. Once we implement the auto-selection these will get OpenCL overloads in every case, but not for now. Examples are mdivide_left_tri_low, mdivide_right_tri_low, multiply, cholesky_decompose, cov_exp_quad.

Opportunistic functions (O)

Functions that will only be used in cases when they will not increase the data transfer. An example is rep_matrix, which can reduce the transfer from NxM to 1; we would always use that overload when its result is consumed on the device. Another example would be A = add(B,C), where we would use a matrix_cl overload if B and C are matrix_cl, or if either A & B or A & C are on the device. Would it work if we labeled them like below, where 1 means the returned value is consumed and 2+ are the arguments?

add (1,2), (1,3) or (2,3)
transpose 1, 2
subtract (1,2), (1,3) or (2,3)
diag_matrix 1 (the consuming function must be on the device, otherwise we make things worse)
rep_matrix 1

In each pass we then execute the following rules:

S:

  • arguments are matrix_cl? -> CL

F:

  • any arguments matrix_cl -> CL
  • is the consuming function F or CL? -> CL

O (CL and F neighbours are treated the same):

  • does the function satisfy any combination in the list? -> CL

Every time we do a “->CL” we add a from_matrix_cl call. Redundant ones would be removed by dead code elimination.
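To make the rules a bit more concrete, here is a rough sketch of the per-function decision (the kind names and the helper are just mine for illustration, nothing like this exists in stanc3):

type kind = Selfish | Fast | Opportunistic

(* positions_on_cl: which positions (1 = returned value, 2+ = arguments, as in
   the labeling above) are already on the device; combos: the per-function list
   of acceptable combinations, e.g. [[1;2]; [1;3]; [2;3]] for add. *)
let should_move_to_cl kind ~positions_on_cl ~consumer_on_cl ~combos =
  let satisfied combo = List.for_all (fun i -> List.mem i positions_on_cl) combo in
  match kind with
  | Selfish -> List.exists satisfied combos          (* the approved arguments are already on the GPU *)
  | Fast -> positions_on_cl <> [] || consumer_on_cl  (* any argument on the GPU, or its result is consumed there *)
  | Opportunistic -> List.exists satisfied combos    (* any of the listed combinations holds *)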

If we can at least get the S part implemented for the next release, assuming Stanc3 will get in, that would be fantastic, as the GLMs are the lowest-hanging fruit here and there has been some interest in getting them supported.

If we take a part of the example above

matrix[N1, N1] K =   cov_exp_quad(x1, alpha, rho)
                         + diag_matrix(rep_vector(square(sigma), N1));
matrix[N1, N1] L_K = cholesky_decompose(K);

vector[N1] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
vector[N1] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';

We start with an empty list:

The passes would then go:

Nothing else is added after this pass so we stop. Am I going in the right direction at all?

I didn’t understand this method of labeling, sorry!

This one could be annoying to implement because it looks at the parent possibly before we do anything to the parent.
I think another way to frame this is: for any Selfish, Fast, or Opportunistic function we look at its arguments. If any of the arguments are CL, then we convert the other arguments to CL. I think we have to do this anyway and it was kind of implicit. If we're doing this, then we don't need to explicitly examine parents or neighbors, I think. This way we can keep the pass bottom-up, though the “convert args to opencl” will be a sort of bubble-down pass within the bottom-up pass. Convert args will traverse down until it reaches some expressions that wouldn't be converted to opencl by matrix_cl arguments (e.g. a literal, a local variable, a function that isn't S, F, or O, etc.) and we wrap that with to_matrix_cl.

Re the ordering - we don’t have code right now to end up with a full tree like what you’ve illustrated (nice!). Those 4 lines you have written will be separate statements in a list of statements. So we’ll need to process that statement by statement and keep track of some global state, e.g. which variables have opencl equivalents. So I think we’d probably start with the cov_exp_quad line and go bottom-up, noticing we can convert cov_exp_quad, doing nothing to diag_matrix, then getting to the + node and seeing one arg is already opencl and it’s O so we should convert the remaining args as well. That will bubble down into diag_matrix until it gets to a non-opencl expression and wrap that in to_matrix_cl (or if it’s a var, check if we have already converted it).

I’m not sure we need to generate the conversions of all data and transformed data - using this algorithm I think we’d end up with the appropriate list anyway…? Not sure. We’ll probably have to play around a lot with it anyway.
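As a toy sketch of that bubble-down conversion (a made-up mini expression type, not the real MIR; the _opencl__ naming and the overload list are just placeholders):

type expr =
  | Var of string
  | Lit of float
  | FunApp of string * expr list

(* Hypothetical list of functions that have a matrix_cl overload. *)
let has_cl_overload f =
  List.mem f [ "cov_exp_quad"; "cholesky_decompose"; "diag_matrix"; "add" ]

(* Once we've decided a call runs on the device, push the conversion down into
   its arguments: supported calls are converted recursively, variables we've
   already moved get their _opencl__ name, and anything else (a literal, an
   unconverted variable, an unsupported call) gets wrapped in to_matrix_cl. *)
let rec convert_args cl_vars e =
  match e with
  | Var v when List.mem v cl_vars -> Var (v ^ "_opencl__")
  | FunApp (f, args) when has_cl_overload f ->
      FunApp (f, List.map (convert_args cl_vars) args)
  | _ -> FunApp ("to_matrix_cl", [ e ])

So e.g. convert_args ["x1"] (FunApp ("cholesky_decompose", [Var "K"])) would leave cholesky_decompose in place but wrap K in to_matrix_cl, since K is not on the device yet.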

If you are thinking about the C++ compiler doing the elimination, you are probably wrong. My guess is that the OpenCL stuff is opaque to the C++ compiler, so it cannot tell if a certain buffer is used or not. However, if stanc3 does any dead code elimination, all this is probably fine.

I only half understand the rest of the discussion, but it is great we are moving forward.

Yep, stanc3 dead code elimination. It may need a strength upgrade to achieve this but it should be pretty doable with the information we have.

Another long-winded post, sorry :) I really just need to start coding this up and see what we are missing. Hopefully tomorrow's Twitch session will give me a jump start.

I did give a bad explanation there, sorry. In the case of A=add(B,C), A=1, B=2 and C=3. But that is irrelevant now, please ignore.

Ah, I should have asked before diving into the tree structure. I just wanted to get something out there and the idea sounded OK in my head.

I will try again with lists and the same example. Since I am not really sure how you handle the case of temporary variables, I am going to simplify this a bit by assigning names to everything.

matrix[N1, N1] K_1 = cov_exp_quad(x1, alpha, rho);
real sq = square(sigma);
vector[N1] r_v_1 = rep_vector(sq, N1);
matrix[N1, N1] K_2 = diag_matrix(r_v_1);
matrix[N1, N1] K = K_1 + K_2;
matrix[N1, N1] L_K = cholesky_decompose(K);
vector[N1] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
matrix[N1, N1] L_K_div_y1_T = L_K_div_y1';
vector[N1] K_div_y1 = mdivide_right_tri_low(L_K_div_y1_T, L_K);
row_vector[N1] K_div_y1_T = K_div_y1';

All indexes use the same convention. 0 denotes the result and function arguments start at 1. I use the term arguments for both in the following text. So for A = add(B,C) 0 = A, 1 = B, 2 = C.

We then have a predefined map of all opencl supported functions with the following information:

  • cl_args: list of indexes of arguments that are matrix_cl. If zero is not present, that is what I previously called a »selfish« function.

  • req_cl_args: list of lists of indexes of arguments that must already be on the device in order to »approve« this function. A function matches if it matches any of the lists.

  • fast: bool, denotes that the function is considered to be »fast« (really need a better name here). If a function is considered fast, its arguments (and result) are put in the »maybe« list if the conditions to be moved to OpenCL are not met.

We will also need to somehow denote if the function supports only doubles or also vars; this is omitted here for now as the only var-supported matrix_cl overloads are the GLMs.

Here are a few examples.

normal_id_glm: cl_args = [1,2], req_cl_args = [1,2], fast

multiply: cl_args = [0,1,2], req_cl_args = [[0], [1], [2]], fast

cov_exp_quad: cl_args = [0,1] (sigma and length scale are non-matrix-cl), req_cl_args = [[0], [1]], fast

add: cl_args = [0,1,2], req_cl_args = [[0,1], [0,2], [1,2]]

rep_matrix: cl_args = [0,1], req_cl_args = [0]

diag_matrix: cl_args = [0,1], req_cl_args = [0]

mdivide_left_tri_low: cl_args = [0,1,2], req_cl_args = [[0], [1], [2]], fast

mdivide_right_tri_low: same as mdivide_left_tri_low

cholesky_decompose: cl_args = [0,1], req_cl_args = [[0], [1]], fast

transpose: cl_args = [0,1], req_cl_args = [[0], [1]]

And then we have the 2 global lists of arguments that were moved to the device or are good candidates to be moved. Both are empty at the start. Let's name them var_CL and maybe_CL.
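As a data structure this could look roughly like this (my own sketch; plain string sets stand in for var_CL/maybe_CL and the map is just an assoc list with a few of the entries from above):

module StrSet = Set.Make (String)

type cl_info = {
  cl_args : int list;          (* positions that can be matrix_cl; 0 = result *)
  req_cl_args : int list list; (* the function matches if ANY of these combinations is on the device *)
  fast : bool;                 (* if nothing matches, still put its arguments into maybe_CL *)
}

let cl_functions =
  [ ("cholesky_decompose",
     { cl_args = [ 0; 1 ]; req_cl_args = [ [ 0 ]; [ 1 ] ]; fast = true });
    ("add",
     { cl_args = [ 0; 1; 2 ]; req_cl_args = [ [ 0; 1 ]; [ 0; 2 ]; [ 1; 2 ] ]; fast = false });
    ("diag_matrix",
     { cl_args = [ 0; 1 ]; req_cl_args = [ [ 0 ] ]; fast = false });
  ]

(* The two global lists, threaded through the passes as sets of variable names. *)
let var_cl_start = StrSet.empty
let maybe_cl_start = StrSet.empty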

We start by adding all data to var_CL. In this example there is no data, so we continue.

Then we start with the passes over the functions:

  • check which arguments are in var_CL and maybe_CL (treat them both the same)

  • if the conditions are met (matches in req_cl_arg):

    • put all arguments in var_CL

    • remove the moved arguments from maybe_CL

  • if the conditions are not met and the function is “fast”, put the arguments that are not already in var_CL into maybe_CL; otherwise just go to the next function

Repeat this until var_CL stops growing. Clear maybe_CL after each pass.
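And a sketch of how one pass could apply those rules, reusing the cl_info record and cl_functions map from the sketch above (a "call" here is just the function name plus the variable at each position, which is of course a simplification of the real MIR):

(* One call site: the function name plus the variable bound to each position
   (0 = result).  A stand-in for the real MIR statement. *)
type call = { fname : string; args : (int * string) list }

let process_call (var_cl, maybe_cl) { fname; args } =
  match List.assoc_opt fname cl_functions with
  | None -> (var_cl, maybe_cl)  (* not in the map: nothing to do *)
  | Some { cl_args; req_cl_args; fast } ->
      let var_at i = List.assoc_opt i args in
      let on_device i =
        match var_at i with
        | Some v -> StrSet.mem v var_cl || StrSet.mem v maybe_cl
        | None -> false
      in
      let cl_vars = StrSet.of_list (List.filter_map var_at cl_args) in
      if List.exists (List.for_all on_device) req_cl_args then
        (* conditions met: move all matrix_cl-capable arguments to var_CL *)
        (StrSet.union var_cl cl_vars, StrSet.diff maybe_cl cl_vars)
      else if fast then
        (* "fast" but not matched: remember its arguments as candidates *)
        (var_cl, StrSet.union maybe_cl (StrSet.diff cl_vars var_cl))
      else (var_cl, maybe_cl)

(* Whole passes: clear maybe_CL each time and repeat until var_CL stops growing. *)
let rec run_passes var_cl calls =
  let var_cl', _ = List.fold_left process_call (var_cl, StrSet.empty) calls in
  if StrSet.equal var_cl var_cl' then var_cl' else run_passes var_cl' calls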

First pass:

  • We start with K_1 = cov_exp_quad(x1, alpha, rho). The list of arguments on the device is empty so no match. The function is »fast« so we add both arguments (0 and 1) to maybe_CL.

    var_CL = []; maybe_CL = [x1, K_1]

  • real sq = square(sigma) no match in the map for square so continue

  • vector[N1] r_v_1= rep_vector(sq, N1) no matching arguments on the device, continue

  • matrix [N1,N1] K_2 = diag_matrix(r_v_1); same

  • matrix[N1, N1] K = K_1 + K_2; Only K_1 is in »maybe« → [1], which would need to match a list in [[0,1], [0,2], [1,2]], so moving on

  • matrix[N1, N1] L_K = cholesky_decompose(K); no matches but »fast« so add to maybe_CL

    New state: var_CL = []; maybe_CL = [x1, K_1, K, L_K]

  • vector[N1] L_K_div_y1 = mdivide_left_tri_low(L_K, y1); matches because [1] matches in [[0], [1]]

    New state: var_CL = [L_K, y1, L_K_div_y1]; maybe_CL = [x1, K_1, K]

  • matrix[N1, N1] L_K_div_y1_T = L_K_div_y1'; matches because L_K_div_y1 is on the device

    New state: var_CL = [L_K, y1, L_K_div_y1, L_K_div_y1_T]; maybe_CL = [x1, K_1, K]

  • vector[N1] K_div_y1 = mdivide_right_tri_low(L_K_div_y1_T, L_K); matches because both inputs are on the device

    New state: var_CL = [L_K, y1, L_K_div_y1, L_K_div_y1_T, K_div_y1]; maybe_CL = [x1, K_1, K]

  • row_vector[N1] K_div_y1_T = K_div_y1'; matches

    New state: var_CL = [L_K, y1, L_K_div_y1, L_K_div_y1_T, K_div_y1, K_div_y1_T]; maybe_CL = [x1, K_1, K]

Second pass:

  • Clear maybe_CL

  • K_1 = cov_exp_quad(x1, alpha, rho) again adds K_1 and x1 to the maybe_CL list.

    New state: var_CL = [L_K, y1, L_K_div_y1, L_K_div_y1_T, K_div_y1, K_div_y1_T]; maybe_CL = [x1, K_1]

  • real sq = square(sigma) same as first pass

  • vector[N1] r_v_1= rep_vector(sq, N1) same as first pass

  • matrix [N1,N1] K_2 = diag_matrix(r_v_1); same as first pass

  • matrix[N1, N1] K = K_1 + K_2; Only K_1 is in »maybe«, so again moving on

  • matrix[N1, N1] L_K = cholesky_decompose(K); L_K is in var_CL so adding K to it

    New state: var_CL = [L_K, y1, L_K_div_y1, L_K_div_y1_T, K_div_y1, K_div_y1_T, K]; maybe_CL = [x1, K_1]

  • The rest is already on the device, so it just goes through.

In the next pass K_1 and K_2 are added, then in the next pass r_v_1, and then sq. And it stops there.

After the last pass we go through it again, adding to_matrix_cl at the first occurrence of each argument in var_CL. Once we find an occurrence we delete it from the list.
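Continuing the sketch from above, that final rewrite could look something like this (the _opencl__ naming follows the earlier discussion; everything else is made up):

(* Walk the calls once more: the first time a var_CL variable is used, emit a
   to_matrix_cl call for it right before that statement and drop it from the
   pending set. *)
let insert_conversions var_cl calls =
  let conversion v = { fname = "to_matrix_cl"; args = [ (0, v ^ "_opencl__"); (1, v) ] } in
  let _, rev_out =
    List.fold_left
      (fun (pending, out) call ->
        let used =
          List.filter (fun v -> List.exists (fun (_, a) -> a = v) call.args)
            (StrSet.elements pending)
        in
        let pending' = List.fold_left (fun p v -> StrSet.remove v p) pending used in
        (pending', call :: List.rev_append (List.map conversion used) out))
      (var_cl, []) calls
  in
  List.rev rev_out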

It's not a perfect solution by any means, but it's a start, maybe :)