Looks like the screenshot of my post to Sean got cut off a bit, so I'm going to post it below. @rok_cesnovar let me know if I'm confusing anything below, but I'm pretty sure it's right.
With the sequential queue we use right now, there's a lot of throughput to and from the single GPU being left on the table! With async operations the GPU can read and write at the same time, both to its internal memory and to the host. Say we have 2 threads, 1 and 2. Right now we use an in-order queue, so multiplying A and B on both threads can, in the worst case, look something like:
{In the below, (i) is the thread and Ai, Bi, Ci are the associated matrices for that thread}
(1) Copy(A1) -> (1) Copy(B1) -> (1) Allocate(C1) -> (1) C1 = multiply(A1, B1) ->
(1) CopyToHost(C1) -> (2) Copy(A2) -> (2) Copy(B2) -> (2) Allocate(C2) ->
(2) C2 = multiply(A2, B2) -> (2) CopyToHost(C2) -> DONE
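To make the baseline concrete, here's a rough sketch of what one thread's in-order version looks like with the plain OpenCL C++ wrapper (this is not Stan Math's actual opencl_context code; the `multiply` kernel name and the buffer sizes are made up). With the default queue properties every command waits its turn, so thread 2's copies can't even start until thread 1's read-back finishes.

```cpp
#include <CL/cl2.hpp>
#include <vector>

// Rough sketch only -- assumes a built program with a "multiply" kernel
// and N x N double matrices; not the actual Stan Math code.
void in_order_multiply(cl::Context& ctx, cl::CommandQueue& queue,
                       cl::Kernel& multiply, const std::vector<double>& A,
                       const std::vector<double>& B, std::vector<double>& C,
                       std::size_t N) {
  const std::size_t bytes = N * N * sizeof(double);
  cl::Buffer dA(ctx, CL_MEM_READ_ONLY, bytes);
  cl::Buffer dB(ctx, CL_MEM_READ_ONLY, bytes);
  cl::Buffer dC(ctx, CL_MEM_WRITE_ONLY, bytes);  // Allocate(C)

  // Default (in-order) queue: each command waits for the previous one, so
  // Copy(A) -> Copy(B) -> multiply -> CopyToHost(C) run strictly in series.
  queue.enqueueWriteBuffer(dA, CL_TRUE, 0, bytes, A.data());  // Copy(A)
  queue.enqueueWriteBuffer(dB, CL_TRUE, 0, bytes, B.data());  // Copy(B)
  multiply.setArg(0, dA);
  multiply.setArg(1, dB);
  multiply.setArg(2, dC);
  queue.enqueueNDRangeKernel(multiply, cl::NullRange,
                             cl::NDRange(N, N), cl::NullRange);
  queue.enqueueReadBuffer(dC, CL_TRUE, 0, bytes, C.data());   // CopyToHost(C)
}
```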
But since the GPU can do those reads/writes and the computation at the same time, with an out-of-order queue we could have something like this:
{The [func(), func()] means two operations happen at the same time}
(1) Copy(A1) -> (1) Copy(B1) ->
(1,2) [Allocate(C1), Copy(A2)] ->
(1,2) [C1 = multiply(A1, B1), Copy(B2)] ->
(1,2) [CopyToHost(C1), Allocate(C2)] ->
(2) C2 = multiply(A2, B2) -> (2) CopyToHost(C2) -> DONE
So the GPU has much higher throughput and utilization in the above, since it will often be reading and writing at the same time. With the out-of-order queue we need to attach an ‘event’ list to each data transfer and kernel call. This makes sure the GPU knows it can’t execute a kernel until the data that kernel needs has finished transferring over to the GPU.
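Here's a rough sketch of what that looks like with the raw OpenCL C++ wrapper (again, not the actual Stan Math API; the kernel name and sizes are made up). The queue is created with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, the copies are non-blocking, and the kernel and read-back each get a wait list so they only fire once the data they need has landed. Anything without a dependency, like a second thread's copies into the same queue, is free to run at the same time.

```cpp
#include <CL/cl2.hpp>
#include <vector>

// Sketch: one out-of-order queue, with events expressing the dependencies
// drawn in the diagram above. Kernel name "multiply" and sizes are made up.
void async_multiply(cl::Context& ctx, cl::Device& dev, cl::Kernel& multiply,
                    const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, std::size_t N) {
  const std::size_t bytes = N * N * sizeof(double);
  cl::CommandQueue queue(ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);

  cl::Buffer dA(ctx, CL_MEM_READ_ONLY, bytes);
  cl::Buffer dB(ctx, CL_MEM_READ_ONLY, bytes);
  cl::Buffer dC(ctx, CL_MEM_WRITE_ONLY, bytes);

  // Non-blocking copies; each one hands back an event we can wait on.
  cl::Event copied_A, copied_B, multiplied, copied_C;
  queue.enqueueWriteBuffer(dA, CL_FALSE, 0, bytes, A.data(), nullptr, &copied_A);
  queue.enqueueWriteBuffer(dB, CL_FALSE, 0, bytes, B.data(), nullptr, &copied_B);

  // The kernel's wait list says: don't start until both copies are done.
  std::vector<cl::Event> inputs_ready{copied_A, copied_B};
  multiply.setArg(0, dA);
  multiply.setArg(1, dB);
  multiply.setArg(2, dC);
  queue.enqueueNDRangeKernel(multiply, cl::NullRange, cl::NDRange(N, N),
                             cl::NullRange, &inputs_ready, &multiplied);

  // The read-back only depends on the kernel, not on anything another
  // thread may have queued in the meantime.
  std::vector<cl::Event> result_ready{multiplied};
  queue.enqueueReadBuffer(dC, CL_FALSE, 0, bytes, C.data(), &result_ready,
                          &copied_C);
  copied_C.wait();  // block the host only when we actually need C
}
```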
With regard to (1) and (2) from Sean:
That’s one possible use. The other example where I see the out-of-order queue + MPI being useful is when you have large groups you want to do a Cholesky or multiply for, i.e. if you had 20 groups of size 5k then you can run those in batches with MPI. Then you get those benefits of reading/writing at the same time.
Yep! Since the context is a singleton, two processes*** would be submitting jobs to the same command queue for the GPU. The GPU doesn’t care that there are two processes; it just knows it’s receiving two jobs and needs to know where to send those jobs back to.
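Something like this is what I mean, minus all the Stan Math plumbing. This is a toy sketch: the `gpu_context` class, `submit_job`, and everything in them are made up for illustration, and it assumes the real opencl_context hands every thread the same shared queue.

```cpp
#include <CL/cl2.hpp>
#include <thread>

// Toy singleton standing in for Stan Math's opencl_context: one context,
// one device, one command queue for the whole process.
class gpu_context {
 public:
  static gpu_context& instance() {
    static gpu_context ctx;  // constructed once, shared by every thread
    return ctx;
  }
  cl::Context& context() { return context_; }
  cl::CommandQueue& queue() { return queue_; }

 private:
  gpu_context()
      : context_(CL_DEVICE_TYPE_GPU),
        device_(context_.getInfo<CL_CONTEXT_DEVICES>()[0]),
        queue_(context_, device_, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) {}
  cl::Context context_;
  cl::Device device_;
  cl::CommandQueue queue_;
};

void submit_job(int id) {
  // Both threads land in the same command queue; the GPU just sees two
  // jobs, and the buffers/events tell it where each result goes back to.
  cl::CommandQueue& q = gpu_context::instance().queue();
  cl::Buffer scratch(gpu_context::instance().context(), CL_MEM_READ_WRITE, 1024);
  int host_val = id;
  q.enqueueWriteBuffer(scratch, CL_TRUE, 0, sizeof(int), &host_val);
}

int main() {
  std::thread t1(submit_job, 1), t2(submit_job, 2);
  t1.join();
  t2.join();
}
```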
One thing I don’t know: if the MPI instance were a cluster of computers, how would a singleton work with that? Would they all share that singleton across the program, or would each worker in the cluster have its own singleton (that would be odd and make no sense to me, though)?
Maybe we don’t need to think about clusters until we handle multiple GPUs.
Yes, you need async for the above. The OpenCL command queue by default uses a first-in-first-out schedule, so it won’t optimize like the above. If you submit two reads followed by a write, OpenCL by default will do the reads first, then the write. But with the out-of-order queue and event handling we can tell OpenCL, “This write is not dependent on these reads, so do it now.”
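Concretely, the dependency (or lack of one) is just whatever you put in the wait list. Rough sketch with made-up buffer names, assuming `queue` is already an out-of-order queue: the two reads hand back events, and because the write's wait list doesn't mention them, the runtime is allowed to start the write immediately; if the write really did depend on the reads, you'd pass those events in.

```cpp
#include <CL/cl2.hpp>
#include <vector>

// Sketch: `queue` is an out-of-order queue; d_x, d_y, d_out are device
// buffers and x, y, out are host vectors, all `bytes` bytes long.
void independent_write(cl::CommandQueue& queue, cl::Buffer& d_x,
                       cl::Buffer& d_y, cl::Buffer& d_out,
                       std::vector<double>& x, std::vector<double>& y,
                       const std::vector<double>& out, std::size_t bytes) {
  cl::Event read_x, read_y, wrote_out;

  // Two non-blocking reads back to the host, each producing an event.
  queue.enqueueReadBuffer(d_x, CL_FALSE, 0, bytes, x.data(), nullptr, &read_x);
  queue.enqueueReadBuffer(d_y, CL_FALSE, 0, bytes, y.data(), nullptr, &read_y);

  // Empty wait list = "this write is not dependent on those reads",
  // so the out-of-order queue is free to start it right away.
  queue.enqueueWriteBuffer(d_out, CL_FALSE, 0, bytes, out.data(), nullptr,
                           &wrote_out);

  // If the write really did depend on the reads, we'd say so explicitly:
  // std::vector<cl::Event> deps{read_x, read_y};
  // queue.enqueueWriteBuffer(d_out, CL_FALSE, 0, bytes, out.data(), &deps,
  //                          &wrote_out);

  cl::Event::waitForEvents({read_x, read_y, wrote_out});
}
```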
I probably need to spend more time reading our MPI and threading implementations. Hopefully next week I’ll have time to build out an MVP of the out-of-order queue and we can see whether, and in what situations, we get speedups.
*** Because I don’t have an answer to how clusters work, assume when I say processes I’m effectively talking about multiple threads on the same computer