OpenCL + MPI

@stevebronder and I were having a discussion that ended up veering into what combining MPI and GPU would look like. I’m going to try to summarize and then continue here. We had a few open questions:

  1. Is the only use case for MPI + GPU for when your GPU doesn’t have enough RAM to compute your entire likelihood at once?
    1. In broad strokes, GPUs gain heavily from increased vectorization (which also reduces the number of transfers). So the fastest option is likely 1 thread with everything vectorized, rather than 4 threads each running the same workload on smaller subsets of the data; the latter is only necessary if the data doesn’t fit on the GPU. If that’s true, I don’t think you’d use MPI + GPU for speed in normal models, you’d just use it when your dataset doesn’t fit in GPU RAM. In that case async would probably help, unless multiple MPI processes using the GPU automatically get the benefits of async just by being in different processes.
  2. Async IO for GPUs should let you do reads and writes at the same time, which Steve illustrated here:
    [screenshot of Steve’s illustration] Questions here: can two processes use the GPU at the same time? If so, do they need to use async OpenCL in order to get the benefits described above? Naively it would seem the GPU could schedule things appropriately, if that’s even possible.

cc @rok_cesnovar

I’m at work but will post more in depth about how I think OpenCL + threading/MPI will work when I get home.

Looks like the post Sean screenshotted got cut off a bit, so I’m going to post it below. @rok_cesnovar lmk if I’m confusing anything below, but I’m pretty sure it’s right.


With the sequential queue we use right now, a lot of throughput to and from the single GPU is being left on the table! With async, the GPU can read and write at the same time, both to its internal memory and to the host. Say we have 2 threads, 1 and 2. Right now we use an in-order queue, so multiplying A and B on both threads can, in the worst case, look something like:

{In the below (i) is the thread and Ai, Bi, Ci are the associated matrices for that thread}

(1) Copy(A1) -> (1) Copy(B1) -> (1) Allocate(C1) -> (1) C1 = multiply(A1, B1) ->
 (1) CopyToHost(C1) -> (2) Copy(A2) -> (2) Copy(B2) -> (2) Allocate(C2) ->
 (2) C2 = multiply(A2, B2) -> (2) CopyToHost(C2) -> DONE 

But since the GPU can do reads/writes and computation at the same time, with an out-of-order queue we could have something like this:

{The [func(), func()] means two operations happen at the same time}

(1) Copy(A1) -> (1) Copy(B1) -> 
(1,2) [Allocate(C1), Copy(A2)] ->
(1,2) [C1 = multiply(A1, B1), Copy(B2)] ->
(1,2) [CopyToHost(C1), Allocate(C2)] ->
 (2) C2 = multiply(A2, B2) -> (2) CopyToHost(C2) -> DONE 

So the GPU has much higher throughput and utilization in the above, since it will often be reading and writing at the same time. With the out-of-order queue we need to attach an ‘event’ list to each data transfer and kernel call. This makes sure the GPU knows it can’t execute a kernel until the data that kernel needs has finished transferring over to the GPU.
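The throughput gain from the two schedules above can be sketched with a toy timing model. This is purely an illustration (not the OpenCL API), under the assumption that every operation takes one time unit and the GPU can overlap one transfer with one allocation or kernel:

```python
# Toy timing model for the two schedules above. Assumption for
# illustration: every operation takes one time unit, and the GPU can
# overlap a host<->device transfer with an allocation or a kernel.

# In-order queue: every operation runs back to back.
in_order = [
    "Copy(A1)", "Copy(B1)", "Allocate(C1)", "multiply(A1,B1)", "CopyToHost(C1)",
    "Copy(A2)", "Copy(B2)", "Allocate(C2)", "multiply(A2,B2)", "CopyToHost(C2)",
]

# Out-of-order queue: each inner list is a set of operations the GPU
# can run concurrently because no event dependency links them.
out_of_order = [
    ["Copy(A1)"],
    ["Copy(B1)"],
    ["Allocate(C1)", "Copy(A2)"],
    ["multiply(A1,B1)", "Copy(B2)"],
    ["CopyToHost(C1)", "Allocate(C2)"],
    ["multiply(A2,B2)"],
    ["CopyToHost(C2)"],
]

in_order_time = len(in_order)          # 10 time units
out_of_order_time = len(out_of_order)  # 7 time units
print(in_order_time, out_of_order_time)  # prints: 10 7
```

Even in this tiny two-thread example the overlap cuts 3 of 10 steps; with more threads submitting work, more transfer steps hide behind compute.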


wrt (1) and (2) from Sean:

That’s one possible use. The other example I see for the out-of-order queue + MPI being useful is when you have large groups you want to do a Cholesky or multiply for. I.e., if you had 20 groups of size 5k, then you can run those in batches with MPI. Then you get the benefits of reading and writing at the same time.

Yep! Since the context is a singleton, two processes*** would be submitting jobs to the same command queue for the GPU. The GPU doesn’t care that there are two processes; it just knows it’s receiving two jobs and needs to know where to send the results back to.

One thing I don’t know is: if the MPI instance was a cluster of computers, how does a singleton work with that? Would they all share that singleton over the program, or would each worker in the cluster have its own singleton (that would be odd and make no sense to me, though)?

Maybe we don’t need to think about clusters until we handle multiple gpus.

Yes, you need async for the above. The OpenCL context by default uses a first-in, first-out schedule, so it won’t optimize like the above. If you submit two reads followed by a write, OpenCL by default will do the reads first, then the write. But with the out-of-order queue and event handling we can tell OpenCL, “This write is not dependent on these reads, so do it now.”
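The event-list idea can be modeled with a tiny scheduler sketch. This is an illustration of the dependency logic, not the OpenCL API; the operation names are made up. Each operation lists the “events” (other operations) it must wait for, and anything with satisfied dependencies runs in the same wave:

```python
# Toy model of event-based dependency scheduling. Each operation maps
# to the list of operations it must wait on; operations with all
# dependencies satisfied can run concurrently in one "wave".

ops = {
    "read_A":  [],                    # transfer A to the GPU
    "read_B":  [],                    # transfer B to the GPU
    "write_C": [],                    # an unrelated write back to the host
    "kernel":  ["read_A", "read_B"],  # kernel waits on both transfers
}

done, waves = set(), []
while len(done) < len(ops):
    # everything whose dependencies are already done runs concurrently
    wave = sorted(op for op, deps in ops.items()
                  if op not in done and all(d in done for d in deps))
    waves.append(wave)
    done.update(wave)

print(waves)  # [['read_A', 'read_B', 'write_C'], ['kernel']]
```

A FIFO queue would run these four operations one at a time in submission order; with event handling, the unrelated `write_C` overlaps the reads and only the kernel actually waits.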


I probably need to spend more time reading our MPI and threading implementations. Hopefully next week I’ll have time to build out an MVP of the out-of-order queue and we can see if, or in what situations, we get speedups.

*** Because I don’t have an answer to how clusters work, assume when I say processes I’m effectively talking about multiple threads on the same computer

I just want to highlight again that processes and threads are different: with multiple processes you will not have access to literally the same OpenCL CommandQueue object, but you will be running on the same computer trying to use the same GPU. It looks like you can share a GPU across separate processes, so that’s good. My question is then: does that sharing bring us the same async benefits, such that we don’t actually need to code async stuff because we get it for free when two processes are running and being time-multiplexed by the GPU driver?

I don’t think you are confusing anything. The example you showed should have higher throughput.

Ah, apologies! I was using the word processes wrong

Maybe? That makes sense to me. If we had 4 chains running, they would each have their own context/queue for a single GPU. Yeah, it makes sense that once one process starts computing, it would let another process start reading in data.

MPI is for any time you can cleanly divide up the expression graph into independent subgraphs that can be farmed out. GPU is for any time you have big matrix ops. They’re independent, but can be combined when you have multiple cores, each with its own GPU support (as you can get, say, on AWS or on our cluster here at Columbia).

For example, if you have a huge N x K matrix and a K x 1 coefficient vector, you can divide into 100 shards of size N / 100, which can still use GPUs.
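The sharding arithmetic above can be sketched in a few lines. This is a plain-Python illustration (no GPU, and the sizes are made up): splitting the N rows into shards leaves each shard as an independent matrix-vector product, which is exactly the unit of work MPI can farm out, and the per-shard results concatenate back into the full product:

```python
# Sketch of the sharding idea: split an N x K data matrix into 100
# row-shards; each shard's matrix-vector product is an independent
# piece of work, and the partial results concatenate to the full N x 1
# product. Sizes here are arbitrary, for illustration only.

N, K, n_shards = 1000, 5, 100
X = [[float(i + j) for j in range(K)] for i in range(N)]  # N x K matrix
beta = [1.0] * K                                          # K x 1 coefficients

rows_per_shard = N // n_shards
shards = [X[s * rows_per_shard:(s + 1) * rows_per_shard]
          for s in range(n_shards)]

# Each shard is an independent (N/100) x K multiply -- the part MPI
# farms out, and each shard can still use a GPU for the multiply.
partial = [[sum(x * b for x, b in zip(row, beta)) for row in shard]
           for shard in shards]
full = [v for part in partial for v in part]

# Sanity check against the unsharded product.
direct = [sum(x * b for x, b in zip(row, beta)) for row in X]
print(full == direct)  # prints: True
```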

The killer example is ODEs, which we can’t solve using GPUs yet.

Exactly.

Bob, the question here revolves around the OpenCL async API, which is relevant not for multi-GPU setups but for multi-process, single-GPU setups. I was asking a more nuanced question than the one you answered: given that GPU performance scales directly and only with vectorization (i.e., sending more data to the GPU at once to be processed together), it often doesn’t make sense to have multiple processes using a single GPU if you could instead coordinate and put all of the data on the GPU together (like we can).

Thanks for explaining.

Let’s meet up irl this week and chat about this since I’m not totally sure how this would work.


Sounds good, looking forward to it. If anyone else is interested PM me or Steve and we’ll try to get something together this week.