For MPI I am essentially relying on singletons, just as you do. Boost MPI provides a communicator object which it automatically manages for me. This communicator "knows" within each process which rank that process has. The root node has rank 0 and the workers have ranks 1 through N - 1 (for N processes in total).
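For illustration, here is a minimal Boost.MPI sketch (not the actual Stan code) of how each process learns its rank from the automatically managed communicator:

```cpp
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <iostream>

int main(int argc, char* argv[]) {
  boost::mpi::environment env(argc, argv);  // initializes MPI, finalizes on destruction
  boost::mpi::communicator world;           // wraps MPI_COMM_WORLD

  if (world.rank() == 0)
    std::cout << "root, " << world.size() - 1 << " workers" << std::endl;
  else
    std::cout << "worker with rank " << world.rank() << std::endl;
  return 0;
}
```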
The other place where I introduced additional singletons is "static data". Since the main bottleneck for MPI is the communication between processes, my goal is to minimize the data exchange. Hence, I introduced the concept of static data, which is transmitted only once from the root to all workers; after that first call to a function, the static data is cached locally on the nodes and never transmitted again.
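To make the idea concrete, here is a rough sketch of that caching scheme, assuming each function call carries an integer call id that identifies its static data (the names `get_static_data` and `static_data_t` are made up here, not what the codebase uses):

```cpp
#include <boost/mpi/communicator.hpp>
#include <boost/mpi/collectives.hpp>
#include <boost/serialization/vector.hpp>
#include <map>
#include <vector>

using static_data_t = std::vector<double>;

// Sketch of the static-data idea: the root broadcasts the data for a given
// call id exactly once; every process caches it in a process-wide store, and
// all later calls hit the cache instead of going over MPI again.
static_data_t const& get_static_data(boost::mpi::communicator const& world,
                                     int call_id,
                                     static_data_t const* root_data = nullptr) {
  static std::map<int, static_data_t> cache;  // per-process singleton cache
  auto it = cache.find(call_id);
  if (it != cache.end())
    return it->second;  // already transmitted once; no further communication

  static_data_t data;
  if (world.rank() == 0)
    data = *root_data;                    // only the root supplies the payload
  boost::mpi::broadcast(world, data, 0);  // one-time transfer to all workers
  return cache.emplace(call_id, std::move(data)).first->second;
}
```

All processes have to enter this collectively with the same call id for the broadcast to match up, which is exactly the pattern of a dispatched map_rect call.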
The static data concept could be of interest for GPU computing as well, I suppose. Do you already have that in mind? I can point you to the parts of the MPI codebase which handle this, and hopefully it makes sense for GPUs as well.
Now, from my perspective MPI and GPU stuff can nicely coexist if the GPU (or multi-GPU) operations just happen within a given process. These are the use cases I can imagine:
1. Per MPI process we have one (or multiple) GPUs; then the GPU operations can be part of what happens inside a map_rect call which we dispatch to the workers over MPI (see the sketch after this list).
2. Only the root has one (or multiple) GPUs and the workers do not have a GPU; then the GPU operations need to run on the root without any MPI parallelization.
3. One could imagine splitting a huge GPU operation over multiple machines by using MPI within the GPU operation. I do not think that this is very useful due to the likely very large communication burden. Also, such an operation can probably be split up such that it falls under scenario 1.
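For scenario 1, a bare-bones sketch of what "every process grabs a GPU" could look like, assuming the GPU backend is OpenCL and that all ranks on one machine see the same device list (`pick_device_for_rank` is a made-up helper):

```cpp
#include <CL/cl.h>
#include <vector>

// Hypothetical scenario-1 helper: each MPI process uses its rank to pick
// "its" GPU among the devices visible on its machine, round-robin style.
cl_device_id pick_device_for_rank(int rank) {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);

  cl_uint n_gpus = 0;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, nullptr, &n_gpus);
  std::vector<cl_device_id> gpus(n_gpus);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, n_gpus, gpus.data(), nullptr);

  return gpus[rank % n_gpus];  // map ranks onto the local GPUs
}
```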
Now, the question is what this means for programming. The MPI processes themselves only know their rank; any other info (which machine they run on, which GPU to use, etc.) would need to come from environment variables, which can hopefully be set via hooks that are part of the MPI startup process (or we use some Stan configuration mechanism, just as we configure the sampler, etc.). That probably means the MPI initialization scheme and the GPU resource allocation need to be aligned in some way. For scenario 1 every MPI process would try to grab a GPU; for scenario 2 only the root would grab a GPU; and scenario 3 is not in scope, I think. That also means for the GPU code that the decision to use the GPU or not is a runtime decision and not a compile-time-only thing.
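As a sketch of how that runtime decision could look (the environment variable name `STAN_GPU_MODE` is made up here, purely to illustrate the mechanism):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical runtime switch: an environment variable set during MPI
// startup tells each process whether it should grab a GPU.
// "all" = scenario 1, "root" = scenario 2, anything else = CPU only.
bool use_gpu_on_this_rank(int rank) {
  const char* mode = std::getenv("STAN_GPU_MODE");
  if (mode == nullptr)
    return false;  // default: CPU only
  std::string m(mode);
  if (m == "all")
    return true;          // scenario 1: every rank gets a GPU
  if (m == "root")
    return rank == 0;     // scenario 2: GPU only on the root
  return false;
}
```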
My expectation is that scenario 2 is the most common one, since most GPU operations pay off whenever lots of parameters are involved (is that right?). Shipping those parameters between processes using MPI would be inefficient.