@wds15 how does the master/slave relationship work with a singleton in global scope on the master? Is that passed along automatically or not?
We are currently making the OpenCL context a singleton with lazy initialization**. What I would imagine is that when the process is handed to a slave, it does not carry over an OpenCL context the master has already created. I think this is what I want: when the slave calls a GPU function, it creates the context on that slave and compiles the kernels it needs.
I don’t know much about clustering with MPI, so any context on how you think we can make these two projects work together would be awesome. Dan asked me last night about my vision for the GPU project, and I agreed with what Bob said earlier: I want users to be able to execute a Stan program over a cluster with multiple GPUs. That’s pretty exciting to think about!
** The context is created the first time a GPU function is called, and kernels are compiled in groups (i.e., if a user calls add_gpu, the kernels for both add_gpu and subtract_gpu are compiled).
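To make that concrete, here is a rough sketch of the kind of lazily initialized context singleton I mean (the class name, the kernel grouping, and the kernel sources are purely illustrative, not our actual implementation):

```cpp
// Sketch only: illustrative names and kernels, not the real Stan Math code.
#include <CL/cl.hpp>
#include <map>
#include <string>
#include <vector>

class opencl_context {
 public:
  // Meyers singleton: constructed on first use, so every process
  // (master or slave) lazily builds its own context and command queue.
  static opencl_context& instance() {
    static opencl_context ctx;
    return ctx;
  }

  // Kernels are compiled in groups: asking for "add" builds the program
  // containing both "add" and "subtract" and caches both kernels.
  cl::Kernel& get_kernel(const std::string& name) {
    if (kernels_.count(name) == 0)
      compile_arithmetic_group();
    return kernels_.at(name);
  }

 private:
  opencl_context() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices_);
    context_ = cl::Context(devices_);
    queue_ = cl::CommandQueue(context_, devices_[0]);
  }

  void compile_arithmetic_group() {
    // Hypothetical kernel group: add and subtract share one program.
    static const std::string src =
        "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
        "__kernel void add(__global double* a, __global const double* b) {\n"
        "  int i = get_global_id(0); a[i] += b[i]; }\n"
        "__kernel void subtract(__global double* a, __global const double* b) {\n"
        "  int i = get_global_id(0); a[i] -= b[i]; }\n";
    cl::Program program(context_, src);
    program.build(devices_);
    kernels_["add"] = cl::Kernel(program, "add");
    kernels_["subtract"] = cl::Kernel(program, "subtract");
  }

  cl::Context context_;
  cl::CommandQueue queue_;
  std::vector<cl::Device> devices_;
  std::map<std::string, cl::Kernel> kernels_;
};
```

Because the singleton lives separately in each process, a slave that never received the master’s context would simply build its own the first time a GPU function runs there.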
For MPI I am essentially relying on singletons just as you do. Boost MPI provides a communicator object which it manages automatically for me. Within each process, this communicator “knows” which rank it has. The root node has rank 0 and the workers have ranks from 1 to the total number of processes minus 1.
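For concreteness, this is roughly what that looks like with boost MPI (a generic sketch, not the actual Stan Math MPI code):

```cpp
#include <boost/mpi.hpp>
#include <iostream>

int main(int argc, char* argv[]) {
  boost::mpi::environment env(argc, argv);  // handles MPI_Init / MPI_Finalize
  boost::mpi::communicator world;           // the communicator boost manages for me

  if (world.rank() == 0)
    std::cout << "root, with " << world.size() - 1 << " workers" << std::endl;
  else
    std::cout << "worker with rank " << world.rank() << std::endl;
  return 0;
}
```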
The other aspect for which I introduced additional singletons is “static data”. Since the main bottleneck for MPI is the communication between the processes, my goal is to minimize the data exchange. Hence, I have introduced the concept of static data, which is transmitted only once from the root to all the workers; after that first call to a function, the static data is cached locally on the nodes and never transmitted again.
The static data concept could be of interest to GPU computing as well, I suppose. Do you already have that in mind? I can point you to the parts of the MPI codebase which handle this, and hopefully it makes sense for GPUs as well.
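Roughly, the idea looks like this (the names here are made up for illustration; the actual handling lives in the MPI code):

```cpp
#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>
#include <vector>

// Hypothetical cache for "static data": each process keeps its slice in a
// static member, so it is received exactly once and reused afterwards.
struct static_data_cache {
  static std::vector<double> local_chunk;  // cached slice on this process
  static bool initialized;

  // chunks is only used on the root and must contain one slice per process.
  static const std::vector<double>& get(
      const boost::mpi::communicator& world,
      const std::vector<std::vector<double>>& chunks) {
    if (!initialized) {
      if (world.rank() == 0)
        boost::mpi::scatter(world, chunks, local_chunk, 0);
      else
        boost::mpi::scatter(world, local_chunk, 0);
      initialized = true;  // never transmitted again after the first call
    }
    return local_chunk;
  }
};

std::vector<double> static_data_cache::local_chunk;
bool static_data_cache::initialized = false;
```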
Now, from my perspective the MPI and GPU stuff can nicely coexist if the GPU (or multi-GPU) operations happen entirely within a given process. These are the use cases I could imagine:
1. Per MPI process we have one (or multiple) GPUs; then the GPU operations can be part of what happens inside a map_rect call which we dispatch to the workers over MPI.
2. We only have one (or multiple) GPUs on the root, but the workers do not have a GPU; then the GPU operations need to run on the root without any MPI parallelization.
3. One could imagine splitting a huge GPU operation over multiple machines by using MPI within the GPU operation. I do not think this is very useful because of the likely very large communication burden; also, such an operation can probably be split up so that it falls under case 1.
Now, the question is what this means for programming. The MPI jobs themselves only know their rank; any other info (which machine they are on, which GPU to use, etc.) would need to come from environment variables, which can hopefully be set using hooks that are part of the MPI startup process (or we use some Stan configuration mechanism, just as we configure the sampler, etc.). That would probably mean that the MPI initialization scheme and the GPU resource allocation need to be aligned in some way. For scenario 1 every MPI process would try to grab a GPU; for scenario 2 only the root would grab a GPU; and scenario 3 is not in scope, I think. That would also mean for the GPU code that using the GPU or not is a runtime decision and not a compile-time-only thing.
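A very rough sketch of that runtime decision (the environment variable name STAN_NUM_GPUS is purely hypothetical; how such a value actually gets set is exactly the open question above):

```cpp
#include <boost/mpi.hpp>
#include <cstdlib>

// Hypothetical runtime check: should this MPI process use a GPU?
bool use_gpu_on_this_process(const boost::mpi::communicator& world) {
  const char* env = std::getenv("STAN_NUM_GPUS");  // hypothetical variable
  const int gpus_here = env ? std::atoi(env) : 0;
  if (gpus_here == 0)
    return false;                // no GPU available to this process
  // Scenario 2: only the root grabs a GPU.
  // Scenario 1 would instead return true for every rank.
  return world.rank() == 0;
}
```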
My expectation is that scenario 2 is the most common one since most GPU operations are good whenever lots of parameters are involved (is that right?). Shipping these between processes using MPI would be inefficient.
Apologies for the delay; I wanted to spend time catching up on all the MPI conversation before I replied.
Would love to see that! Our next PR is going to be working with data, so this would be super nice to have.
> 1. Per MPI process we have one (or multiple) GPUs; then the GPU operations can be part of what happens inside a map_rect call which we dispatch to the workers over MPI.
> 2. We only have one (or multiple) GPUs on the root, but the workers do not have a GPU; then the GPU operations need to run on the root without any MPI parallelization.
> 3. One could imagine splitting a huge GPU operation over multiple machines by using MPI within the GPU operation. I do not think this is very useful because of the likely very large communication burden; also, such an operation can probably be split up so that it falls under case 1.
(1) and (2) would be handled in the Stan language, where users would call func_gpu(blah) for (2) vs. map_parallel_rect(func_gpu, blah, blah) for (1). Is that correct?
> That would probably mean that the MPI initialization scheme and the GPU resource allocation need to be aligned in some way.
I think this would mean the GPU configuration initialization has to be fully lazy, so that when a GPU function is called on a slave it looks for a GPU configuration and, if one doesn’t exist, builds it. Otherwise you would have to call the GPU configuration initializer with MPI on each slave before calling any other GPU functions. (Assuming that instance would continue to exist on each slave.)
> My expectation is that scenario 2 is the most common one since most GPU operations are good whenever lots of parameters are involved (is that right?)
That’s correct! I have a feeling people are going to be interested in (1), though. With static data, would those transfers be as painful?
Technically I am creating a type for the static data and things are stored using static member variables.
Yes, correct.
What if multiple MPI processes run on a single machine with many cores and many GPUs? In that case we need to somehow tell the system how things are allocated. Case 2 is easy: GPU calls happen only in the root process. Case 1 is more involved, as then it must be clear which MPI process grabs which GPU. I have read somewhere that CUDA can manage this, but that is all beyond me.
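One common pattern (just a sketch, in OpenCL rather than CUDA, and not something either project implements yet) is to pick the device index from the process rank:

```cpp
#include <CL/cl.hpp>
#include <boost/mpi.hpp>
#include <vector>

// Sketch: each MPI process on a multi-GPU machine grabs one device.
// Assumes at least one GPU is visible; a real multi-node setup would use
// the per-node local rank (from the MPI launcher) instead of the global rank.
cl::Device pick_device_for_rank(const boost::mpi::communicator& world) {
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  std::vector<cl::Device> gpus;
  platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &gpus);
  return gpus[world.rank() % gpus.size()];  // round-robin over local GPUs
}
```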
Static data is essentially free in the MPI setting. It is sliced on the root into the correct chunks and then distributed to the workers, a single time only. This is kind of a lazy evaluation scheme.