MPI Design Discussion


At that link it suggests gatherv does not need to know in advance - that the children can send their sizes individually.


Hmm… I only got the API working for with knowing the sizes in advance. In any case, we still need to the output sizes per job - otherwise we cannot make sense of the essentially ragged matrix which we are getting back. I don‘t see any way around communicating this at some point.


The gatherv API call at the link communicates the sizes from the children to the root.

int MPI_Gatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, const int *recvcounts, const int *displs,
                MPI_Datatype recvtype, int root, MPI_Comm comm)

Not super familiar with this, but recvcounts and displs are both declared const in the call, so I suspect the children don’t actually send their sizes (cause where would that get written?).

I guess it seems strange that the asymmetry here makes room for a bug (worker sends more data than host is ready for)? But I think MPI is usually more manual than automatic.


I’m not sure I understand what you mean… According to the doc on that page, the workers and root all call this and the workers fill in sendbuf and sendcount and behind the scenes MPI transforms that into recvbuf and int *recvcounts.


behind the scenes MPI transforms that into recvbuf and int *recvcounts.

Recvcounts is listed as an input here:


What do they mean by “input” here?


I’m assuming input in the Fortran sense that the MPI_gatherv function isn’t gonna write to it (so the signature for the C is something like: const int recvcounts[]).


I am not sure that’s what they mean there. I think this example might show it being used as I mentioned:


In the example you point out they do exactly the two-step procedure which I was referring to. There are two gather calls. First a call to collect the string length from each node and then a gatherv call to actually collect the data. Knowing the sizes in advance allows us to save the first MPI gather call.


You guys are right!

How bad is the performance? I’d rather take care of this kind of stuff than expose it to the user if we can without much penalty. I suspect the extra network call adds a maximum of a dozen milliseconds to each iteration within anything resembling a datacenter, right?


This looks like a reasonable way to test how long gatherv latency is on your cluster

Though I have yet to find anything that actually lists specific results except one that said it depended on some specific hardware and quoted numbers around 0.03 vs 0.1ms:


To avoid the extra overhead we can

  1. force the user to tell us the output sizes… where I really do not see the problem as the output will be ragged such that the user must know the ragged structure in advance since he cannot make any sense of the output otherwise.

  2. we can cache the output sizes after the first invocation for all future calls.

I don‘t see a sensible use case for all of this where the output sizes vary from call to call.

There will be a performance penalty. For me it will be not so large given that I have infiniband available…but good old gigabit ethernet is a different story. In short the penalty is there and it will depend on a number of things which determine how bad it will be. Given the millions of calls I anticipate that it adds up.


I like the 2nd option! The user must know all kinds of things but if we don’t have a compelling case for collecting that information I’d avoid eliciting it as a general rule. I was asking about benchmarks to figure out exactly how strong the case was, but it seems like with your option 2 it’s definitely negligible, right?


Of course! My second option would make output size determination a once only operation. The cost of that is essentially zero; it‘s just a bit more tedious to program, but certainly doable.


Those static local variables Bob mentioned might make the programming a bit easier - C++ is constantly surprising me, haha.


So here is one attempt to put our discussions into dummy code:

We need to probably revise how we cache the data locally. Since the mpi_parallel_call class needs template arguments to work, we would as of now cache the data for every of the type combinations…which we probably do not want.

Anyway, have a look at the code and feel free to edit / comment in the branch.


Bump. After a bit of iteration, I think the current concept code linked above should be able to accommodate flexibly map_rect and map_rect_lpdf calls. Some things may certainly be improved which are marked by comments. I hope it makes sense to you , @seantalts.


Some high level comments:

  1. Where does (or will) the cache class get used?
  2. A bunch of it is still pretty specific to the fact that you’re using a hierarchical model (thetas and etas everywhere, not just the names but that you’re always passing two arguments for the collection). There can be a hierarchical model wrapper on top, but think of the map and reduce layer as being a general framework for parallel computation. In other languages with reduce that I’ve seen the reduction operation takes a single element from a collection. That element can itself be a tuple of arguments to be passed to some function internally. So on top of reduce we can build map_rect (our version for hierarchical models with rectangular data) and put the shared and local params in with data in a collection of tuples of references.
  3. Similarly for the cache- why does templating the stuff we want to cache mess with the cache? Each call-site could totally have its own, anyway - not like a call-site will sometimes have one type of arguments and then later have another, right?
  4. I was imagining that when you said “cache the ReduceF output sizes” you meant that you would do a normal ReduceF::apply and that your cached gatherv call for getting sizes would handle caching transparently. I think right now you have this as another static function on ReduceF?


Generally, I have in mind to design a facility which allows us to easily code up

  • map_rect
  • map_rect_lpdf

both with shared and job-specific parameters. I am not sure what would be the benefit of making it more general than that - you are always free to pass in zero-sized structures to disable either of the two features. At least for me it would make implementation of something more concrete easier. In particular I want to make things as fast as possible in its implementation which is easier to do if things are more concrete (from my perspective). Maybe we discuss this as an agenda item tomorrow?

Your points:

  1. The cache class is just a non-templated holder for the data caches. If you make these part of the templated mpi_parallel_call class then you would cache data separately for each variant we provide… so instead of a single cache we could end up with up to 4x caching the same data if we have all cases (dd, dv, vd, vv) being used in the program (dv = double shared / var non-shared). At least this is how I understand static variables in templated C++ classes. Maybe I am wrong here.
  2. As I say, to me it makes it a lot easier to be a bit concrete with shared and non-shared parameters and it makes implementation a lot easier. I would prefer to go with this for now and possibly generalize it later. The design proposed should be flexible enough to allow us to implement all the different cases efficiently. Moreover, the concreteness enables efficient use of MPI building blocks.
  3. As described above, I want to support that a double+double instantiation and a double+var instantiation leads to using the same cached data instead of duplicating the data.
  4. Since I want to minimize copying things around for each function invocation, I need to pre-allocate all memory at once. Hence, I need the information on output sizes already before the gather call.

What you seem to be looking for are things supported by C++17 ( However, I don’t think we need this; certainly not in a first version. As I said, my preference is to go with something more concrete as I proposed and generalize it at a later stage.