MPI Design Discussion


That’s probably a good idea - maybe we can explain it to Bob and the others and see what they think about the tradeoff of shipping something specific to hierarchical models with versus without more general abstractions underneath. FWIW I think we could do it without sacrificing much performance (tuples are compile-time constructs) - it’s more of a code organization and re-use thing. I am optimistic that we will soon have many more use cases for parallelism in Stan :)

  1. I was wondering where the cache is going to be used? It isn’t used in the code as presented so far, not sure if that was an oversight or if it will be used somewhere else in the future?
    1a I would template the mpi_parallel_call_cache on the data types, not on the parameter types - these should be the same in all runs, right?
  2. The design can implement anything we throw at it if we template it - we don’t need separate arguments for hierarchical-model specifics in the general-purpose building blocks we’re making underneath. Given that choice, I think it’s good to have something more general underlying it, if we can do that without too much performance loss.
  3. Is there some way to avoid adding another function to the ReduceF type? Run it the first time with a dummy or something?



To answer your points:

  1. I haven’t referred to the cache in the dummy code yet, but it will be used. If you don’t mind I would like to make all data stuff nested vector types and all parameters Eigen vectors/matrices. This makes it easier to code up for me.
  2. As discussed yesterday, let’s move forward with a more specific design for the time being. However, if you find a way to quickly adapt things (example code), then I am happy to reconsider.
  3. No question here, right?
  4. I am not sure what you mean… I do run ReduceF with a dummy. However, it cannot be an arbitrary dummy; we can only use the parameters given by the user. Otherwise we have a high chance of providing values to the user function which lead to misbehavior (the user function could require parameter values to be greater than 0, for example).

I think / hope I now have all the pieces complete in my mind, and I should soon have a working prototype.

The next steps then are to define tests (stan-math), start on the language (stan), the build system (all) and also on making sure that all boost::serialization macros are generated during the stanc compilation phase.


I think to the extent we can easily generalize, we should - I understand that C++11’s compile-time tuples aren’t totally obvious (so having the non-templated ReduceF::apply with extra arguments is fine for now), but I think we can still template the cache, right? The data types should be unchanged between iterations, and the cache should not differ if the data used do not differ, from what I can see.

And for point 4, I was just hoping to get your help brainstorming some way we can avoid having an ugly extra requirement for the reduction operation’s interface. I am not totally sold on this “write it directly into place” part of the interface for ReduceF::apply, since we will then CombineF everything anyway, right? Seems like we can use the more functional approach and use move semantics if need be?

On a lower level for a second, why does CombineF::apply take in eta and theta? And could you change their names to something indicating their purpose, like shared_params and job_specific_params (or something like that)? If we’re going to use the interface here for things besides hierarchical models we should name it that way :)


The cache can go either way; I mostly just want to make sure the interfaces (specifically what is required of the reduce step) are easy to use. I think we’re impacting the API negatively due to premature performance improvements - my general rule is to write code that is correct and has a nice API, then profile to see where the hotspots are. Premature optimization is the root of all evil, as Knuth says, and I have almost never seen someone correctly guess all of the performance hotspots in a real code base - they typically miss the most important areas for optimization. Examples here are passing sizes and memory addresses instead of returning values or references. Low-level work like that forces the compiler into a conservative dataflow analysis that precludes most optimizations. Not dissimilar from all of those examples showing how modern compilers beat hand-rolled assembly most of the time.


Hmm… I see. I don’t trust compiler magic that much - which is sort of obvious I suppose. However, if you have just volunteered for some profiling then I am happy to go for less low-level stuff compared to what I have put up there.

The compiler needs to figure out quite a few things here to avoid the wasteful copying that C++ code is really prone to. At least that was the case years ago.

Now, even if we go with more generic versions… how do you handle the cached data and would ReduceF also be in charge for shipping data? I am afraid I don’t see how things can be made to fit together in such generality.

I suggest we go forward with some level of specificity and general names, but stay away from possibly premature low-level optimization - I would really appreciate it if we could profile these things. Does that make sense?


Yeah, I think that’s fair. We can profile+optimize and abstract later when need be :)


…profiling MPI code will be lots of fun…

Anyway, I have pushed the concept branch a lot further.

I have chosen to make the rectangular input to the function always column major. In addition, all the data-only inputs are column major. This makes it a bit easier for me to abstract things, but it is really ugly to have those nested vectors there and to traverse them in non-memory-local order. However, as this transposing is only done once in MPI mode, I thought it was OK to do…

Feel free to take another look @seantalts … as a next step I want to add the final bits & pieces to make it run, then add the first basic unit tests for the double/double version and then for the mixed cases. Unless you have many things to move around, we can slowly embark on the next blocks (testing, building, Stan language).



I refactored the interface to be more in line with other Stan standards and now things should be very close to what you had in mind (modulo using tuples). So it would be great if you could have a look at things.

If you agree on the interface of the non-MPI version of map_rect then we can file an issue for that, see here:

There is also a test for the non-MPI version which simply calls that function. There is no error checking yet…

The non-MPI version could be managed by others who wish to help.

I now have a first prototype of the MPI version ready:

The code should run, but I haven’t done that yet… will do that ASAP along with getting the rev version up such that I can benchmark this piece of code. Once the benchmarks are fine (and everyone is on board with the design), we can proceed with inclusion.



Sweet. Some questions on the non-MPI code:

  1. Aren’t the shared and job params the same type? Do you need a type parameter for both of them for some use case?
  2. I like the way you’re adaptively figuring out how big the return matrix should be. But can we figure it out on the first iteration by multiplying the size of the first f by num_jobs and stop resizing on future iterations?
  3. Why is the loop condition i != num_jobs instead of i < num_jobs?


Let me answer:

  1. In most cases the shared and job parameters will have the same type (either both double or both (f)var). However, in a few cases, or just for user convenience, I would allow them to differ.

  2. We can certainly take the first evaluation as a hint for the size of all further outputs, but the user is allowed to return differently sized outputs for each job. As such, we can only make a smart guess and resize if needed. My current guess was 1 element per job, which will be a worse guess than using the first evaluation as a hint. Let me change that.

  3. I am used to writing i != num_jobs in loops as this leads to faster code. With smarter compilers this may not be noticeable, but != is evaluated faster than < - at least that is what I read about it years ago.

Any other points or can we proceed with this interface from your perspective?


I remember talking about allowing that (in bullet 2) but forgot which way we went, woops.
I think this is basically what we talked about at that first meeting months ago right? I’m good to go with this. Will take a look at the MPI part today, though I suspect we’ve hammered out most of the high level design stuff :)


We allow the output to be a ragged array, but the input is rectangular.

The current design is close to the high-level design we started with, in that you give the function a shared parameter object and three instances of a

vector< each job’s stuff >

The first vector is for job-specific parameters, the second for job-specific real data and the third for job-specific int data. The actual function (ReduceF) and return type and how things are combined (CombineF) are left flexible.

All we did was nail down that one always has these elements, but the mpi_parallel_call machinery can be adapted to many things. What is specific is the function signature and the handling of function outputs on the basis of Eigen column vectors. Making these things concrete enables a straightforward implementation, I think.

Curious about your thoughts on the MPI stuff… I have never used operands_and_partials, so I hope my CombineF for map_rect makes sense. All that is left to do is provide templates for the var and fvar cases, which we should maybe template based on the return type.


(3) is way too premature an optimization – it’ll never have an effect on performance given the significant overhead here, and < is much easier to read.


I just put down the reasons for my choices. In this case there is no performance difference. However, I also prefer the != notation here for consistency with iterator loops, which are usually written with current_iter != end_iter as the stopping condition.

To me these types of discussions waste a lot of our energy - but that is just my opinion.


What is the memory-local order you need? Is there a reason you can’t store the data that way? You don’t want to do this kind of transpose more than once if you don’t have to, unless the data sizes are small.

It’s odd to see this mix of Eigen and standard vector types. Are the std::vector wrappers just to get the versions for each of the parallel inputs? If so, I don’t see the transpose you said you needed.

I can build the non-MPI version if you need a code monkey. This is the size of project I could actually finish!

This is also the general assumption of all of our input functions—some things can be data and others parameters.


Hi Bob!

I have abandoned the transpose thing already for various reasons. I wanted parameters to be in column major format as the Eigen matrices are column major and we use these for transmitting things over MPI. The current proposal has the arguments

  • column vector shared parameters
  • array of column vectors for job parameters (so a vector<Eigen::VectorX?>) which resembles what we treat as multi-variate input into things like the multi_normal
  • array of real array (vector<vector>) as real data
  • array of int array as int data

This makes everything nicely consistent, I find, but it does force me to always convert the parameter input from row-major to column-major format before transmission. This is probably OK since I have to loop over the parameters anyway in order to convert them from var to double.

Thanks for taking over the non-MPI version! Much appreciated. This will already allow us to start building the special requirements for the stanc compiler later on (remember that the functor is passed into map_rect as a type parameter). Once Sean gives the green light on the mpi_parallel_call stuff and once I have finished some benchmarks on this design, we can go ahead with all of this. I really want to know that the performance is good before hammering it into our code base.


I’m just doing the design review, but I’d say we want to generally follow the conventions in the rest of the Math repo (defaulting to Google’s style guide when consistency lacks). Maybe we want to change to iterators at some point but we can do that more easily if there’s just one idiom to change from, and in the meantime a single idiom is easier for others to follow. Seeing a different idiom like this makes everyone reading it try to figure out what the functional difference could be that caused the diversion from prevailing convention.

More broadly, the main advance of software engineering has been recognizing that the most important performance to optimize is the performance of other developers reading your code now and in the future. Trade off against this sparingly and judiciously, ideally with explanatory comments.

Design comments (numbered for easy reference):

  1. The names are helping me a lot, thanks!
  2. The member integer type parameter to mpi_parallel_call_cache is a neat trick! But is there an easier way to get multiple copies of this struct? For example, can mpi_parallel_call_cache instead be more of a struct or class that is meant to be instantiated (i.e. nothing marked static), and then mpi_parallel_call can instantiate 4 copies of it statically as private members?
  3. Doesn’t ReduceF need the shared params and job params, not CombineF?
  4. Can we make those cached scatterv and broadcast calls that we talked about a while back? setup_call should be much shorter, just a few calls to cached scatterv and broadcast, which we need to define. The mental framework here is that you present an abstraction that just says something like send or retrieve data, and the caching is hidden from anything that needs to use those functions. So you just call send over and over again and you can rely on the implementation to do something performant. Hiding implementation of these lower-level things is what makes working on software possible - imagine if I had to think about register allocation while writing code to generate histograms! The cache_data function is a great step in that direction, but I think we can make something that we can use for sending job and shared params too.
  5. I don’t have a good mental model for why we need to broadcast callsite_id… Shouldn’t this be statically embedded in the source code present on all of the nodes? MPI is still pretty weird to me, sorry!
  6. I haven’t yet internalized what all the transposes and matrix loops are doing in combine and reduce, but for operands_and_partials I think you can assign to an entire partial derivative (or some block segment of it) at a time, if that helps


I am fully on board with writing code for the project and not for myself which means to follow common standards - really fine by me. Just let me know when needed.

I was hoping for a green light, but I got a yellow one as it feels… so my comments:

  1. I hope the names are fairly consistent.
  2. We cannot make the cache 4 instantiated members of the templated class. We want a cache which ignores the ReduceF and CombineF template parameters. Otherwise we would cache data multiple times for a given callsite as the function gets called with double/double or var/var or whatever. See here:
  3. ReduceF gets double only versions of the parameters. CombineF gets the original types to have access to the operands which need to be registered on the AD stack on the root.
  4. I don’t see how to make it more general: parameters are handled in Eigen structures and data is handled with nested vectors. This choice implies that I have to do different things for each, which is fine from my perspective since we want to cache the data but not the parameters. Moreover, the shared parameters are broadcast anyway, so the scatter you ask for would just include the parameters in addition. But then, I do turn the parameters into an Eigen matrix, which is really trivial to scatter given that it is column major. I think cache_data is really close to what you wanted; it’s just applicable only to nested vector stuff, not to mixed things like the vector of parameters.
  5. The MPI processes are completely separated! I have to tell the workers literally everything unless I can cache it. However, the callsite can never be cached as this is needed to define the context and load the cached things.
  6. I expect the ReduceF for map_rect to return a matrix with as many columns as outputs and as many rows as we get for each output. For each output we always get the function value, then (if needed) the partials with respect to the shared parameters, and then (again if needed) the partials with respect to the job-specific parameters.

I have just pushed a version to the repo which includes a running mpi version for the prim case. I should be able to add the rev cases soon.


I grok the issue you linked to, but I was imagining it wouldn’t affect us because I thought a single callsite would always have the same template parameters? Can a single callsite make map_rect calls with different types? I would imagine they’re always the same unless the code containing the call is itself templated, and we wanted to share the cache across those instantiations too…?

Gotcha, I think that makes sense.

For 4, I was thinking we could do something like this NiceMPI layer that can efficiently transmit any PoD type: And then add our own caching layer?

5 - Doesn’t a copy of the program get distributed to every node? So every node has that callsite id integer input as a literal in the source code when it starts up, right? I’m mostly going off of tutorials like this one and have not actually experimented with this in a multi-node setting, apologies.


For 2: In a given Stan program the map_rect function will be called with double-only types at initialization (to calculate the log-gradient value) and then there will be calls to it with var types. The callsite is the same, but the types have changed, and we would cache twice.

For 4: I think we should not derail here. We want within-chain parallelization of Stan programs, not a nice MPI API, for the moment (we still want a nice internal design, which we have by now). Moreover, we can transfer stuff as POD quickly, but we want to cache vector-of-vector things. What I am saying is that what we have is not so bad; making it even smarter causes quite a headache, and it is a side goal.

For 5: Every node has a copy of the program, yes. The way we currently design things is that we put all the workers into an always-listening mode. So the program on a child node starts and then listens for commands sent by the root. Hence, the children do not have any of the data loaded (which is why we need that caching). There are two entry points into mpi_parallel_call: the constructor with many arguments and the constructor with just the callsite. The main constructor is called on the root, and it sends a broadcast message to the children telling them to call the static method distributed_apply of mpi_parallel_call. This method then listens for the callsite_id (which the main root constructor sends) and creates the mpi_parallel_call on the children. The setup_call routine on the root sends all the data, while the same setup_call function on the children receives it.

Unless you feel strongly about more abstraction for cached scattering, I would proceed now. For me that means putting down a rev implementation of CombineF for map_rect and then running some performance benchmarks. Once that is all good we are getting very close to inclusion. That’s my suggestion for how to proceed.