MPI Design Discussion


Good point. This seems like a decent way to deal with that for now.

We all want to get MPI in as soon as possible. We’re very excited to be able to scale out in a way we never could before, and doing that with generality is very exciting. The design review is exactly about the internal design and architecture - making it first and foremost easy to understand for maintenance. This involves creating layers of abstraction that hide details away so someone who doesn’t know the codebase can come in and understand a minimal subset of the layer they need to work on. We can have a conversation about the best way to do that, but that is the objective here.

For 5 - Gotcha. The MPI stuff I looked at didn’t have this same client-server style architecture. I think I understand why we want that - we need the root node to guide HMC exploration and the children to do whatever work the root decides needs to be done, so the root could be branching based on parameters or what-have-you. Cool.


For point 5: Yes, the MPI design we have put up essentially follows what others call a thread pool; we just do it across different machines. And yes, the architecture is very flexible in that we have a controlled way to put workers into the states we want, whenever we want, which is what we need for HMC.

Here is why I think we are done: the current design packs data (which we want to cache) into nested vectors and parameters into vectors of Eigen column vectors (the shared ones are just an Eigen column vector). The parameters are always converted to an Eigen matrix internally. Now, the boost mpi library is really easy to use when you just transmit a chunk of memory, which is what an Eigen matrix is (and we never want to cache parameters). So I don’t see a need for additional abstraction here. Our data, on the other hand, requires flattening to a 1D array, transmission, and then reshaping back to 2D, after which we want to cache the 2D data. For this we have the cache_data function. I don’t see a need for more. The broadcast operation is, to me, not worthwhile to abstract; it is already very simple.
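Just to make the data path concrete, here is a minimal sketch of the flatten/transmit/reshape round-trip described above (the function names are illustrative, not the actual cache_data internals):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: nested 2D data must be flattened into one contiguous
// buffer before an MPI transfer and reshaped back to 2D on the receiving
// side, where the 2D form is what gets cached.
std::vector<double> flatten(const std::vector<std::vector<double>>& data) {
  std::vector<double> flat;
  for (const auto& row : data)
    flat.insert(flat.end(), row.begin(), row.end());
  return flat;
}

std::vector<std::vector<double>> reshape(const std::vector<double>& flat,
                                         std::size_t rows, std::size_t cols) {
  std::vector<std::vector<double>> data(rows, std::vector<double>(cols));
  for (std::size_t i = 0; i < rows; ++i)
    for (std::size_t j = 0; j < cols; ++j)
      data[i][j] = flat[i * cols + j];
  return data;
}
```

The round-trip is lossless as long as the row/col sizes are transmitted alongside the payload, which is why the size meta-info needs its own broadcast.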

So from my perspective we have abstracted things where it’s needed, and we deal with the boost API as-is where that is fine (for me). Moreover, we agreed with the group to build a map_rect, not more. At this point we already have much more:

  • an MPI version of a thread pool to control workers
  • transparent caching of static data
  • a rather general mpi_parallel_call which can be used as a backbone for things other than map_rect
  • the generality of mpi_parallel_call enables easy implementation of var/fvar/mixed cases

To me we are already way above what we aimed for, and moreover I see the data distribution stuff as an internal part of mpi_parallel_call. However, you seem to disagree, so please comment on the above design points and, if you can, suggest solutions that would be even better. Maybe this is another point to discuss in the group, to decide between generality and specificity of the design towards map_rect. Not sure.


This is a good point. But it might still be nice to abstract scatter such that you can just call my_scatter(job_params); my_scatter(shared_params) in a way similar to how the cache_data method is set up, but without caching. So you can kind of wrap up and cache the meta-info broadcast about sizes and the resizing before and after transmission (though you’ll need broadcast for shared_params and scatterv for job_params). This separation would also cost us an extra broadcast at setup time. Basically it just extends those collective communication commands to serialize and deserialize (a couple of) C++ types. Boost’s MPI serialization interface had a decent idea, if an apparently poor implementation (NiceMPI being an example of a much better implementation of a similar API). The code to do this exists; I don’t expect wrapping it in a function to take too much work.
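To illustrate the kind of meta-info such a my_scatter wrapper would need to broadcast once at setup time, here is a hedged sketch (all names hypothetical) of the per-worker chunk plan that a scatterv over the job-parameter columns requires:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: the counts/offsets a scatterv needs up front when
// distributing num_jobs columns of job parameters over num_workers ranks.
// This is exactly the size meta-info that would be broadcast at setup time.
struct ChunkPlan {
  std::vector<int> counts;   // columns assigned to each worker
  std::vector<int> offsets;  // starting column for each worker
};

ChunkPlan plan_scatter(int num_jobs, int num_workers) {
  ChunkPlan p;
  p.counts.resize(num_workers);
  p.offsets.resize(num_workers);
  int offset = 0;
  for (int w = 0; w < num_workers; ++w) {
    // spread any remainder over the first num_jobs % num_workers workers
    p.counts[w] = num_jobs / num_workers + (w < num_jobs % num_workers ? 1 : 0);
    p.offsets[w] = offset;
    offset += p.counts[w];
  }
  return p;
}
```

Once this plan is cached on all ranks, subsequent evaluations can call scatterv directly without re-broadcasting sizes.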

I think we’re building something great here that will be pretty important for Stan for years to come, and I am excited about the separation of concerns we’ve already achieved. I think we’ve gotten it split up into smaller chunks that are easier to deal with individually, and we’ll better support future parallelism as well. Thanks again for going through all of this with me.


You are right that the meta-info about the sizes of the parameters is actually treated as static data. However, I am not sure we can write a single function for it: job parameters require scattering a matrix, while shared parameters require broadcasting a vector. So it will be two functions anyway. If you think that is a win over what is there already, then it should be doable, yes - but you are not going to cut down much of the code.

And yeah, I can’t wait to apply this MPI stuff to projects. I am willing to do model development based on dev code simply because it will make such a major difference in speed. Iterating on my models goes from a multi-day operation down to iterating within ~1h.



I think if we had about the same amount of code but broken up into smaller pieces (as long as those pieces are modular and make sense on their own, which is open to discussion) that would still be a nice win - it can help in thinking about the problem to develop chunks of abstraction. I think a scatter (and broadcast) function overloaded for different types might be a decent way of breaking that up modularly? But you’re deeper in this than I am.


If it’s a large matrix, this can cause a lot of overhead. More so than constructing a var for each on the autodiff stack. Just try profiling for different size matrices even without autodiff. At a certain point, you’re going to blow cache and it’s going to really slow down.

A headache for whom? If it makes the code a lot easier to read, that may be worth it. This is related to Sean’s comment, which gets at the key to maintainability (and to a large degree, testability):

So if an abstraction makes things easier to test and isolates some difficult code and makes the rest easier to read, it can be worth it. Sean gets at that here:

The challenge here is that the developer of a piece of code can see its structure much more clearly than someone else coming in to look at it, but the code needs to be written for that next person.

Sean—is this something we can retrofit later?

This is one of the keys to the readability issue. It’s much easier to read the name of a function and look at its arguments than to try to infer that from a block of code. This is precisely one of those places where it’s much harder for the reader of the code to get a picture of what’s going on than the author.


Hi Bob!

I am bad with discourse quoting so let me refer to points:

  1. Format of the input parameters as a vector of column vectors: probably the most efficient input to the function would be an Eigen matrix for the parameters in column-major format and a row-major vector of vectors for the data; but this is very confusing to the user. Moreover, I will anyway have to loop over the var input parameters and turn them into a double-only representation in order to send them over MPI… or can I treat a matrix_v like a matrix_d from a memory-layout perspective? A matrix_d object params lets me access the parameters in a contiguous way. If I have a matrix_v params_v - can I expect that it points to a contiguous chunk of memory holding all the double values of the parameters? If so, then this would be very interesting.

  2. @seantalts is requesting to refactor functions which are private and not exposed. I acknowledge the point that the code is not written “for me” and as such modularizing things is a good thing. However, private stuff is probably not required to be super-modular. At least this is what I think (still it needs to be readable to others, yes).

Wrt. the readability of the code and the later review: it will probably help a lot if I write a short PDF on what factors are important for this problem and how they went into the design. I don’t think the code itself can ever be that self-explanatory, given the MPI peculiarities. Much like the ODE sensitivity stuff - from looking at those loops in that code I would never understand what’s going on. My suggestion would be to put that PDF together once the design is settled.



I have made a number of changes:

  1. I broke up the setup_call into smaller building blocks and introduced a few scatter + broadcast variants tailored to our needs. I hope these are now easier to follow.

  2. I moved the functions dealing with the cache into mpi_parallel_call, as these functions only make sense when called with callsite_id_, so I thought it was better to move them into the class. BTW, C++ is quite amazing in what the syntax enables.

  3. I added a get_output_size function to ReduceF, as we need to know the output size from ReduceF without ever calling the function. This is needed because ReduceF::apply may throw on the very first evaluation; in that case we would never know what the correct output size is. However, we must know it in order to keep the cluster in a synchronized state even when an exception is thrown.

  4. I added exception handling. Even if any of the jobs throws, we can only stop executing further work; we must ensure that each worker always returns the same amount of data to the root, otherwise things would get out of sync. This leads to special considerations for the very first evaluation, which we use to determine output sizes and cache them. If any job throws on the first evaluation, then we must not cache the output sizes anywhere; we can only cache them if all evaluations on all nodes were successful.

  5. I figured out a way to use google test to run MPI tests. For that I needed to hijack the makefile make/tests. Consider it a workaround for now which we need to work out.

  6. I added a ReduceF for the vv case.

  7. Now we have some basic first tests for the dd and for the vv case!! I have put them into the canonical places.
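The first-evaluation rule from point 4 can be sketched roughly as follows (illustrative names, not the actual mpi_parallel_call code): output sizes may only be cached if every job succeeded, and a failed job must still report a result so the cluster stays in sync.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the point-4 logic: run every job on the first
// evaluation, record each output size, and flag the whole evaluation as
// non-cacheable as soon as any job throws. Even a failed job contributes
// an entry (size 0) so the per-job bookkeeping stays synchronized.
struct FirstEvalResult {
  bool all_ok = true;
  std::vector<int> output_sizes;  // one entry per job; 0 marks a failed job
};

template <typename F>
FirstEvalResult first_evaluation(F job, int num_jobs) {
  FirstEvalResult r;
  for (int j = 0; j < num_jobs; ++j) {
    try {
      r.output_sizes.push_back(job(j));  // job reports its output size
    } catch (...) {
      r.output_sizes.push_back(0);  // still report something for this job
      r.all_ok = false;             // and make sure nothing gets cached
    }
  }
  return r;
}
```

Only when all_ok is true on every node would the sizes be written into the cache; otherwise the next evaluation starts from scratch.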

Right now there are still lots of cout debug messages which I will remove soon. I am now in the position to update my old benchmark tests to work with this branch. I really hope that the performance is good and then we can start with inclusion from my perspective. Of course, subject to your thinking on the current state of affairs.



Just highlight and a pop-up “quote” button pops up. Maybe it’s harder under Linux.

The point is that if you have a large matrix, it will be a lot slower to loop out of order than in order. It gets worse in higher dimensions. Yes, this is a big deal. Sometimes it can’t be avoided, as in a matrix-matrix multiply where the left argument needs to be accessed row major.
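A minimal sketch of that point: both loops below visit every element of a row-major buffer and compute the same sum, but the second strides by `cols` on each step, which for large matrices defeats the cache (this is illustrative only; profile at realistic sizes to see the effect).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-order traversal of a row-major buffer: unit stride, cache friendly.
double sum_in_order(const std::vector<double>& m, std::size_t rows,
                    std::size_t cols) {
  double s = 0;
  for (std::size_t i = 0; i < rows; ++i)
    for (std::size_t j = 0; j < cols; ++j)
      s += m[i * cols + j];
  return s;
}

// Out-of-order traversal of the same buffer: stride of `cols` per access,
// which blows the cache once the matrix no longer fits in it.
double sum_out_of_order(const std::vector<double>& m, std::size_t rows,
                        std::size_t cols) {
  double s = 0;
  for (std::size_t j = 0; j < cols; ++j)
    for (std::size_t i = 0; i < rows; ++i)
      s += m[i * cols + j];
  return s;
}
```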

No. It’s a contiguous sequence of var, but each var is only a pointer to a vari, which isn’t necessarily contiguous in memory. So you need to chase the pointer to the vari and then pull out the value.
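A toy sketch of why that is (simplified stand-ins, not the actual Stan Math types): each var holds only a pointer to a heap-allocated vari, so extracting the double values means one pointer chase per element rather than reading a contiguous buffer.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for the autodiff implementation type: the value
// lives on the heap, wherever the allocator put it.
struct vari {
  double val;
  explicit vari(double v) : val(v) {}
};

// Simplified stand-in for var: the object itself stores only a pointer,
// so a std::vector<var> is contiguous pointers, not contiguous doubles.
struct var {
  vari* vi;
  explicit var(double v) : vi(new vari(v)) {}
};

// Extracting a double-only representation requires chasing each pointer.
std::vector<double> value_of(const std::vector<var>& params) {
  std::vector<double> out;
  out.reserve(params.size());
  for (const auto& p : params)
    out.push_back(p.vi->val);  // pointer chase per element
  return out;
}
```

This is why the parameters have to be copied into a plain double buffer before they can be handed to MPI as a single chunk of memory.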

Right. This is a code review, not an API design review. Private, public, etc. are all fair game. The point is that we want to write the code so that it’s understandable and maintainable. The private stuff has the same requirements for modularity in making the code readable as the public components.

I think you should do it first as it’ll help us understand where you’re coming from.

This will need to be tested for consistency as we go on. We can’t assume that a function written in Stan will always return an output of the same size.


It is never harder under Linux.


Ah, now I got it! This marking stuff even works for multiple quotes… cool!

And for the record… I am often on Windows (everything is hard) or on Mac (much better)

Yes, we will throw errors in case there is ever any deviation in the output sizes after the first successful evaluation. BTW, we will allow the non-mpi version of the function to be liberal in what is returned.

uh… high standards… as usual. I really hope we are there soon.

For now @seantalts is following the design closely, so (I think) he has most of the details present as well. For the inclusion we should have more eyes, and that PDF should ease the review, yes.


bump… I just fired off new benchmarks using the new code on our cluster. Fingers crossed all is OK and tomorrow I can make a new benchmark plot to confirm that the new design is solid and performs great (on my 4-core laptop I saw yesterday a 3.7x speedup on an ODE problem).

@seantalts any thoughts on the code at this point?


This looks very promising. My first run completed with the new prototype had these properties:

  • analytic oral 2-cmt pk model which has 3 states and 5 parameters
  • J=16000 subjects were simulated
  • out of the 5 parameters, 4 were shared and only 1 was subject-specific

The run time of the single-core MPI run was 10.5h; with 16 CPUs this goes down to 50 minutes, which is about a 13x speedup, and with 20 CPUs we are at 45 minutes and a 14x speedup. Note that these speedups are relative to the single-core MPI version, while the serial non-MPI code should be a bit faster (~10%), so the actual speedups are a bit lower. I will need to make that serial run.

Right now I have started another run which treats all 5 parameters as subject specific which increases the stress on the MPI system.

All tests are run with the latest develop code such that results are quite different from previous runtimes which were affected by the std::string bug.

I think we can very likely conclude that the current prototype does a decent job. The signatures put forward in the Stan language are:

  • The user defined function needs to have the signature
    vector mpi_function(vector shared, vector job_specific_params, real[] x_r, int[] x_i)
  • The map_rect stan function has the signature
    vector map_rect(mpi_function, vector shared, vector[] job_params, real[,] X_r, int[,] X_i)

Let’s discuss at today’s meeting if everyone is happy with this or more tests should be done.


That all sounds reasonable to me. Sorry I couldn’t make the meeting. Mitzi, Sean and I are teaching next week and tied up getting our course materials together. Should be back to normal as of Monday 11 December.


I have just figured out that the callsite_id could be used more efficiently if we make it part of the template arguments to mpi_parallel_call. Doing so allows us to get rid of the std::map, so we can directly access the correct cache items. However, the largest benefit is that we save a full broadcast call for each evaluation of the map_rect call! And that’s quite cool, since latency is always an issue and we save this on every function evaluation. We simply make the callsite_id part of the type which is transferred. The only downside is that you now have to register, for every call to map_rect, the respective types with the boost serialization system. However, as I have created macros for this, I think that is not too painful.
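The core of the idea can be sketched in a few lines (illustrative, not the actual mpi_parallel_call): making the callsite id a template parameter turns the per-callsite cache into a distinct static object per instantiation, so no runtime std::map lookup is needed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: each callsite_id instantiates its own class, and
// with it its own static cache storage, resolved at compile time. The
// map-from-runtime-id indirection disappears entirely.
template <int callsite_id>
struct callsite_cache {
  static std::vector<double>& data() {
    static std::vector<double> cached;  // one independent cache per id
    return cached;
  }
};
```

callsite_cache<1>::data() and callsite_cache<2>::data() are separate objects, which is why the per-evaluation broadcast of the id becomes unnecessary: the id is baked into the type.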

I hope you agree on the design as is. Let me know if you have more thoughts and comments.


Is that flexible enough? You’d have to know all the IDs statically.


We are in any case forced to create a macro call per functor that we want to be able to call remotely. Making the callsite part of this definition means doing a macro call per functor+data combination. The upside is reduced MPI communication and a cleaner, more straightforward way of handling the static data (no more need for a map).


For anyone that is interested, @wds15, @Bob_Carpenter, and I will meet tomorrow at 10 am Eastern time on Google Hangouts to get into technical details.

If you’d like to join, please let me know and I’ll invite you to the hangout.


For those who join, it would be great to grab the respective branch on stan-math

and have the doxygen documentation built.


What in the doxygen doc should I be looking at? I have it built, but I’m not really seeing anything too helpful, so I must be missing the important part.