MPI Design Discussion


What I’m looking for is what the Stan code is going to look like, not the C++. I want to see how the user writes something that feeds data into all this. Then I think I’ll be better able to think about what we need to support it. As is, I don’t get this binary_data_function thing at all.


Hi Bob!

The binary_data_function struct is just a helper object (as is the mapped_functor). It communicates what is to be done using type-only information. A lot of this extra code would need to be generated by the Stan compiler. Here is a Stan program as it could look:

functions {
  real hard_work(real param, real[] bar, int[] ind) {
    return param + bar[1];
  }
}
data {
  // note: the order of the declared data is later the index to access
  // the data in the global and static tuple
  real foo;
  real bar[3];
  int ind[2];
}
transformed data {
}
parameters {
  real theta[1000];
}
transformed parameters {
  real stuff_for_model[1000];

  stuff_for_model = mpi_map(hard_work, theta, bar, ind);
}
model {
  theta ~ normal(0, 1);
}

There are a few (over)simplifications here. As we pass the whole data into each evaluation for a parameter, we should also pass the job id (1 to 1000) as an integer to the hard_work function.

I hope this makes it clearer where I am heading. I think all the pieces are now there to do it.


I want to summarize where we’re at in terms of proposals.

For each of the concrete proposals, I provide the signature of the map function as it would look from the Stan language and the definition of the signature for the function argument f.

In all cases, the rough structure is a map-reduce that follows the actual map function being defined in the Stan language:

  • a root process (rank 0)

    • sends operands to child processes
    • receives values and derivatives back from child processes
    • integrates values and derivatives into the expression graph
  • child processes (rank > 0)

    • receive operands from root process
    • calculate value and derivatives
    • send value and derivatives back to root process

Rectangular, retransmit data

vector[] map(F f, vector[] thetas,
             vector[] xs_r, int[,] xs_i)

vector f(vector theta, vector xs_r, int[] xs_i)

Pros:

  • function is simple and encapsulated
  • clean math library implementation
  • doesn’t go beyond what already exists in the Stan language
    • could use standalone function implementations as is for the workers (non-root)

Cons:

  • retransmits data
  • rectangular only, so may require ugly and expensive-to-transmit padding
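As a reference point, here is a serial (non-MPI) sketch of what the rectangular map computes: one shard of parameters plus one shard of retransmitted data per job. This is a sketch only; map_rect_serial is a hypothetical name, and std::vector stands in for Stan’s vector types.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical serial semantics of the rectangular map: job k is
// evaluated on parameter shard thetas[k] and data shards xs_r[k], xs_i[k].
std::vector<std::vector<double>> map_rect_serial(
    const std::function<std::vector<double>(const std::vector<double>&,
                                            const std::vector<double>&,
                                            const std::vector<int>&)>& f,
    const std::vector<std::vector<double>>& thetas,
    const std::vector<std::vector<double>>& xs_r,
    const std::vector<std::vector<int>>& xs_i) {
  std::vector<std::vector<double>> result;
  result.reserve(thetas.size());
  for (std::size_t k = 0; k < thetas.size(); ++k)
    result.push_back(f(thetas[k], xs_r[k], xs_i[k]));  // job k gets shard k
  return result;
}
```

The MPI version would distribute the loop body across ranks but keep exactly this input/output shape.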

Ragged, retransmit data

vector map(F f,
           vector thetas, int[] theta_ends,
           vector x_r, int[] x_r_ends,
           int[] x_i, int[] x_i_ends)

vector f(vector theta, vector xs_r, int[] xs_i)

Pros:

  • all of the pros of the rectangular version
    • function is the same, so still encapsulated
  • allows arbitrary raggedness (user must know ragged result bounds)

Cons:

  • retransmits data
  • awkward raggedness without built-in ragged structures
  • return sizes implicit
    • could add an int[] result_ends argument for anticipated sizes and check that results match
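To make the ends-array packing concrete, here is a sketch of unpacking job j’s shard from a flat array. The helper name shard and the convention that ends[j] is the one-past-last index of shard j are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: extract job j's shard from a flat array packed
// with an "ends" array, as in the ragged map signature above.
// Assumed convention: ends[j] is the one-past-last index of shard j.
std::vector<double> shard(const std::vector<double>& flat,
                          const std::vector<int>& ends, int j) {
  int start = (j == 0) ? 0 : ends[j - 1];
  return std::vector<double>(flat.begin() + start, flat.begin() + ends[j]);
}
```

The same unpacking would apply to thetas/theta_ends, x_r/x_r_ends, and x_i/x_i_ends.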

Rectangular, Fixed Data

The data variables x_rs and x_is must be the names of data or transformed data variables, so that they can be loaded once and reused.

vector[] map(F f, vector[] thetas, vector[] x_rs, int[,] x_is)

vector f(vector theta, vector x_r, int[] x_i)

This one has a map that sends the data in only once, but x_rs and x_is must be the names of variables in the data or transformed data block. Each child process will need to load the data from the model, but then only needs to hang on to its own slice, x_rs[k] and x_is[k]. Then the function can be taken from the standalone function compilation.

Pros:

  • reads data once on each child process
  • easier to implement than a closure
  • easy implementation without MPI
  • should perform better than transmitting the data each time

Cons:

  • has to instantiate all of the model’s data on each process, even though it only needs a slice
    • back of the envelope, this should be OK for PK/PD apps
    • might not be so OK for distributing big regressions
  • function f needs to know how to grab its slice of data from the index

Could generalize this to ragged in the same way as before.
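The “hang on to its own slice” idea needs some rank-to-job assignment. A plain block distribution would look like the sketch below; job_range is a hypothetical helper, not proposed API.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical block distribution: with J jobs and W workers, worker w
// (0-based) keeps jobs [first, first + count). Each child instantiates
// the full data once and can then drop everything outside its range.
std::pair<int, int> job_range(int J, int W, int w) {
  int base = J / W;  // every worker gets at least this many jobs
  int rem = J % W;   // the first `rem` workers get one extra job
  int first = w * base + std::min(w, rem);
  int count = base + (w < rem ? 1 : 0);
  return {first, count};
}
```

With 10 jobs on 3 workers this yields ranges of sizes 4/3/3, covering every job exactly once.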

Closure, send data once

vector map(F f,
           vector thetas, int[] theta_ends)

vector f(vector theta, int fold);

Pros:

  • data closure means no data arguments
    • could be a big win in reducing communication (latency is still there in calls, plus overhead for waits)

Cons:

  • requires functions that are closures over data
    • requires a major extension to the Stan language parser and code generator, plus a lot of doc explaining them
    • generate as member functions rather than static
    • might like to have in the language anyway
  • could require too much memory per worker, as each worker gets everything in the instantiated model
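In C++ terms, the closure variant amounts to capturing the data once so the mapped function’s signature shrinks to (theta, fold). A minimal sketch, where make_closure and the payload are illustrative rather than proposed API:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch (C++14): the data lives in the lambda capture, so
// the mapped function sees only (theta, fold), matching the proposal.
auto make_closure(const std::vector<double>& x_r) {
  return [x_r](const std::vector<double>& theta, int fold) {
    return theta[0] + x_r[fold];  // fold selects this job's data
  };
}
```

The generated Stan code would play the role of make_closure, closing over the model’s data/transformed data members.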


I read the Boost MPI interface and think I finally see where Sebastian’s coming from with all of this. I updated the design discussion in the previous post to include what I think is Sebastian’s preferred approach, which is the rectangular/fixed data. We could also do it ragged.

It doesn’t seem like this will require too much work to implement and it’s relatively clean for the language.

What it doesn’t do is provide a neat way to build this on top of a Stan math library function. At least I don’t see how to do that.


That was a long meeting yesterday, but very productive.

For stan-math there is a neat way with closures, I think.

We decided to create an object, part of stan-math, which will wait for those command-type objects and execute them. All functors will be run on the child in the context of that class. Let’s call this the waiter object.

We could then place our static data as a C++ tuple into the waiter object. The waiter object can then make the static data available to the called functor via the tuple which contains the static data at the nodes.

The upside of this approach is that no data ever gets transmitted, as we can assume the data is read in locally on the child. Hence the data can be anything and of any complexity, since we never transmit it. The function signatures would be

vector map(F f,
           vector thetas, int[] theta_ends, ...static data ...)

vector f(vector theta, int fold, ...static data...);

The static data part would be the same on all nodes and the same during all calls of the mapped function. This design would give users the greatest flexibility, as we do not force them to put everything into a plain array; one can just use all declared data in whatever shapes are convenient to write down the problem. With C++11 we could use parameter pack template arguments to make this sort of thing work.
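A minimal sketch of what the parameter-pack version could look like, with std::vector standing in for Stan types and fold doubling as the job index; map_params is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: the trailing parameter pack forwards arbitrarily
// shaped static data straight to f, so users are not forced to flatten
// everything into real/int arrays.
template <typename F, typename... StaticData>
std::vector<double> map_params(const F& f,
                               const std::vector<std::vector<double>>& thetas,
                               const StaticData&... data) {
  std::vector<double> out;
  for (std::size_t fold = 0; fold < thetas.size(); ++fold)
    out.push_back(f(thetas[fold], static_cast<int>(fold), data...));
  return out;
}
```

On a child, data... would refer to the locally loaded static data rather than anything sent over MPI.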

I am sorry, but I missed yesterday that the data can be assumed to be already on the node, and as such it can be complicated. In a way this introduces a stateful function, because that static data needs to be available - BUT stan-math used with MPI is a stateful thing anyway, because that waiter object must be in place. As such it’s a small step to expect that the static data is already loaded on the worker.

In short: stan-math becomes stateful anyway by going MPI, so we can leverage that to get the static data. This will make the function much easier to use for Stan and stan-math users.



Is ...static data... going to be a sequence of data or transformed data variable names?

I take it the idea is to then generate a class like the model class in which those variable names are instantiated with actual data from the member variables rather than being passed as arguments. And then f wouldn’t actually be passed those arguments, it’d just access them locally.

That probably wouldn’t be too hard to generate. So if we had something like

map(foo, thetas, theta_ends, x, y, N);

then we’d generate a functor class like

template <typename T>
Matrix<T, -1, 1> foo(const Matrix<T, -1, 1>& theta, const MatrixXd& x,
                     const VectorXd& y, int N) {
  ... code generated for body of foo ...
}

struct mapped_functor {
  MatrixXd x;
  VectorXd y;
  int N;

  template <typename T>
  Matrix<T, -1, 1> operator()(const Matrix<T, -1, 1>& theta) const {
    return foo(theta, x, y, N);
  }
};
And then the mapped functor is the thing that actually gets called in the child main processes. So it would require the main program for the child to create the mapped functor.
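To illustrate the child-side construction, here is a minimal non-Eigen stand-in: the members hold the locally loaded data and operator() forwards only theta. The names and payload are made up, not the actual generated code.

```cpp
#include <cassert>
#include <vector>

// Non-Eigen analog of the generated functor: members carry the locally
// loaded data; operator() forwards only the parameters.
struct mapped_functor_sketch {
  std::vector<double> x;
  double y;
  int N;
  double operator()(const std::vector<double>& theta) const {
    return theta[0] * x[0] + y + N;  // stand-in for foo(theta, x, y, N)
  }
};
```

The child’s main program would build one of these from the model’s data and then hand it to the waiter loop.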

P.S. Stan math already is stateful - autodiff is stateful and its state is held in a global singleton. Also, any uses of the LDLT types are stateful in that they hold onto data and/or solutions.



That is exactly what I meant. The ...static data... are data arguments declared in data or transformed data block.

This design would give users the easiest way to carry out parallel computations. Making it possible to pass arbitrary user-defined data structures will allow users to write down their model in a very natural way; they aren’t forced to cram everything into rigid int/real arrays.

I think that this is the design we should go for. This could also be the design pattern which we put into stan-math, but maybe we can easily support both cases - the static data only case and the retransmit data case.



OK, I think we’re in agreement and I think it’s possible.

If the design works out, we might be able to use the Stan math function for the Stan language. I still don’t quite see how that would fit together as the functions are getting configured out-of-process and I don’t see how to control that from within Stan.


I see two solutions for Stan to solve this out-of-order creation issue of the functors: either we make these mapped functors default-constructible by using a singleton design, or the waiter object holds an instantiated version of the mapped_functor which already contains the user data. The latter option could be implemented with a fusion map, for example (the functor type being the key of the fusion map).
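The singleton option can be sketched as follows: a process-wide registry holds the data, so the functor itself is default-constructible and can be created out-of-process on the child. All names here are illustrative assumptions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical process-wide registry for the child's static data.
struct data_registry {
  std::vector<double> x_r;
  static data_registry& instance() {
    static data_registry r;  // Meyers singleton, one per process
    return r;
  }
};

// Default-constructible functor: no data members, so the child can
// construct it without arguments and pull data from the registry.
struct default_constructible_functor {
  double operator()(const std::vector<double>& theta, int fold) const {
    return theta[0] + data_registry::instance().x_r[fold];
  }
};
```

The fusion-map alternative would instead key pre-instantiated functors (with their data baked in) by functor type.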

For stan-math I am thinking about a mechanism where the map implementation itself deals only with parameters and for the case with data we provide a mapped functor which transmits data instead of using the local one.


I guess we should get active again. After putting some more thought into it, I suggest starting with an MPI implementation which lives in stan-math and proceeding as follows:

  1. an implementation of an MPI map which takes a functor accepting only parameters - I think of this as a parallel version of a jacobian function applied to a large set of parameter vectors.
  2. an implementation which takes parameters and data will be included as well. This stan-math implementation will always retransmit the data to the children and will be based on the code of (1).
  3. the stan-math implementation will include an mpi_commander object (or whatever we want to call it) which will control the program flow on the children. That is, this thing will wait for work chunks from the root process and, when destructed, will free all MPI resources.
  4. after that we start with Stan, which will expose the functionality as discussed; by that I mean it will allow for static data which is loaded on the children, and that static data will never be sent over MPI.

So unless someone objects to starting with a parameter-only MPI map, I will start with that. I know we discussed going with a parameter-and-data version in our last meeting, but from a design perspective I find this approach more appealing (if things work out as I have them in mind).

At this stage it’s too early to write an issue. My suggestion is to start a new branch where I put in draft code which nails down the interfaces, with the functions themselves filled only with pseudo-code. Once that is up, it would be great to have a discussion to settle on this design before starting the actual coding.

Does that sound like a plan to move forward?



I think I understand, but if you could write out the signature of the Stan math function, it would help make sure we’re on the same page.

If there’s just one big set of fixed data, then you can always encode it in the functor.

Do you see some version of this function being exposed to Stan programs?

Mitzi’s almost done with the data-argument-only constraints so all this will work with functions with appropriate compile-time type checking.


The intent of the prototype branch is to first spell out only the function signatures and put dummy code into them. Writing this is easier for me with git/emacs than in this discourse… but I can link those files to this thread, of course.

An always-retransmit-data version can be useful to have in Stan as well, since there would be no restrictions on where this thing is called (the data can be anything). I would still prefer to put resources into the more flexible static-data-only version we discussed above. So unless it is really straightforward to expose this to the Stan language, I would save our time for the flexible static data thing. BTW, great to know that Mitzi is laying the foundation for this!

One more point: In the non-static data variant I think we have the choice between

  1. always send all data to all children and give the supplied user function the batch index of the processed parameter set
  2. or split the data up like the parameters (requiring ragged arrays or rectangular array shapes), in which case we do not need to pass the batch index to the user functor

Option 1 would be consistent with what we have planned for the static data case. Option 2 should be more efficient since less data is transferred over MPI: each child receives only the Nth part of the data (if N is the number of jobs). Any thoughts or preferences here?
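Option 2 amounts to chunking the data the same way the parameters are chunked. A sketch under the simplifying assumption of equal-sized (rectangular) chunks; split_chunks is a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of option 2: the data is split like the
// parameters, so child k receives only chunk k and no batch index is
// needed. Assumes data.size() is divisible by num_jobs (rectangular).
std::vector<std::vector<double>> split_chunks(
    const std::vector<double>& data, int num_jobs) {
  std::size_t chunk = data.size() / num_jobs;
  std::vector<std::vector<double>> out;
  for (int k = 0; k < num_jobs; ++k)
    out.emplace_back(data.begin() + k * chunk,
                     data.begin() + (k + 1) * chunk);
  return out;
}
```

Option 1 would skip the split and instead broadcast the whole array with the fold index; option 2 transmits 1/N of the data per child.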

Ah, and one last point to settle: should we use vectors and matrices (Eigen types) or arrays (std::vector)? I think I always wrote down the function signatures with arrays, since this is what we did for the ODEs - but your dummy prototypes always used Eigen types. I don’t mind either way, but if we could agree on that now, that would be good - so arrays or Eigen?

Bonus implementation question if the answer is Eigen: is it OK to still use Eigen data structures internally, which would make my life simpler in implementing this… or should I minimize dependencies and stay away from Eigen in case we go with an array function signature?


Thinking it through, I ended up drafting a map_rect_mpi function which takes rectangular parameters and rectangular data as input. Here is the dummy code:

Feedback would be great.

This is what we discussed, as I understood it. My recommendation would now be to turn this into a prototype and rerun the performance test I did. The old performance test used a static data concept and avoided resending data, so we should find out what resending data costs.


Wasn’t there a way to do this without using the generic Boost serialization? I think the shapes are all known for what’s being sent.

It looked like using the serialization framework was slow in their evaluations (I very much appreciated the way Java handled serialization—it could be generic or hand-coded depending on how fast you wanted it to go).


Cool. Prototype runs (I just pushed to math)!

Tomorrow I will try to setup a benchmark along the lines of my first toy example.

BTW, currently I plan to make the prim version of the map a serial only implementation to begin with.

It would be good if we could refactor the code so that we don’t need to copy so much code around when we want to MPI-fy something. Let’s see.



Do you think we should maybe use hashes to detect if some chunk of data has already been sent?

The idea is to cache data which has been sent and use hashes to detect already-sent data. This way we end up with a send-once data version.

Anyway, I spent time on refactoring and ended up with a version which can now be used rather flexibly. That is, the current version should make it easy to send a data block once and then retransmit only the new parameters.
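The send-once idea can be sketched as a hash cache on the root: hash each data block and only transmit blocks whose hash has not been seen. This is purely illustrative - a real implementation would need a collision-safe scheme, and std::hash over the elements here is just a stand-in.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical send-once cache: needs_send() returns true only the
// first time a given data block (by hash) is encountered.
struct send_cache {
  std::unordered_set<std::size_t> seen;
  bool needs_send(const std::vector<double>& block) {
    std::size_t h = 0;
    for (double d : block)  // simple hash combine over the elements
      h ^= std::hash<double>{}(d) + 0x9e3779b9 + (h << 6) + (h >> 2);
    return seen.insert(h).second;  // newly inserted -> must send
  }
};
```

Whether this wins depends on hash throughput versus network throughput, as discussed below in the thread.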


Can’t we just check the memory location? If it’s data or transformed data, that’s not going to change. If it’s not, we probably just want to resend it.


I think we want serial only versions of everything. Then you can drop in the parallel versions seamlessly where they seem most useful. We do that in some other autodiff settings, I think.


For the first stan-math-only implementation it will be hard to handle the static data case. Since stan-math cannot assume that clever things are in place, we have to treat the data as varying from call to call. Calculating hashes should be fairly cheap compared to resending (I will try that out) and could be a nice solution.

In my head we are planning a basic stan-math version with rectangular shapes as a start, and then we will add another version which uses static data only and takes a variable number of static data arguments as input, as discussed above. Right?


My guess is that hashing will cause a noticeable slowdown, but it may still be a win. The problem with sending data is going to be partly latency, but we have to pay the latency anyway to send the parameters; then it’s a matter of whether data throughput is faster than hashing throughput, and that will depend on the hash algorithm and on network capacity.