MPI Design Discussion



  • Can we use signed integer identifiers? I find unsigned integers too error prone.

  • You don’t need the size argument if x_r and x_i are containers; the sizes have to match, and whatever size they are gives you the argument value anyway.

  • The serial implementation is going to need to figure out how to store the data, but that should be worth the tradeoff; it should be doable with a global std::map, though that’s just one more thing to deal with if we ever try to get multi-threading going.

  • Why generate an ID? I’d have thought it’d be easier to have the client pass it in.

  • I don’t think map should be in the name, because the data distribution is pre-map and the function includes the reduction step. I’m going to suggest including _mpi, but I’m on the fence about that one, too, since it will be implemented sequentially without MPI if MPI isn’t installed.

  • I don’t understand how you can pass a functor—in general, a functor can hold data. And it’s a local reference. See below.

All in, I’d think it could be just this:

void register_data_mpi(const VectorXd& x_r,
                       const vector<int>& x_i,
                       long id);

template <class F>
Matrix<var, -1, 1>
call_mpi(const F& f,
         const Matrix<var, -1, 1>& theta,
         long id);

You could get rid of the f argument to call_mpi if you replaced the use of f(...) in the program with F::apply(...). So F would have to implement this concept:

static Matrix<var, -1, 1>
F::apply(const Matrix<var, -1, 1>& theta,
         const MatrixXd& x_r,
         const vector<int>& x_i);

That solves the problem with functors having data in general.
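To make the static-apply concept concrete, here is a minimal, self-contained sketch. It uses std::vector<double> as a stand-in for the Eigen/var types, and my_model and call_mpi_sketch are made-up names for illustration; the point is that the caller is parameterized only by the type F, so there is no functor object that could carry unserializable local data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical functor satisfying the proposed concept: all behavior lives
// in the static apply, so no functor object (and no hidden state) exists.
struct my_model {
  // std::vector<double> stands in for Matrix<var, -1, 1> / MatrixXd here.
  static std::vector<double> apply(const std::vector<double>& theta,
                                   const std::vector<double>& x_r,
                                   const std::vector<int>& x_i) {
    std::vector<double> out;
    out.reserve(theta.size());
    for (std::size_t i = 0; i < theta.size(); ++i)
      out.push_back(theta[i] * x_r[i % x_r.size()] + x_i[i % x_i.size()]);
    return out;
  }
};

// The caller is parameterized only by the type F; F::apply replaces f(...),
// so nothing stateful ever has to be shipped to the workers.
template <class F>
std::vector<double> call_mpi_sketch(const std::vector<double>& theta,
                                    const std::vector<double>& x_r,
                                    const std::vector<int>& x_i) {
  return F::apply(theta, x_r, x_i);
}
```

Calling it as call_mpi_sketch<my_model>(theta, x_r, x_i) mirrors how call_mpi would be invoked without ever constructing an instance of F.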

  • Signed ints are fine as well; any data type which is small and native to MPI works.

  • Most sizes still need to be transmitted, as this information is needed when looping through the data (remember that I flatten all the data before transmitting).

  • I have used a singleton for a std::map to store the data locally. Works like a charm.

  • We can generate IDs or let the user specify them; we should probably support both schemes. User-provided IDs make sense in auto-generated Stan programs, whereas generated IDs make sense in ordinary C++ programs, I think.

  • Naming schemes are all up for discussion.

  • Hmm… I see your point on the functor. Maybe this is a good reason to go with the apply approach you suggest. Since the implementation is based on tossing types around anyway, the design will reflect that.
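The singleton-map idea mentioned above can be sketched as follows (all names here are hypothetical; the actual prototype will differ). A Meyers singleton gives one process-local store keyed by id, which is also exactly the piece that would need a mutex if multi-threading ever happens:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// The data registered once per id: flattened real and integer data.
struct cached_data {
  std::vector<double> x_r;
  std::vector<int> x_i;
};

// Meyers singleton: constructed on first use, one instance per process.
// Note: not thread-safe to mutate without a lock.
std::map<std::size_t, cached_data>& data_cache() {
  static std::map<std::size_t, cached_data> cache;
  return cache;
}

// Register once; later calls just look the data up by id.
void register_data(std::size_t id, std::vector<double> x_r,
                   std::vector<int> x_i) {
  data_cache()[id] = cached_data{std::move(x_r), std::move(x_i)};
}
```

After register_data(id, …) has run on each worker, a second call with the same id can skip the transmission entirely and read from data_cache().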

I have just pushed a (messy) prototype which implements the idea and recycles data for a second call without the need to retransmit.

BTW, as we have now arrived at a good approach to send data once, I am starting to think that we may never need a ragged version. The output of the function must be ragged, but not the input data. Given our data-caching mechanism, it is now easy to just make all data rectangular and pad it out; there is virtually no price to pay for that. However, the outputs must be ragged for good performance, to keep the AD stack small. In the case of regular output, the user can easily convert the output to a matrix using some of the “to” functions.

If we go ragged at all, then the parameters should be ragged, which is something I have not yet come across…


This all sounds great.

Yes, I see why you need them on the back end, but we shouldn’t need them in a function call that takes two std::vector arguments that have to be that size—it’s just redundant. But maybe I’m misunderstanding.

I think we can make the user supply such IDs either way.



So I think I have now reached a state where it would make sense to settle on function names and design. Right now I have implemented a scheme like this:

  std::cout << "Distributing the data to the nodes..." << std::endl;
  std::size_t uid;
  // this will ship the data out and output a new uid 
  vector<stan::math::var> res = stan::math::map_rect_mpi<hard_work>(theta, x_r, x_i, uid);
  std::cout << "Executing with cached data for uid = " << uid << std::endl;

  // recycle the data with uid
  vector<stan::math::var> res2 = stan::math::map_rect_mpi<hard_work>(theta, uid);

We could change this to registering data first and then calling map_rect_mpi. In addition, I have followed your suggestion to require that the functor hard_work implements the concept (static apply function) you suggested.

  • I have also spent some thought on the interface with the MPI stuff. Sending a command is now done by a static function which is part of mpi_cluster. For example, mpi_cluster::broadcast_command<stop_worker>() will take down the MPI workers. This command is part of the destructor of mpi_cluster.

  • I also started to follow the idea of the boost::mpi library, where you call a function on the root and on the children with the same arguments, but they do slightly different things depending on where they are called. This leads to code with few if-statements, and overall a lot of code can be shared this way. See, for example, the internal::distribute_map_rect_data function.

  • I introduced a generic MPI command which works by calling the static function distributed_apply of some struct. This avoids coding many command objects.
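A rough, MPI-free sketch of that generic-command idea (run_command, do_work, and work_counter are made-up names; the real version would broadcast the command type over MPI before the workers dispatch): each operation is just a struct with a single static distributed_apply, so no per-operation command class needs to be written.

```cpp
#include <cassert>

// One generic command template: dispatching command<T> on a worker ends up
// invoking T::distributed_apply(), so the only per-operation code is the
// struct below. (The real version serializes the command over MPI first.)
template <class T>
void run_command() {
  T::distributed_apply();
}

// Example operation: a struct with a single static entry point. The counter
// just lets us observe that the dispatch happened.
int work_counter = 0;
struct do_work {
  static void distributed_apply() { ++work_counter; }
};
```

Adding a new cluster operation then means writing one more struct like do_work rather than a new command class plus its registration boilerplate.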

In short, we should maybe discuss the design once more. That is, we need to settle on how things look in stan::math and how stan will use these facilities.

Once we agree on the API we can put down the feature request and then discuss matters on a pull request. Should we have a hangout to talk?


Sounds good and agree it’ll probably need some face-to-face time to understand all this more quickly. I sent you some email offline to coordinate.


I have now included the parts from boost which need to be built as a library. I also started to integrate it with the make system to make this stuff easier to use.

I am still not clear on how to instruct make to use mpicxx when needed. Maybe we just ask the user to call

CC=mpicxx make

when they want the MPI stuff… but all of these are details down the road, I suppose.

Talk to you!


Or ask them to put it in make/local (for cmdstan).


Maybe putting things into make/local works, but I am not sure you want CVODES compiled with mpicxx. Maybe it doesn’t hurt; I don’t know yet.

I thought it would be better to compile only the models with mpicxx and not the rest.


Today I took the time to try out the new MPI code on the toy example from the first prototype. The biggest trouble I had was assembling a working build system and getting the libraries ready. However, I figured it out, and attached are strong and weak scaling results. I am simply speechless… the performance scales linearly in the strong case. For the weak case I am not so sure how well it represents MPI scaling in terms of speedup, as the statistical complexity increases along with the problem size.

Anyway, I want to sort things a bit and hopefully have an easy-to-set-up cmdstan branch available. I am just getting addicted to this performance… going from 2 h down to 7 minutes for the same problem is just crazy (numerically identical results).





I was trying to simplify the build of this, but I am not succeeding in setting up a static-only build due to problems with the boost serialization library, which is quite difficult to deal with when one tries to use static linking.

It would be great if someone could help here. In the meantime I will add to the prototype branch a readme which explains how to build using a dynamically linked serialization and MPI library. Or would dynamic linking be OK for deployment? Dynamic linking is obviously more error prone.



This might work better as a separate thread about the problems you are having—it seems like the design of (at least) the prototype is settled here.



I think you’ll need @seantalts or @syclik for this, but please open a new thread for build issues as @sakrejda suggests.


I’m not sure what you mean by “weak scaling”. You mean you increase problem size and number of cores and expect a constant run time?

I also don’t know what “fill-up scheduling” is or what you are referring to as J_0.

  • Weak scaling: per CPU I am fitting 80 ODEs. So 1 CPU = 80 ODEs, 4 CPUs = 320 ODEs, and so on. Yes, I would like to see constant execution time in that case. However, as the statistical complexity increases as well, this won’t hold. Still, the information from this plot is a good one, in that it takes only 2.5x the time to crunch a problem of 1600 ODEs ( ! ) vs attacking 80 ODEs on a single core. This is fantastic news for PK/PD problems: these problems range from ~50 to ~1000 patients, and this plot tells us that the execution time will be OK (at most 1 h vs almost a day of computing time). This is nothing short of a breakthrough for Stan in these problems.

  • By fill-up scheduling and 20 CPUs I refer to the fact that our cluster has nodes with 20 physical CPUs each, and the scheduler will fill those 20 CPUs up before it moves to the next node. In short: all these runs were executed without any network latency, as all communication happened on the same machine.

Regarding the design: by now we only need to settle on how to name things and where to put them. For example, I am leaning towards introducing a stan::math::mpi namespace where we can place MPI-specific things, but we can discuss this once you are back.

I will try a few more things to get the build straight and then start a new thread for this.


This is huge and exactly why this is the general-purpose feature I’ve been most excited about since user-defined functions!

It’ll potentially give us a couple of orders of magnitude scalability, which is just amazing at this stage. We still have a bunch of improvements we could make down in the guts (like vectorized, analytic derivatives for transforms).

Combined with sparsity, this will also open up a huge class of “machine learning class” models we can’t tackle well now, like huge sparse regressions and GPs. I’m banking on it working to the point where I’m writing it into a grant proposal with some collaborators to work on sparse GPs.


Yup. Stan goes from a toy for research problems which you use when you “have to” towards something which scales for a broad class of large problems. That’s a different thing.

I also have ideas going with @avehtari where I would like to use stitched GPs with this technique for GPs in PK/PD problems.

Possibly we should write up a manuscript for some peer-reviewed journal on this; just to make it more broadly known what can be done.

Regarding scalability: I could at some stage perform a careful benchmark on our cluster, which can use at most 1500 CPUs on 75 computing nodes (linked with InfiniBand). BTW, I saw that OpenMPI has GPU management options, so this can be combined with GPU stuff at some point.


Sure. I’ll leave that up to you. I’m always happy to give people feedback.

I’ve come to loathe the journal and conference submission process. I gave up submitting to closed-source journals ages ago when I couldn’t even access papers I’d reviewed because I wasn’t at a university. The insane amount of hassle involved in that JStatSoft paper was the final straw.


Ouch! I like to think that research problems aren’t “toys”. But your mileage (or kilometerage) may vary.


I know… painful process! Some papers take literally years.

To add: I am curious how much non-ODE problems benefit from this. ODEs are an obvious win, but how much computational complexity one needs for this MPI stuff to be a win is an interesting question.


“toys”… sorry for that. I meant “not applicable” in the sense that it is not a practical solution to use full blown Bayes on problems where I just have to wait too long in order to make progress.

Not working in research often means having finite time to deliver. I just can’t wait ages for Stan to complete. MPI solves that issue for the problem class I am interested in.

However, for non-PK/PD problems (or analytically solvable PK/PD) Stan can be put to good use already now. It’s just very restrictive for PK/PD, in the sense that the ODE models are solvable in theory, but in practice they are more of a theoretical option given the numerical cost.