MPI Design Discussion

I see two solutions for Stan to solve this out-of-order creation issue of the functors: either we make these mapped functors default constructible by using a singleton design, or the waiter object holds an instantiated version of the mapped_functor which already contains the user data. The latter option could be implemented with a fusion map, for example (the functor type being the key of the fusion map).
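To sketch the fusion map idea (purely illustrative; the functor names and their data members here are made up, not actual Stan code):

#include <boost/fusion/include/map.hpp>
#include <boost/fusion/include/pair.hpp>
#include <boost/fusion/include/at_key.hpp>
#include <vector>

// two hypothetical user functors, each instantiated with its data up front
struct functor_a { std::vector<double> x_r; };
struct functor_b { std::vector<int> x_i; };

// fusion map keyed by the functor type, holding the instantiated functor
typedef boost::fusion::map<
    boost::fusion::pair<functor_a, functor_a>,
    boost::fusion::pair<functor_b, functor_b> > functor_store;

// the waiting process looks up the ready-to-use functor by its type
functor_a& lookup_a(functor_store& store) {
  return boost::fusion::at_key<functor_a>(store);
}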

For stan-math I am thinking about a mechanism where the map implementation itself deals only with parameters, and for the case with data we provide a mapped functor which transmits the data instead of using a local copy.

I guess we should get active again. After giving it some more thought, I suggest starting with an MPI implementation which lives in stan-math and proceeding as follows:

  1. implementation of an MPI map which takes a functor accepting only parameters. I think of this as a parallel version of a Jacobian function applied to a large set of parameter vectors.
  2. an implementation which takes parameters and data will be included as well. This stan-math implementation will always retransmit the data to the child processes and will be based on the code of (1).
  3. the stan-math implementation will include an mpi_commander object (or whatever we end up calling it) which controls the program flow on the children. That is, it will wait for work chunks from the root process, and when destructed it will free all MPI resources.
  4. after that we start with Stan itself, which will expose the functionality as discussed; by that I mean it will allow for static data which is loaded onto the children, and that static data will never be sent over MPI.

So unless someone objects to starting with a parameter-only MPI map, I will start with that. I know we discussed going with a parameter-and-data version in our last meeting, but from a design perspective I find this approach more appealing (if things work out as I have them in mind).
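For concreteness, here is a rough sketch of what I mean by a parameter-only map (the name map_mpi and the serial loop are just placeholders for the real MPI machinery):

#include <stan/math.hpp>
#include <cstddef>
#include <vector>

// applies f to each parameter vector and concatenates the results;
// the loop is a stand-in for distributing the work over MPI
template <typename F>
std::vector<stan::math::var>
map_mpi(const F& f,
        const std::vector<std::vector<stan::math::var> >& theta) {
  std::vector<stan::math::var> out;
  for (std::size_t i = 0; i < theta.size(); ++i) {
    const std::vector<stan::math::var> res_i = f(theta[i]);
    out.insert(out.end(), res_i.begin(), res_i.end());
  }
  return out;
}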

At this stage it's too early to write an issue. My suggestion is to start a new branch where I put in draft code which nails down the interfaces, with the functions themselves only filled with pseudo-code. Once that is up, it would be great to have a discussion to settle on the design before starting the actual coding.

Does that sound like a plan to move forward?

Sebastian

I think I understand, but if you could write out the signature of the Stan math function, it would help make sure we're on the same page.

If there's just one big set of fixed data, then you can always encode it in the functor.

Do you see some version of this function being exposed to Stan programs?

Mitzi's almost done with the data-argument-only constraints, so all this will work with functions with appropriate compile-time type checking.

The intent of the prototype branch is to first spell out only the function signatures and fill them with dummy code. Writing this is easier for me with git/emacs than in this Discourse thread... but I can link those files to this thread, of course.

An always-transmit-data version can be useful to have in Stan as well, since there would be no restrictions as to where it can be called (the data can be anything). I would still prefer to put resources into the more flexible static-data-only version we discussed above. So unless it is really straightforward to expose this to the Stan language, I would save our time for the flexible static data approach. BTW, great to know that Mitzi is laying the foundation for this!

One more point: In the non-static data variant I think we have the choice between

  1. always send all data to all children and give the supplied user function the batch index of the parameter set being processed
  2. or split the data up like the parameters (requiring ragged arrays or rectangular array shapes), in which case we do not need to pass a batch index to the user functor

Option 1 would be consistent with what we have planned for the static data case. Option 2 should be more efficient since less data is transferred over MPI: each child only receives the Nth part of the data (if N is the number of jobs). Any thoughts or preference here?
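To make the two options concrete, the user functor signatures would differ roughly like this (purely illustrative, names made up):

#include <vector>

// option 1: the functor sees all of the data plus the batch index of the
// parameter set it is working on
struct user_functor_all_data {
  template <typename T>
  std::vector<T> operator()(const std::vector<T>& theta,
                            const std::vector<double>& x_r,  // all real data
                            const std::vector<int>& x_i,     // all int data
                            int batch_index) const;
};

// option 2: the data is split up like the parameters, so the functor only
// sees the slice for its own job and no batch index is needed
struct user_functor_split_data {
  template <typename T>
  std::vector<T> operator()(const std::vector<T>& theta,
                            const std::vector<double>& x_r,  // this job's data
                            const std::vector<int>& x_i) const;
};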

Ah, and one last point to settle on: should we use vectors and matrices (Eigen types) or arrays (std::vector)? I think I have always written the function signatures with arrays, since this is what we have done for the ODEs, but your dummy prototypes always used Eigen types. I don't mind either way, but if we could agree on that now, that would be good. So, arrays or Eigen?

Bonus implementation question if the answer is Eigen: is it OK to still use Eigen data structures internally, which would make my life simpler in implementing this... or should I minimize dependencies and stay away from Eigen in case we go with an array function signature?
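Just so we are talking about the same thing, the two candidate styles for, say, the user functor would look roughly like this (illustrative only):

#include <Eigen/Dense>
#include <vector>

// array style, as in the ODE integrators
std::vector<double> f_arr(const std::vector<double>& theta,
                          const std::vector<double>& x_r,
                          const std::vector<int>& x_i);

// Eigen style
Eigen::VectorXd f_eigen(const Eigen::VectorXd& theta,
                        const Eigen::VectorXd& x_r,
                        const std::vector<int>& x_i);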

Thinking it through, I ended up drafting a map_rect_mpi function which takes rectangular parameters and rectangular data as input. Here is the dummy code:

https://github.com/stan-dev/math/blob/feature/concept-mpi/stan/math/prim/arr/functor/mpi_worker.hpp

Feedback would be great.

This is what we discussed, as I understood it. My recommendation would now be to turn this into a prototype and rerun the performance test I did. The old performance test used a static data concept and avoided resending data, and I think we should find out what resending the data costs.

Wasn't there a way to do this without using the generic Boost serialization? I think the shapes are all known for what's being sent.

It looked like using the serialization framework was slow in their evaluations (I very much appreciated the way Java handled serialization: it could be generic or hand-coded depending on how fast you wanted it to go).

Cool. Prototype runs (I just pushed to math)!

Tomorrow I will try to set up a benchmark along the lines of my first toy example.

BTW, currently I plan to make the prim version of the map a serial only implementation to begin with.

It would be good if we can refactor the code such that we don't need to copy so much code around when we want to mpi-fy something. Let's see.

Sebastian

Do you think that we should maybe use hashes to detect if some chunk of data has already been sent?

The idea is to cache data which has been sent and use hashes to detect already-sent data. This way we end up with a send-once data version.

Anyway, I spent time on refactoring and ended up with a version which can now be used rather flexibly. That is, the current version should make it easy to send a data block once and then only transmit new parameters afterwards.
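Roughly what I have in mind for the hash-based caching (just a sketch; send_to_workers is a placeholder for the actual MPI transmission):

#include <cstddef>
#include <functional>
#include <map>
#include <vector>

// cheap hash over a data block
std::size_t hash_data(const std::vector<double>& x_r) {
  std::size_t seed = x_r.size();
  for (std::size_t i = 0; i < x_r.size(); ++i)
    seed ^= std::hash<double>()(x_r[i]) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  return seed;
}

// ship the data only if its hash has not been seen before
void send_once(const std::vector<double>& x_r,
               std::map<std::size_t, std::vector<double> >& cache) {
  const std::size_t h = hash_data(x_r);
  if (cache.count(h) == 0) {
    cache[h] = x_r;
    // send_to_workers(h, x_r);  // the actual MPI broadcast would go here
  }
}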

Can't we just check the memory location? If it's data or transformed data, that's not going to change. If it's not, we probably just want to resend it.

I think we want serial only versions of everything. Then you can drop in the parallel versions seamlessly where they seem most useful. We do that in some other autodiff settings, I think.

For the first stan-math-only implementation it will be hard to handle the static data case. Since stan-math cannot assume that clever things are put in place, we have to treat data as varying from call to call. Calculating hashes should be fairly cheap compared to resending (I will try that out) and could be a nice solution.

In my head we are planning for a basic stan-math version with rectangular shapes as a start and then we are going to add another version which uses static data only and takes a variable amount of static data as input as discussed above. Right?

My guess is that hashing will cause a noticeable slowdown, but it may still be a win. The problem with sending data is going to be partly latency, but we have to pay the latency anyway to send the parameters; then it's a matter of whether data throughput is faster than hashing throughput, and that'll depend on the hash algorithm and on network capacity.

Right.

Hmm... could we ask functors being passed to the map to specify via a type trait whether the data can be assumed static?
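Something like this is what I mean by a type trait (names made up):

#include <type_traits>

// default: the data passed along with the functor may change between calls
template <typename F>
struct mpi_data_is_static : std::false_type {};

// a user functor whose data is known to be static opts in by specializing
struct my_functor { /* ... */ };

template <>
struct mpi_data_is_static<my_functor> : std::true_type {};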

Anyway, I will concentrate on a good first prototype without too many bells and whistles.

Maybe there is an easy way I don't see yet to deal with static data given the restrictions of a stan-math only function.

Sounds like that would work.

What we're wrestling with is that we really want to be able to send the data once, then keep calling the function. How about trying to tackle that directly?

What I'm thinking about is something like a base functor that MPI functors would be required to extend.

struct base_MPI_functor {
  // real data, ragged
  const VectorXd x_r;
  const vector<int> ends_r;

  // integer data, ragged
  const vector<int> x_i;
  const vector<int> ends_i;

  // constructor, called once to set up constant data
  base_MPI_functor(const VectorXd& x_r, const vector<int>& ends_r,
                   const vector<int>& x_i, const vector<int>& ends_i) {
    ... ships data to MPI ...
  }

  // note: a virtual member function template is not legal C++, so read this
  // as shorthand for "derived functors must provide this templated operator()"
  template <typename T>
  virtual Matrix<T, -1, 1> operator()(const Matrix<T, -1, 1>& theta) const
    = 0;
};

The MPI workers then need to get the data and reconstruct the functor. How does this all get compiled? Then when the functor gets called, theta gets passed in and evaluated, and the results get passed back.

I haven't thought about this as much as you, so there might be a huge hole in my reasoning here.

Hmmm... need to think about your approach.

A different thought: we could ask the stan-math user to provide a hash for the data. If the user wants to calculate a hash every time, they can do that.

For Stan programs it would mean that we can detect static data, and if that's the case, we just generate a fixed hash at Stan program compile time.

I think that would work, and we would end up with a map_rect_mpi function which can cache the data on the local workers. It would be the user's responsibility to give valid data+hash combinations. Is that OK?

Why do that rather than just sending the data? Or is the idea that with a new hash the data would change?

I think I have a design idea which would solve our problem:

  1. one calls a function "register_mpi_call" to which you give the real and int data. This distributes the data to the workers and the function returns a hash value.
  2. the actual MPI map is called and we pass in the hash obtained in step 1

Step 2 can be called as often as we wish, given that the data has not changed and the shapes of the parameter block are the same each time (since we can make that part of the hash). If the user calls the facility in step 2 with an invalid hash, then we can detect that because function 1 wasn't called.

This way we will only calculate a hash once and can take advantage of local caching of the data. Moreover, we will be able to detect if the user makes a mistake in using the interface.

OK, that also solves the fundamental problem of splitting the registration of data from calling the function repeatedly.

Rather than a hash, we can just give the data a name. All this kind of naming and hashing is problematic; hashing mainly because you need to be careful to have a test in case of hash conflicts (or take your very good odds of not getting a conflict if you use a rich enough hash).

Yes, all I need is a unique identifier. Ideally that uid is represented by an int, which is cheap to transmit.

Could we have a singleton which sits there in order to

  1. generate a uid, a simple integer which is incremented upon each call of the register function
  2. send out the data and store it locally

Is an int name OK?
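A minimal sketch of such a singleton (all names invented, not the actual implementation):

#include <cstddef>
#include <map>
#include <vector>

class mpi_data_registry {
 public:
  static mpi_data_registry& instance() {
    static mpi_data_registry registry;
    return registry;
  }

  // distributes the data to the workers, stores it locally and hands back
  // a fresh integer uid
  std::size_t register_data(const std::vector<double>& x_r,
                            const std::vector<int>& x_i) {
    const std::size_t uid = next_uid_++;
    x_r_[uid] = x_r;  // the MPI broadcast to the workers would go here
    x_i_[uid] = x_i;
    return uid;
  }

 private:
  mpi_data_registry() : next_uid_(0) {}
  std::size_t next_uid_;
  std::map<std::size_t, std::vector<double> > x_r_;
  std::map<std::size_t, std::vector<int> > x_i_;
};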

This is getting quite nice!

I think I got this idea working. So code will roughly look like this:

  const std::size_t uid = stan::math::internal::distribute_map_rect_data(theta[0].size(), x_r, x_i);

  hard_work f;

  vector<stan::math::var> res = stan::math::map_rect_mpi(f, theta, uid);

So uid is an int which is expected to name all shapes and the data. Internally, each call to the distribute function increases a counter, which makes the number unique. Once the data is distributed it can be used as many times as you wish; you simply have to pass in the right uid. The map_rect_mpi function can do some limited checking (like checking that the number of parameters or the number of parameter sets is the same).

The prototype is now half-way through using this data distribution approach, so not yet ready.