MPI Design Discussion

wds15 · July 22, 2017, 11:07am

I see two solutions for Stan to solve this out of order creation issue of the functors: Either we make these mapped functors default constructible by using a singleton design or the waiter object has an instantiated version of the mapped_functor which contains the user data already. This last option could be implemented with a fusion map, for example (the functor type is the key of the fusion map).

For stan-math I am thinking about a mechanism where the map implementation itself deals only with parameters and for the case with data we provide a mapped functor which transmits data instead of using the local one.

wds15 · August 7, 2017, 8:05pm

I guess we should get active again. After putting some more thought to it, I suggest to start with an mpi implementation which lives in stan-math and proceed as

implementation of a mpi-map which takes a functor accepting only parameters - I do think of this thing like a parallel version of a jacobian function applied to a large set of parameter vectors.
an implementation which takes parameters and data will be included as well. This stan-math implementation will always retransmit the data to the childs and will be based on the code of (1)
the stan-math implementation will include a mpi_commander object (or however we want to call it) which will control the program flow on the childs. That is, this thing will wait for work chunks from the root process and when deconstructed it will free all mpi resources.
after that we start with stan which will expose the functionality as discussed; by that I mean it will allow for static data which is loaded on the childs and the static data will never be sent over mpi.

So unless someone objects against starting with a parameter only mpi map, I will start with that. I know we discussed to go with a parameter and data version in our last meeting, but from a design perspective I find this approach more appealing (if things work out as I have in my mind).

At this stage its early to write an issue. My suggestion is to start a new branch where I put in draft code which nail down the interfaces, but the functions themselves are only filled with pseudo code. Once that is up, it would be great to have discussion to settle on this design to start the actual coding.

Does that sound like a plan to move forward?

Sebastian

Bob_Carpenter · August 7, 2017, 11:00pm

I think I understand, but if you could write out the signature of the Stan math function, it would help make sure we’re on the same page.

If there’s just one big set of fixed data, then you can always encode it in the functor.

Do you see some version of this function being exposed to Stan programs?

Mitzi’s almost done with the data-argument-only constraints so all this will work with functions with appropriate compile-time type checking.

wds15 · August 8, 2017, 7:08am

The intent of the prototype branch is to spell out first the function signatures only and put dummy code into it. Writing this is easier for me with git/emacs instead of this discourse…but I can link those files to this thread, of course.

A always data transmit version can be useful to have in Stan as well, since there would be no restrictions as to where this thing is called (data can be anything). I would still prefer to put resources on the more flexible static data only version we discussed above. So unless it is really straightforward to expose this to the Stan language, I would save our time for the flexible static data thing. BTW, great to know that Mitzi is laying the foundation for this!

One more point: In the non-static data variant I think we have the choice between

to sent always all data to all childs and give the supplied user function the batch index of the processed parameter set
or we split the data like the parameters up (requring ragged arrays or rectangular array shapes) and then we do not need to pass the batch index to the user functor

Option 1 would be consistent with what we have planned for the static data case. Option 2 should be more efficient since less data is being transferred over MPI, since each child only recieves the Nth part of the data (if N is the number of jobs). Any thoughts or preference here?

Ah, and one last point to settle one: Should we use vectors and matrices (Eigen stuff) or arrays (vectors)? I think I always put down function signatures with arrays, since this is what we have done for ODEs - but your dummy prototypes always used Eigen stuff. I don’t mind either way, but if we could agree on that now that would be good - so arrays or Eigen?

Bonus implementation question if the answer is Eigen: Is it OK to still use Eigen data structures internally which would make my life simpler in implementing this…or should I minimize dependencies and then stay away from Eigen in case we go with an array function signature?

wds15 · August 9, 2017, 7:46pm

Thinking through it I ended up drafting a map_rect_mpi function which takes rectangular parameters and rectangular data as input. Here is the dummy code:

github.com

stan-dev/math/blob/feature/concept-mpi/stan/math/prim/arr/functor/mpi_command.hpp

#pragma once

#include <boost/serialization/serialization.hpp>
#include <boost/serialization/access.hpp>
#include <boost/serialization/base_object.hpp>
#include <boost/serialization/export.hpp>
#include <boost/serialization/shared_ptr.hpp>

/* define virtual class which gets send around by mpi. The run method
   contains the actual work to be executed.
 */

namespace stan {
  namespace math {
    
    struct mpi_command {
      friend class boost::serialization::access;
      template<class Archive>
      void serialize(Archive & ar, const unsigned int version) {}
      virtual void run() const = 0;

This file has been truncated. show original

https://github.com/stan-dev/math/blob/feature/concept-mpi/stan/math/prim/arr/functor/mpi_worker.hpp

github.com

stan-dev/math/blob/feature/concept-mpi/stan/math/rev/arr/functor/map_rect_mpi.hpp

#pragma once

#include <vector>
#include <set>

#include <stan/math/prim/mat/fun/to_array_1d.hpp>
#include <stan/math/prim/arr/functor/mpi_command.hpp>
#include <stan/math/prim/arr/functor/mpi_cluster.hpp>

namespace stan {
  namespace math {

    namespace internal {
            
      template <typename F>
      class distributed_map_rect {
        boost::mpi::communicator world_;
        const std::size_t R_ = world_.rank();
        const std::size_t W_ = world_.size();

This file has been truncated. show original

Feedback would be great.

This what we have discussed as I understood. My recommendation would be now to program this into a prototype and rerun the performance test I did. The old performance test used a static data concept and avoided resending data and I think knowing the price of resending data should be known.

Bob_Carpenter · August 11, 2017, 11:02pm

Wasn’t there a way to do this without using the generic Boost serialization? I think the shapes are all known for what’s being sent.

It looked like using the serialization framework was slow in their evaluations (I very much appreciated the way Java handled serialization—it could be generic or hand-coded depending on how fast you wanted it to go).

wds15 · August 12, 2017, 2:17pm

Cool. Prototype runs (I just pushed to math)!

Tomorrow I will try to setup a benchmark along the lines of my first toy example.

BTW, currently I plan to make the prim version of the map a serial only implementation to begin with.

It would be good if we can refactor the code such that we don’t need to copy so much code around when we want to mpi-fy something. Let’s see.

Sebastian

wds15 · August 13, 2017, 1:15pm

Do you think that we should maybe use hashes to detect if some chunk of data has already been send?

The idea is to cache data which has been send and use hashes to detect already send data. This way we end up with a send-once data version.

Anyway, I spend time on refactoring and ended up with a version which can now be used rather flexibly. That is, the current version should allow to easily let sent a data block once and then retransmit new parameters.

Bob_Carpenter · August 14, 2017, 12:18pm

Can’t we just check the memory location? If it’s data or transformed data, that’s not going to change. If it’s not, we probably just want to resend it.

Bob_Carpenter · August 14, 2017, 12:19pm

I think we want serial only versions of everything. Then you can drop in the parallel versions seamlessly where they seem most useful. We do that in some other autodiff settings, I think.

wds15 · August 14, 2017, 12:59pm

For the first stan-math only implementation it will be hard to handle the static data case. Since stan-math cannot assume that clever things are put in place, we have to treat data as varying from call-to-call. Calculating hashes should be fairly cheap vs resending (I will try that out) and could be a nice solution.

In my head we are planning for a basic stan-math version with rectangular shapes as a start and then we are going to add another version which uses static data only and takes a variable amount of static data as input as discussed above. Right?

Bob_Carpenter · August 14, 2017, 1:18pm

My guess is that hashing will cause a noticeable slowdown, but it may still be a win. The problem with sending data is going to be partly latency but we have to pay the latency anyway to send the parameters; then it’s a matter of whether data throughput is faster than hashing throughput and that’ll depend on hash algorithm and on network capacity.

Right.

wds15 · August 14, 2017, 1:23pm

Hmm… could we ask that functors being passed to the map to specify as a type trait if the data can be assumed static?

Anyway, I will concentrate on a good first prototype without too many bells and whistles.

Maybe there is an easy way I don’t see yet to deal with static data given the restrictions of a stan-math only function.

Bob_Carpenter · August 14, 2017, 2:39pm

Sounds like that would work.

What we’re wrestling with is that we really want to be able to send the data once, then keep calling the function. How about trying to tackle that directly?

What I’m thinking about is something like a base functor that MPI functors would be required to extend.

struct base_MPI_functor {
  // real data, ragged
  const VectorXd x_r;
  const vector<int> ends_r;

  // integer data, ragged
  const vector<int> x_i;
  const vector<int> ends_i;

  // constructor, called once to set up constant data
  foo(const VectorXd& x_r, const vector<int>& ends_r, 
      const vector<int>& x_i, const vector<int>& ends_i) {
    ... ships data to MPI ...
  }

  template <typename T>
  virtual Matrix<T, -1, 1> operator()(const Matrix<T, -1, 1>& theta) const
    = 0;
}

The MPI workers need to then get the data and reconstruct the functor. How does this all get compiled? Then when the functor gets called, theta gets passed, evaluated, and things get passed back.

I haven’t thought about this as much as you, so there might be a huge hole in my reasoning here.

wds15 · August 14, 2017, 3:13pm

Hmmm… need to think about your approach.

I different thought: We could ask the stan-math user to provide a hash for the data. If the user likes to calculate a hash every time, then he can do that.

For stan programs it would mean that we can detect static data and if that’s the case, then we just generate a fixed hash at Stan program compile time.

I think that would work - and we would end up with a map_rect_mpi function which can cache the data on the local workers. It would be the users responsibility to give valid data+hash combinations. Is that OK?

Bob_Carpenter · August 14, 2017, 3:37pm

Why do that rather than just sending the data? Or is the idea that with a new hash the data would change?

wds15 · August 14, 2017, 4:01pm

I think I have a design idea which would solve our problem:

one calls a function"register_mpi_call" which you give the real and int data. This distributes the data to the workers and the function returns a hash value.
the actual mpi map is called and we pass in the hash one has obtained is step 1

The step 2 can be called as often as we wish given that the data has not changed and the shapes of the parameter block is the same each time (since we can make that part of the hash). If the user calls in step 2 the facility with an invalid hash, then we can detect that because function 1 wasn’t called.

This way we will only calculate once a hash and can take advantage of local caching of the data. Moreover, we will be able to detect if the user makes a mistake in using the interface.

Bob_Carpenter · August 14, 2017, 4:23pm

OK, that also soves the fundamental problem of splitting the registration of data and calling the function repeatedly.

Rather than a hash, we can just give the data a name. All this kind of naming and hashing is problematic. Hashing mainly because you need to be careful to have a test in case of hash confilcts (or take your very good odds of not getting a conflict if you use a rich enough hash).

wds15 · August 14, 2017, 5:29pm

Yes, all I need is unique identifier. Ideally that uid is represented by an int which is cheap to transmit.

Could we have a singleton which sits there in order to

generate uid which is a simple integer which is increased upon each call of the register function
sends out the data and stores it locally

Is an int name OK?

This is getting quite nice!

wds15 · August 14, 2017, 9:00pm

I think I got this idea working. So code will roughly look like this:

  const std::size_t uid = stan::math::internal::distribute_map_rect_data(theta[0].size(), x_r, x_i);

  hard_work f;

  vector<stan::math::var> res = stan::math::map_rect_mpi(f, theta, uid);

So uid is an int is expected to name all shapes and the data. Internally each call to the distribute function will increase a counter which makes the number unique. Once the data is distributed it can be use as many times as you wish to. You will simply have to pass in the right uid. The map_rect_mpi function can to some limited checking (like checking that the number of parameters was the same or the number of parameter sets).

The prototype is now half-way through using this data distribution approach, so not yet ready.

Topic		Replies	Views
Parallel reduce in the Stan language Developers	12	1170	April 11, 2019
Stan SIMD & Performance Algorithms	23	4020	January 23, 2020
Map_rect & data for the case of MPI Developers	3	530	January 23, 2020
Stan++/Stan3 Preliminary Design Developers	97	4496	June 12, 2018
Parameter packs for (de)serialization Developers	20	4269	August 20, 2018

MPI Design Discussion

Related Topics