MPI Design Discussion

After the discussion in the meeting yesterday, I think the easiest thing to do might be to allow users to define functions after the transformed data block that access data or transformed data. These would then be compiled to member functions of the model class, where they’d have read access to the data/transformed data member variables. Right now, our functions are purely standalone and have no access to anything. There would then be no explicit passing of data arguments; the parameters would still get passed.

data {
  int N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  vector[N] x_std = (x - mean(x)) / sd(x);
}
functions {
  vector foo(real y) {
    vector[N] z;   // grabs N out of data block
    for (n in 1:N) z[n] = n * y;
    return z;
  }
}
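
On the C++ side, the generated code might then look roughly like this (just a sketch with made-up names, not actual generator output), with the user-defined function emitted as a member function that can read the data members directly:

#include <Eigen/Dense>

// Sketch only: the model class holds data/transformed data as members and the
// user-defined function foo() is generated as a member function.
class example_model {
  int N_;                  // data
  Eigen::VectorXd x_;      // data
  Eigen::VectorXd y_;      // data
  Eigen::VectorXd x_std_;  // transformed data

 public:
  template <typename T>
  Eigen::Matrix<T, Eigen::Dynamic, 1> foo(const T& y) const {
    Eigen::Matrix<T, Eigen::Dynamic, 1> z(N_);  // grabs N out of the data member
    for (int n = 1; n <= N_; ++n)
      z(n - 1) = n * y;
    return z;
  }
};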

Going even further, we could use closures from within the execution of the model itself to grab parameters, but I don’t see how we could pass that around.

How would the Stan program and function definition look for what you’d like to do? I can’t quite see what the concrete syntax would look like to the Stan programmer.

I think we should have this even if there were no MPI.

I was saying that the function which we give to map (and which runs in parallel) must not call any _rng function in order to warrant exact reproducibility. Otherwise we face the issue of guaranteeing exact reproducibility in a distributed program where the random number generators run on different machines, which is no fun to synchronize.

What you describe, Bob, sounds like a very good solution in this setting to me. This would also solve the trouble of ragged/rectangular data, since that is pushed to the user, who gets to access all data on each worker; we would only pass the job index and the per-run parameter set into the function running in parallel.

I will put together a prototype of my idea today. It will only involve special internal handling of data arguments within the map function. The user functions stay as they are.

Ok, so here it goes:

The main trick is to encode, as a type, the functor to call (the user function) and, as integer template arguments, the positions of the data in the static_data tuple.

In short: all data and transformed data must be thrown into a tuple. The declaration order defines the position in that tuple and at the same time serves as the key to access it later. The binary_data_function struct glues together which user functor to use and which positions in the static data object we need.

I think this is a very sweet and very lightweight (no singletons) implementation. In essence, the static data of the program is put into a tuple (a storage holder, so to speak, generated at Stan-program compile time). Then tuple positions can be used as type information to just do the right thing.
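
To make that a bit more concrete, here is a stripped-down sketch of the shape I have in mind (illustrative only, with standard tuples instead of the real plumbing):

#include <tuple>
#include <vector>

// All data and transformed data collected into one tuple; declaration order
// gives each item its position.
typedef std::tuple<double, std::vector<double>, std::vector<int> > static_data_t;
static static_data_t static_data;

// Glue type: F is the user functor, I1 and I2 are the positions of its two
// data arguments inside static_data (assumes F returns the scalar type T).
template <typename F, int I1, int I2>
struct binary_data_function {
  template <typename T>
  T operator()(const T& theta) const {
    return F()(theta, std::get<I1>(static_data), std::get<I2>(static_data));
  }
};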

The approach is flexible in that it should be possible to allow an arbitrary number of arguments if we wanted to. This will put a lot of work onto the Stan compiler to generate the right C++ code. The best thing is that the awkward handling of data is completely internal to the map function, so the user functions stay as they used to be. All we need is special code generation once we see a map function.

I do think that the approach of allowing a functions block after the transformed data block is more elegant. My proposal would use the language as is, but getting the code-generation part done is somewhat tedious, as a number of definitions need to be put in place.

I don’t see how this is necessary, as the RNGs in the transformed data block would be evaluated before any data are passed to the nodes.

Ok, so if an MPI-enabled function in the generated quantities block is sent to the nodes, then I do not want the nodes to be allowed to call an RNG, ever. If these functions could call an RNG, then the state of the random number generator would need to be organized asynchronously across the nodes.

But RNG use already isn’t allowed inside user-defined functions (unless they end in _rng). If we’re limiting the application to functors defined via user-defined functions, then how do the generated quantities become relevant?

Ok, then we limit the map function to only take user functions which do not end in _rng. That would avoid any problem.

I figured out a way which may allow us to let the user specify an arbitrary number of static data arguments to the map function. The idea is to use Boost Fusion to hold the data and to use boost::mpl vectors of boost::mpl::int_ types to encode what I want to be done. Here is the code, which is almost complete; the only thing missing is to iterate over the boost::mpl::vector.
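
Roughly, the encoding looks like this (a toy sketch of the idea, not the almost-complete code itself):

#include <boost/fusion/include/vector.hpp>
#include <boost/fusion/include/at_c.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/mpl/at.hpp>
#include <boost/mpl/int.hpp>
#include <vector>

// The static data lives in a Boost Fusion vector ...
typedef boost::fusion::vector<double, std::vector<double>, std::vector<int> >
    static_data_t;

// ... and an mpl::vector of mpl::int_ records, purely in the type system,
// which slots a given user function should be handed.
typedef boost::mpl::vector<boost::mpl::int_<1>, boost::mpl::int_<2> > data_pos;

// A slot can then be recovered from type information alone, e.g.
//   boost::fusion::at_c<boost::mpl::at_c<data_pos, 0>::type::value>(data)
// yields the std::vector<double> sitting in position 1 of the data tuple.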

Should we get this to work, it would fit into our current language as-is and would give the user full flexibility to access all data declared in data or transformed data.

This C++ metaprogramming is really magic, I have to say.

We could put an implementation of an MPI map into stan-math which only deals with functors which do not take any data, but only parameters.

From the stan language we would generate function objects which take the data the user wants to use from a globally defined variable, a tuple holding the user-defined data (and transformed data). I know it’s not too nice to have globally defined data, but this would solve our problems. Here is a prototype of that idea. Note that the functors which execute the command (the run() method) do grab the actual data from the global space:
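
(Shown here only as a stripped-down sketch; the actual prototype has more boilerplate around it.)

#include <tuple>
#include <vector>

// Global tuple holding the user's data and transformed data, filled once per
// process; the generated command object grabs from it inside run().
typedef std::tuple<double, std::vector<double>, std::vector<int> > static_data_t;
static_data_t static_data;

struct mapped_job {
  double theta;  // parameters are still transmitted per job

  double run() const {
    // data arguments come from the global tuple, not from MPI messages
    const std::vector<double>& bar = std::get<1>(static_data);
    return theta + bar[0];  // stand-in for the real user function body
  }
};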

So we put MPI stuff into stan-math and all static data matters into stan. That does make sense to me as stan handles the user data and stan-math the computation bit.

Right, they can’t be _rng or _lp functions (those require the generated quantities RNG and the model target accumulator respectively).

What I’m looking for is what the Stan code is going to look like, not the C++. I want to see how the user writes something that accesses data in all of this. Then I think I’ll be better able to think about what we need to support it. As is, I don’t get this binary_data_function thing at all.

Hi Bob!

The binary_data_function struct is just a helper object (as is the mapped_functor). It serves the purpose of communicating what is to be done using type-only information. A lot of this extra code needs to be generated by the Stan compiler. Here is a Stan program as it could look:

functions {
  real hard_work(real param, real[] bar, int[] ind) {
    return(param + bar[1]);
  }
}
data {
  // note the order of the declared data is later the index to access
  // the data in the global and static tuple
  real foo;
  real bar[3];
  int ind[2];  
}
transformed data {
}
parameters {
  real theta[1000];
}
transformed parameters {
  real stuff_for_model[1000];

  stuff_for_model = mpi_map(hard_work, bar, ind);
}
model {
  theta ~ normal(0,1);
}

There are a few (over)simplifications here. As we pass the whole data set into each evaluation for a parameter, we should also pass the job id (1 to 1000) as an integer into the hard_work function.

I hope this makes it clearer where I am heading. I think all the pieces are now there to do it.


I want to summarize where we’re at in terms of proposals.

For each of the three concrete proposals, I provide the signature of the map function as it would look from the Stan language and the definition of the signature for the function argument f.

In all cases, the rough structure is a map-reduce, with the actual function being mapped defined in the Stan language:

  • a root process (rank 0)

    • sends operands to child processes
    • receives values and derivatives back from child processes
    • integrates values and derivatives into the expression graph
  • child processes (rank > 0)

    • receive operands from root process
    • calculate value and derivatives
    • send value and derivatives back to root process

Rectangular, retransmit data

vector[] map(F f, vector[] thetas,
             vector[] xs_r, int[,] xs_i)

vector f(vector theta, vector xs_r, int[] xs_i)

PRO

  • function is simple and encapsulated
  • clean math library implementation
  • doesn’t go beyond what already exists in Stan language
    • could use standalone function implementations as is for the workers (non root)

CON

  • retransmits data
  • rectangular only, so may require ugly and expensive-to-transmit padding

Ragged, retransmit data

vector map(F f,
           vector thetas, int[] theta_ends,
           vector x_r, int[] x_r_ends,
           int[] x_i, int[] x_i_ends)

vector f(vector theta, vector xs_r, int[] xs_i)

PRO

  • all of the pros of the rectangular version
    • function is same, so still encapsulated
  • allows arbitrary raggedness (user must know ragged result bounds)

CON

  • retransmits data
  • awkward raggedness without built-in ragged structures
  • return sizes implicit
    • could add an int[] result_ends argument for anticipated sizes and check that results match

Rectangular, Fixed Data

The data variables x_rs and x_is must be the names of data or transformed data variables, so that they can be loaded once and reused.

vector[] map(F f, vector[] thetas, vector[] x_rs, int[,] x_is)

vector f(vector theta, vector x_r, int[] x_i)

This one has a map where the data only needs to be read in once, but x_rs and x_is must be the names of variables in the data/transformed data block. Each child process will need to load the data from the model, but then only needs to hang on to its own slice, x_rs[k] and x_is[k]. Then the function can be taken from the standalone function compilation.

PROS

  • reads data once on each child process
  • easier to implement than closure
  • easy implementation without MPI
  • should perform better than data transmission each time

CONS

  • has to instantiate all of the model’s data on each process, even though it only needs a slice
    • back of envelope, this should be OK for PK/PD apps
    • might not be so OK for distributing big regressions
  • function f needs to know how to grab its slice of data from the index

Could generalize this to ragged in the same way as before.

Closure, send data once

vector map(F f,
           vector thetas, int[] theta_ends)

vector f(vector theta, int fold);

PRO

  • data closure means no data arguments
    • could be a big win in reducing communication (latency is still there in calls, plus overhead for waits)

CON

  • requires functions that are closures over data
    • requires a major extension to the Stan language parser and code generator, plus a lot of doc explaining them
    • generate as member functions rather than static
    • might like to have in language anyway
  • could require too much memory per worker as each worker gets everything in the instantiated model

I read the Boost MPI interface and think I finally see where Sebastian’s coming from with all of this. I updated the design discussion in the previous post to include what I think is Sebastian’s preferred approach, which is the rectangular/fixed data. We could also do it ragged.

It doesn’t seem like this will require too much work to implement and it’s relatively clean for the language.

What it doesn’t do is provide a neat way to build this on top of a Stan math library function. At least I don’t see how to do that.

That was a long meeting yesterday, but very productive.

For stan-math there is a neat way with closures, I think.

We decided to create an object, part of stan-math, which will wait for those command-type objects and execute them. All functors will be run on the child in the context of that class. Let’s call this the waiter object.

We could then place our static data as a C++ tuple into the waiter object. The waiter object can then make the static data available to the called functor on each node.

The upside of this approach is that no data ever gets transmitted, as we can assume that the data is read in locally on the child. Hence, the data can be anything and have any complexity, since we never transmit it. The function signatures would be

vector map(F f,
           vector thetas, int[] theta_ends, ...static data ...)

vector f(vector theta, int fold, ...static data...);

The static data part would be the same on all nodes and it would be the same during all calls of the mapped function. This design would give users the greatest flexibility, as we do not force them to put everything into a plain array; rather, one can just use all declared data in whatever shapes are convenient for writing down the problem. With C++11 we could use parameter pack template arguments to make this sort of thing work.
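
A rough sketch of what I mean (made-up names, not actual stan-math code; I use C++14’s std::index_sequence here just to keep the sketch short):

#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// The waiter lives on each child, owns the static data as a tuple (read in
// locally, never transmitted), and hands it to whatever functor it runs.
template <typename... Ts>
class waiter {
  std::tuple<Ts...> static_data_;

 public:
  explicit waiter(Ts... data) : static_data_(data...) {}

  template <typename F>
  std::vector<double> run(const F& f, const std::vector<double>& theta,
                          int fold) const {
    return run_impl(f, theta, fold, std::index_sequence_for<Ts...>{});
  }

 private:
  template <typename F, std::size_t... Is>
  std::vector<double> run_impl(const F& f, const std::vector<double>& theta,
                               int fold, std::index_sequence<Is...>) const {
    // expand the stored static data as the trailing arguments of the functor
    return f(theta, fold, std::get<Is>(static_data_)...);
  }
};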

I am sorry, but I missed yesterday that the data can be assumed to be already on the node and as such it can be complicated. In a way this introduces a stateful function, because that static data needs to be available. BUT stan-math, when used with MPI, is a stateful thing anyway, because that waiter object must be in place. As such it’s a small step to expect that static data is already loaded on the worker.

In short: stan-math becomes stateful anyway by going MPI, so we can leverage that to get the static data. This will make the function much easier to use for stan and stan-math users.

Sebastian

Is ...static data... going to be a sequence of data or transformed data variable names?

I take it the idea is to then generate a class like the model class in which those variable names are instantiated with actual data from the member variables rather than being passed as arguments. And then f wouldn’t actually be passed those arguments, it’d just access them locally.

That probably wouldn’t be too hard to generate. So if we had something like

map(foo, thetas, theta_ends, x, y, N);

then we’d generate a functor class like

template <typename T>
Matrix<T, -1, 1> foo(const Matrix<T, -1, 1>& theta, const MatrixXd& x,
                     const VectorXd& y, int N) {
  ... code generated for body of foo ...
}

struct mapped_functor {
  MatrixXd x;   // data, held by the functor rather than retransmitted
  VectorXd y;   // data
  int N;        // data

  template <typename T>
  Matrix<T, -1, 1> operator()(const Matrix<T, -1, 1>& theta) const {
    return foo(theta, x, y, N);  // only the parameters vary per call
  }
};

And then the mapped functor is the thing that actually gets called in the child main processes. So it would require the main program for the child to create the mapped functor.
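
For the child side, I’d picture something along these lines (a sketch under assumptions; the real wiring through boost::mpi and the services layer would differ):

#include <boost/mpi.hpp>
#include <Eigen/Dense>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// Simplified stand-in for the generated mapped_functor above.
struct mapped_functor {
  MatrixXd x;
  VectorXd y;
  int N;
  VectorXd operator()(const VectorXd& theta) const {
    return theta;  // stand-in for the generated call to foo(theta, x, y, N)
  }
};

int main(int argc, char* argv[]) {
  boost::mpi::environment env(argc, argv);
  boost::mpi::communicator world;

  // Each child instantiates the model's data locally (dummies here) ...
  MatrixXd x = MatrixXd::Zero(2, 2);
  VectorXd y = VectorXd::Zero(2);
  mapped_functor f{x, y, 2};

  // ... then loops: receive thetas from rank 0, evaluate f, send results back.
  return 0;
}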

P.S. Stan math already is stateful: autodiff is stateful and its state is held in a global singleton. Also, any uses of the LDLT types are stateful in that they hold onto data and/or solutions.


YES!

That is exactly what I meant. The ...static data... are data arguments declared in the data or transformed data block.

This design would give users the easiest way to carry out parallel computations. Making it possible to pass through arbitrary user-defined data structures will allow users to write down their models in a very natural way, instead of being forced to cram everything into rigid int / real arrays.

I think that this is the design we should go for. This could also be the design pattern we put into stan-math, but maybe we can easily support both cases: the static-data-only case and the retransmit-data case.

Sebastian

OK, I think we’re in agreement and I think it’s possible.

If the design works out, we might be able to use the Stan math function for the Stan language. I still don’t quite see how that would fit together, as the functions are getting configured out-of-process and I don’t see how to control that from within Stan.