What I’m looking for is what the Stan code is going to look like, not the C++. I want to see how the user writes something that accesses data into all this. Then I think I’ll be better able to think about what we need to support it. As is, I don’t get this `binary_data_function` thing at all.
MPI Design Discussion
Hi Bob!
The `binary_data_function` struct is just a helper object (as is the `mapped_functor`). It serves to communicate what is to be done using type-only information. A lot of this extra code needs to be generated by the Stan compiler. Here is a Stan program as it could look:
```stan
functions {
  real hard_work(real param, real[] bar, int[] ind) {
    return(param + bar[1]);
  }
}
data {
  // note: the order of the declared data later becomes the index used to
  // access the data in the global and static tuple
  real foo;
  real bar[3];
  int ind[2];
}
transformed data {
}
parameters {
  real theta[1000];
}
transformed parameters {
  real stuff_for_model[1000];
  stuff_for_model = mpi_map(hard_work, theta, bar, ind);
}
model {
  theta ~ normal(0, 1);
}
```
There are a few (over)simplifications here. Since we pass the whole data into each evaluation for a parameter, we should also pass the job id (1 to 1000) as an integer to the hard_work function.
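As a sketch of that job-id idea, here is what a serial stand-in for `mpi_map` could look like in C++ (all names here are hypothetical illustrations, not actual generated code):

```cpp
#include <vector>

// Hypothetical user function with the extra job id argument;
// job ids run from 1 to N, as in the discussion above.
double hard_work(int job, double param,
                 const std::vector<double>& bar,
                 const std::vector<int>& ind) {
  (void)job; (void)ind;    // a real function would slice its data by job id
  return param + bar[0];   // mirrors `return(param + bar[1])` (Stan is 1-based)
}

// Serial stand-in for mpi_map: calls the user function once per parameter,
// threading the job id through.
std::vector<double> map_with_job_id(const std::vector<double>& params,
                                    const std::vector<double>& bar,
                                    const std::vector<int>& ind) {
  std::vector<double> out;
  out.reserve(params.size());
  for (std::size_t i = 0; i < params.size(); ++i)
    out.push_back(hard_work(static_cast<int>(i) + 1, params[i], bar, ind));
  return out;
}
```
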
I hope this makes it clearer where I am heading. I think all the pieces are now there to do it.
I want to summarize where we’re at in terms of proposals.
For each of the three concrete proposals, I provide the signature of the `map` function as it would look from the Stan language and the definition of the signature for the function argument `f`.

In all cases, the rough structure is a map-reduce that follows the actual map function being defined in the Stan language:

- a root process (rank 0)
  - sends operands to child processes
  - receives values and derivatives back from child processes
  - integrates values and derivatives into the expression graph
- child processes (rank > 0)
  - receive operands from the root process
  - calculate values and derivatives
  - send values and derivatives back to the root process
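Ignoring the actual MPI calls, the root/child exchange above reduces to the following serial skeleton (names and types are made up for illustration; in Stan math the derivatives would come from nested autodiff, and the direct calls here would be MPI sends/receives):

```cpp
#include <functional>
#include <vector>

// What a child sends back for one job: the value and the gradient
// with respect to its parameter vector.
struct job_result {
  double value;
  std::vector<double> gradient;
};

// Child side: receive operands, compute value and derivatives.
// The derivative comes from a caller-supplied gradient functor here;
// in Stan math it would come from nested autodiff.
job_result child_compute(
    const std::vector<double>& theta,
    const std::function<double(const std::vector<double>&)>& f,
    const std::function<std::vector<double>(const std::vector<double>&)>& grad_f) {
  return {f(theta), grad_f(theta)};
}

// Root side: "send" each operand block and collect the results.
// Integrating the gradients into the expression graph needs Stan math's
// vari machinery and is omitted here.
std::vector<job_result> root_map(
    const std::vector<std::vector<double>>& thetas,
    const std::function<double(const std::vector<double>&)>& f,
    const std::function<std::vector<double>(const std::vector<double>&)>& grad_f) {
  std::vector<job_result> results;
  for (const auto& theta : thetas)
    results.push_back(child_compute(theta, f, grad_f));
  return results;
}
```
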
Rectangular, retransmit data
```stan
vector[] map(F f, vector[] thetas,
             vector[] xs_r, int[,] xs_i)

vector f(vector theta, vector xs_r, int[] xs_i)
```
PRO
- function is simple and encapsulated
- clean math library implementation
- doesn’t go beyond what already exists in the Stan language
- could use standalone function implementations as-is for the workers (non-root)

CON
- retransmits data
- rectangular only, so may require ugly and expensive-to-transmit padding
Ragged, retransmit data
```stan
vector map(F f,
           vector thetas, int[] theta_ends,
           vector x_r, int[] x_r_ends,
           int[] x_i, int[] x_i_ends)

vector f(vector theta, vector xs_r, int[] xs_i)
```
PRO
- all of the pros of the rectangular version
- function is the same, so still encapsulated
- allows arbitrary raggedness (user must know ragged result bounds)

CON
- retransmits data
- awkward raggedness without built-in ragged structures
- return sizes implicit; could add an `int[] result_ends` argument for anticipated sizes and check that the results match
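For concreteness, the `_ends` convention used above can be unpacked as in this sketch (assuming `ends[i]` holds the one-past-the-end index of job i’s slice in the flat array; the helper name is made up):

```cpp
#include <vector>

// Split a flat vector into ragged slices using an ends array:
// slice i is the half-open range [ends[i-1], ends[i]), with ends[-1] taken as 0.
std::vector<std::vector<double>> unpack_ragged(const std::vector<double>& flat,
                                               const std::vector<int>& ends) {
  std::vector<std::vector<double>> slices;
  int start = 0;
  for (int end : ends) {
    slices.emplace_back(flat.begin() + start, flat.begin() + end);
    start = end;
  }
  return slices;
}
```
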
Rectangular, Fixed Data
The data variables `x_rs` and `x_is` must be the names of `data` or `transformed data` variables, so that they can be loaded once and reused.
```stan
vector[] map(F f, vector[] thetas, vector[] x_rs, int[,] x_is)

vector f(vector theta, vector x_r, int[] x_i)
```
This one has a map that just sends the data in once, but `x_rs` and `x_is` must be the names of variables in the data/transformed data block. Each child process will need to load the data from the model, but then only needs to hang on to its own slice, `x_rs[k]` and `x_is[k]`. Then the function can be taken from the standalone function compilation.
PRO
- reads data once on each child process
- easier to implement than a closure
- easy implementation without MPI
- should perform better than transmitting the data each time

CON
- has to instantiate all of the model’s data on each process, even though it only needs a slice
  - back of the envelope, this should be OK for PK/PD apps
  - might not be so OK for distributing big regressions
- function `f` needs to know how to grab its slice of data from the index
Could generalize this to ragged in the same way as before.
Closure, send data once
```stan
vector map(F f,
           vector thetas, int[] theta_ends)

vector f(vector theta, int fold)
```
PRO
- data closure means no data arguments
- could be a big win in reducing communication (latency is still there in the calls, plus overhead for waits)

CON
- requires functions that are closures over data
- requires a major extension to the Stan language parser and code generator, plus a lot of doc explaining them
  - generate as member functions rather than static
  - might like to have these in the language anyway
- could require too much memory per worker, as each worker gets everything in the instantiated model
I read the Boost MPI interface and think I finally see where Sebastian’s coming from with all of this. I updated the design discussion in the previous post to include what I think is Sebastian’s preferred approach, which is the rectangular/fixed data. We could also do it ragged.
It doesn’t seem like this will require too much work to implement and it’s relatively clean for the language.
What it doesn’t do is provide a neat way to build this on top of a Stan math library function. At least I don’t see how to do that.
That was a long meeting yesterday, but very productive.
For stan-math there is a neat way with closures, I think.
We decided to create an object, part of stan-math, which will wait for those command-type objects and execute them. All functors will be run on the child in the context of that class. Let’s call this the `waiter` object.
We could then place our static data as a C++ tuple into the waiter object. The waiter object can then make the static data available to the called functor via the tuple which contains the static data on the nodes.
The upside of this approach is that no data ever gets transmitted, as we can assume that the data is read in locally at the child. Hence the data can be anything and have any complexity, since we never transmit it. The function signatures would be

```stan
vector map(F f,
           vector thetas, int[] theta_ends, ...static data...)

vector f(vector theta, int fold, ...static data...)
```
The static data part would be the same on all nodes and the same during all calls of the mapped function. This design would give users the greatest flexibility, as we do not force them to put everything into a plain array; rather, one can just use all declared data in whatever shapes are convenient to write down the problem. With C++11 we could use parameter pack template arguments to make this sort of thing work.
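A rough sketch of how parameter packs could carry arbitrary static data through to the user functor (hypothetical names; shown with C++17’s std::apply to keep it short, where a pure C++11 version would use a hand-rolled index sequence):

```cpp
#include <tuple>
#include <vector>

// Holds arbitrary static data as a tuple and applies a functor to each
// parameter, passing the static data along unchanged.
template <typename... Data>
class static_data_map {
  std::tuple<Data...> data_;  // read locally on each node, never transmitted

 public:
  explicit static_data_map(Data... data) : data_(data...) {}

  template <typename F>
  std::vector<double> operator()(F f, const std::vector<double>& thetas) const {
    std::vector<double> out;
    for (std::size_t i = 0; i < thetas.size(); ++i)
      out.push_back(apply_one(f, thetas[i], static_cast<int>(i)));
    return out;
  }

 private:
  template <typename F>
  double apply_one(F f, double theta, int fold) const {
    // Expand the static-data tuple into trailing arguments of f.
    return std::apply(
        [&](const Data&... d) { return f(theta, fold, d...); }, data_);
  }
};
```
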
I am sorry, but I missed yesterday that the data can be assumed to be already on the node, and as such it can be complicated. In a way this introduces a stateful function, because that static data needs to be available. BUT stan-math, when used with MPI, is a stateful thing anyway, because that `waiter` object must be in place. As such it’s a small step to expect that static data is already loaded on the worker.
In short: stan-math becomes stateful anyway by going MPI, so we can leverage that to get the static data. This will make the function so much easier to use for Stan and stan-math users.
Sebastian
Is `...static data...` going to be a sequence of `data` or `transformed data` variable names?
I take it the idea is to then generate a class like the model class in which those variable names are instantiated with actual data from the member variables rather than being passed as arguments. And then `f` wouldn’t actually be passed those arguments, it’d just access them locally.
That probably wouldn’t be too hard to generate. So if we had something like

```stan
map(foo, thetas, theta_ends, x, y, N);
```

then we’d generate a functor class like
```cpp
template <typename T>
Matrix<T, -1, 1> foo(const Matrix<T, -1, 1>& theta, const MatrixXd& x,
                     const VectorXd& y, int N) {
  // ... code generated for body of foo ...
}

struct mapped_functor {
  MatrixXd x;
  VectorXd y;
  int N;
  template <typename T>
  Matrix<T, -1, 1> operator()(const Matrix<T, -1, 1>& theta) const {
    return foo(theta, x, y, N);
  }
};
```
And then the mapped functor is the thing that actually gets called in the child main processes. So it would require the main program for the child to create the mapped functor.
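A minimal serial sketch of that child-side setup (all names hypothetical): the data is loaded locally, the functor is constructed once, and then it serves incoming parameters:

```cpp
#include <vector>

// Stand-in for the generated functor above: captures the data once and
// exposes a parameter-only call.
struct mapped_functor_sketch {
  std::vector<double> x;  // loaded from the model's data block on this child
  double operator()(double theta) const { return theta * x[0]; }
};

// Child main loop sketch: construct the functor once from locally loaded
// data, then evaluate it for each incoming parameter (a plain vector stands
// in here for MPI receives).
std::vector<double> child_serve(const mapped_functor_sketch& f,
                                const std::vector<double>& incoming_thetas) {
  std::vector<double> replies;
  for (double theta : incoming_thetas) replies.push_back(f(theta));
  return replies;
}
```
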
P.S. Stan math already is stateful: autodiff is stateful and its state is held in a global singleton. Also, any uses of the LDLT types are stateful in that they hold onto data and/or solutions.
YES!
That is exactly what I meant. The `...static data...` are data arguments declared in the data or transformed data block.
This design would give users the easiest way to facilitate parallel computation. Making it possible to pass through arbitrary user-defined data structures will allow users to write down their model in a very natural way; users aren’t forced to cram everything into those rigid int/real arrays.
I think this is the design we should go for. This could also be the design pattern we put into stan-math, but maybe we can easily support both cases: the static-data-only case and the retransmit-data case.
Sebastian
OK, I think we’re in agreement and I think it’s possible.
If the design works out, we might be able to use the Stan math function for the Stan language. I still don’t quite see how that would fit together, as the functors are getting configured out-of-process and I don’t see how to control that from within Stan.
I see two solutions for Stan to solve this out-of-order creation issue of the functors: either we make these mapped functors default constructible by using a singleton design, or the `waiter` object holds an instantiated version of the mapped_functor which already contains the user data. This last option could be implemented with a fusion map, for example (the functor type is the key of the fusion map).
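The first option (a default-constructible mapped functor backed by a singleton) might look like this sketch (hypothetical names):

```cpp
#include <vector>

// Singleton holding the static data; each process initializes it once from
// locally read data, after which the functor can be default constructed.
struct data_registry {
  std::vector<double> x;
  static data_registry& instance() {
    static data_registry r;  // one per process
    return r;
  }
};

// Default constructible because it pulls its data from the singleton
// rather than from constructor arguments.
struct mapped_functor {
  double operator()(double theta) const {
    return theta + data_registry::instance().x[0];
  }
};
```
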
For stan-math I am thinking about a mechanism where the map implementation itself deals only with parameters, and for the case with data we provide a mapped functor which transmits data instead of using the local one.
I guess we should get active again. After putting some more thought into it, I suggest starting with an MPI implementation which lives in stan-math, and proceeding as follows:

1. implementation of an MPI map which takes a functor accepting only parameters; I think of this as a parallel version of a jacobian function applied to a large set of parameter vectors
2. an implementation which takes parameters and data will be included as well; this stan-math implementation will always retransmit the data to the children and will be based on the code of (1)
3. the stan-math implementation will include an mpi_commander object (or however we want to call it) which will control the program flow on the children; that is, this thing will wait for work chunks from the root process, and when destructed it will free all MPI resources
4. after that we start with Stan, which will expose the functionality as discussed; by that I mean it will allow for static data which is loaded on the children, and the static data will never be sent over MPI

So unless someone objects to starting with a parameter-only MPI map, I will start with that. I know we discussed going with a parameter-and-data version in our last meeting, but from a design perspective I find this approach more appealing (if things work out as I have them in my mind).
At this stage it’s early to write an issue. My suggestion is to start a new branch where I put in draft code which nails down the interfaces, but the functions themselves are only filled with pseudo-code. Once that is up, it would be great to have a discussion to settle on this design before starting the actual coding.
Does that sound like a plan to move forward?
Sebastian
I think I understand, but if you could write out the signature of the Stan math function, it would help make sure we’re on the same page.
If there’s just one big set of fixed data, then you can always encode it in the functor.
Do you see some version of this function being exposed to Stan programs?
Mitzi’s almost done with the data-argument-only constraints, so all this will work with functions with appropriate compile-time type checking.
The intent of the prototype branch is to first spell out only the function signatures and put dummy code into them. Writing this is easier for me with git/emacs than with this Discourse… but I can link those files to this thread, of course.
An always-transmit-data version can be useful to have in Stan as well, since there would be no restrictions on where this thing is called (data can be anything). I would still prefer to put resources into the more flexible static-data-only version we discussed above. So unless it is really straightforward to expose this to the Stan language, I would save our time for the flexible static data thing. BTW, great to know that Mitzi is laying the foundation for this!
One more point: in the non-static-data variant I think we have the choice between

1. always sending all data to all children and giving the supplied user function the batch index of the processed parameter set, or
2. splitting the data up like the parameters (requiring ragged arrays or rectangular array shapes), in which case we do not need to pass the batch index to the user functor.

Option 1 would be consistent with what we have planned for the static data case. Option 2 should be more efficient, since less data is transferred over MPI: each child only receives the Nth part of the data (if N is the number of jobs). Any thoughts or preference here?
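Option 2’s data split can be sketched like this (hypothetical names; either a block or a round-robin assignment works, round-robin shown here):

```cpp
#include <cstddef>
#include <vector>

// Option 2 sketch: split rectangular data row-wise so that child k only
// receives the rows for the jobs assigned to it (round-robin assignment).
std::vector<std::vector<std::vector<double>>> split_rows(
    const std::vector<std::vector<double>>& rows, int num_children) {
  std::vector<std::vector<std::vector<double>>> chunks(num_children);
  for (std::size_t i = 0; i < rows.size(); ++i)
    chunks[i % num_children].push_back(rows[i]);  // row i goes to child i mod K
  return chunks;
}
```
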
Ah, and one last point to settle: should we use vectors and matrices (Eigen stuff) or arrays (std::vector)? I think I have always put down function signatures with arrays, since this is what we have done for ODEs, but your dummy prototypes always used Eigen stuff. I don’t mind either way, but if we could agree on that now, that would be good. So arrays or Eigen?
Bonus implementation question if the answer is Eigen: is it OK to still use Eigen data structures internally, which would make my life simpler in implementing this… or should I minimize dependencies and stay away from Eigen in case we go with an array function signature?
Thinking it through, I ended up drafting a map_rect_mpi function which takes rectangular parameters and rectangular data as input. Here is the dummy code:
https://github.com/stan-dev/math/blob/feature/concept-mpi/stan/math/prim/arr/functor/mpi_worker.hpp
Feedback would be great.
This is what we have discussed, as I understood it. My recommendation would now be to program this into a prototype and rerun the performance test I did. The old performance test used a static data concept and avoided resending data; I think we should know the price of resending data.
Wasn’t there a way to do this without using the generic Boost serialization? I think the shapes are all known for what’s being sent.
It looked like using the serialization framework was slow in their evaluations (I very much appreciated the way Java handled serialization: it could be generic or hand-coded, depending on how fast you wanted it to go).
Cool. Prototype runs (I just pushed to math)!
Tomorrow I will try to set up a benchmark along the lines of my first toy example.
BTW, currently I plan to make the prim version of the map a serial-only implementation to begin with.
It would be good if we can refactor the code such that we don’t need to copy so much code around when we want to MPI-fy something. Let’s see.
Sebastian
Do you think we should maybe use hashes to detect whether some chunk of data has already been sent?
The idea is to cache data which has been sent and use hashes to detect already-sent data. This way we end up with a send-once data version.
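A sketch of that send-once idea (hypothetical names; the element-wise std::hash combine is just a stand-in for a real buffer hash, and a production version would have to consider collisions):

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <vector>

// Combine element hashes of a data block into one value (stand-in for a
// proper buffer hash such as CRC or xxHash).
std::size_t hash_block(const std::vector<double>& block) {
  std::size_t h = block.size();
  std::hash<double> hd;
  for (double v : block)
    h ^= hd(v) + 0x9e3779b9u + (h << 6) + (h >> 2);  // boost-style combine
  return h;
}

// Remembers which blocks have been "sent"; returns true only the first time
// a given block is seen, so the caller can skip retransmission afterwards.
class send_cache {
  std::set<std::size_t> sent_;

 public:
  bool needs_send(const std::vector<double>& block) {
    return sent_.insert(hash_block(block)).second;
  }
};
```
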
Anyway, I spent time on refactoring and ended up with a version which can now be used rather flexibly. That is, the current version should make it easy to send a data block once and then retransmit only new parameters.
Can’t we just check the memory location? If it’s data or transformed data, that’s not going to change. If it’s not, we probably just want to resend it.
I think we want serial only versions of everything. Then you can drop in the parallel versions seamlessly where they seem most useful. We do that in some other autodiff settings, I think.
For the first stan-math-only implementation it will be hard to handle the static data case. Since stan-math cannot assume that clever things are put in place, we have to treat data as varying from call to call. Calculating hashes should be fairly cheap vs. resending (I will try that out) and could be a nice solution.
In my head we are planning for a basic stan-math version with rectangular shapes as a start, and then we are going to add another version which uses static data only and takes a variable number of static data arguments as input, as discussed above. Right?
My guess is that hashing will cause a noticeable slowdown, but it may still be a win. The problem with sending data is going to be partly latency but we have to pay the latency anyway to send the parameters; then it’s a matter of whether data throughput is faster than hashing throughput and that’ll depend on hash algorithm and on network capacity.
Right.