Parallelization (again) - MPI to the rescue!

I think this will work, but I wonder if we can’t improve the user experience of this a bit? That map call API makes very simple use cases look pretty ugly. No other language has such a convoluted parallel map call - is this because we’re working around some missing language features in Stan (ragged arrays / tuples?), or because we need to do something special with parameters vs. data for some reason inside this call? I think a simpler API making use of closures would be a lot better in the simpler case and would still work where you need these ragged arrays or to do special things for params. We could do something like C++ closures where we define a lambda that specifies which variables we need to ship and whether we need references or if copies are fine:

real[] data1;
real[] params;
int[] data1_ends;  // data1_ends[j] = last element of ragged segment j
auto f = [data1, params, data1_ends](int index) {
  real sum = 0.0;
  // segment `index` starts right after the previous segment's end
  int start = index == 1 ? 1 : data1_ends[index - 1] + 1;
  for (i in start:data1_ends[index]) {
    sum += data1[i] * params[index];
  }
  return sum;
};
lp += pmap(f, 1:N);

Or something like that?
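For comparison, here is roughly how that capture-list idea reads in actual C++, with a sequential stand-in for pmap (everything here is a hypothetical sketch, and it's 0-based where the Stan pseudocode is 1-based):

#include <vector>

// Hypothetical sequential stand-in for pmap: sums f(j) over j = 0..N-1.
template <typename F>
double pmap(F f, int N) {
  double lp = 0.0;
  for (int j = 0; j < N; ++j) lp += f(j);
  return lp;
}

int main() {
  std::vector<double> data1 = {1, 2, 3, 4, 5, 6};
  std::vector<double> params = {0.5, 0.25};
  std::vector<int> data1_ends = {3, 6};  // segment j covers [previous end, data1_ends[j])

  // The capture list spells out exactly what ships to each worker;
  // [&data1, params, data1_ends] would request a reference for data1 instead.
  auto f = [data1, params, data1_ends](int j) {
    int start = (j == 0) ? 0 : data1_ends[j - 1];
    double sum = 0.0;
    for (int i = start; i < data1_ends[j]; ++i)
      sum += data1[i] * params[j];
    return sum;
  };

  double lp = pmap(f, 2);  // two ragged segments
  return lp > 0.0 ? 0 : 1;
}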

If we’re working around language features and need to fix those first before we can come up with a good API, maybe we can avoid naming this one map and give it a more contextualized name (maybe pmap_ragged or something) so that when we’re able to write a map that looks like other languages’ we can just call that one map. I think it’s important to let this get in quickly, but let’s also leave ourselves a route to polish the idea later without ruining backwards compatibility.

I’d be happy to work on some underlying language features we might need here if that helps.

Ha, you mean an ugly name for an ugly function until we have all the language features needed to come up with a nice map function? Fine with me.

What is making our life difficult is the absence of a ragged data structure type. Having that would avoid those integer arrays for the parameters and the data. I am not sure how one could avoid the integer arrays of f_out though.

Your example with closures looks nice. What I can’t get my head around for now is how we can map this onto the Boost MPI scatter/gather commands, which we should feed with essentially contiguous chunks of memory for them to work efficiently. So the interface with boost::mpi is low-level.
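For concreteness, Boost.MPI’s pointer-based scatter/gather overloads work on contiguous buffers roughly like this (a minimal sketch with a made-up uniform chunk size, not the proposed implementation):

#include <boost/mpi.hpp>
#include <vector>

int main(int argc, char** argv) {
  boost::mpi::environment env(argc, argv);
  boost::mpi::communicator world;

  const int chunk = 4;  // elements per worker (assumed uniform here)
  std::vector<double> all_data;
  if (world.rank() == 0)  // only the root holds the full buffer
    all_data.assign(chunk * world.size(), 1.0);

  // scatter wants contiguous storage: each rank receives `chunk` doubles
  std::vector<double> local(chunk);
  boost::mpi::scatter(world, all_data.data(), local.data(), chunk, 0);

  double partial = 0.0;
  for (double x : local) partial += x;  // per-worker computation

  // gather the per-rank partial results back on the root
  std::vector<double> partials;
  boost::mpi::gather(world, partial, partials, 0);
  return 0;
}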

With the closure example I was imagining that each node would have all the data, but this might only make sense where each “node” is a forked process on the same machine that shares memory (such that no copying or transmission of data would have to take place in the read-only scenario). I’m not actually sure if MPI has this mode of functionality or if it always uses IPC to send data. And if it is always shipped, I’m not sure what the performance penalty would be and if it’d be worth a nice API.

I’ll talk to Bob about ragged arrays more…

The processes are completely separate and stuff is copied with IPC, I think. However, as all processes start in the same NFS shared directory, it is easy to set things up such that all processes have the same data up to the transformed data block (as you can simply let each worker run through that block).

Ragged arrays have been around as an idea for a while, but new language features are very costly to implement, as I understand it.

I see. Well, I’m in favor of getting what you have in sooner rather than waiting on costly language feature implementations (though perhaps with an ugly name, haha). It seems like the design you proposed requires no additions to the language, which makes it streamlined to add and will let users like you get results immediately.

Re: your questions: Is it more difficult to let there be multiple regions of parallelism? I don’t think we want to get into nested map calls but if there was one after another, would that make implementation more difficult?

Re: making sure data is constant / that blocks are appropriately distributed - this might lead us to the all-data-on-all-nodes approach, and shipping chunked params with each call?

Yes, what I have proposed should go well with our current language.

Oh, I never thought of nested maps; those are not doable (or at least we would have to map all inner maps to sequential versions). Going from a single map to multiple maps does make a difference, but probably not a big one.

Yes, all-data-on-all-nodes would be easiest. This is only applicable for the case when the transformed data is in scope of the map, I think.
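A rough sketch of that idea, assuming the data is already identical on every rank (e.g., each worker ran through transformed data itself), so only the small parameter chunks travel per evaluation (parallel_eval and the per-chunk work are made up):

#include <boost/mpi.hpp>
#include <functional>
#include <vector>

// All-data-on-all-nodes sketch: `data` is identical everywhere already,
// so each call only scatters parameter chunks and reduces the results.
double parallel_eval(const boost::mpi::communicator& world,
                     const std::vector<double>& data,   // identical on every rank
                     std::vector<double>& params) {     // sized everywhere; values matter on root
  const int chunk = static_cast<int>(params.size()) / world.size();
  std::vector<double> local_params(chunk);
  boost::mpi::scatter(world, params.data(), local_params.data(), chunk, 0);

  double partial = 0.0;
  for (double p : local_params)
    partial += p * data[0];  // placeholder for the real per-chunk work

  double total = 0.0;
  boost::mpi::reduce(world, partial, total, std::plus<double>(), 0);
  return total;  // meaningful on the root rank
}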

I am really looking forward to this addition. Then I can finally throw those 1.5K cores we have at Stan problems all at once.

Just fyi, I’m tracking this thread and thinking about it.

What does “index war” mean?

cool!

index war = the inconvenience for the user of collecting all those tedious indexes needed to code the ragged arrays

These ragged arrays are not nice to handle in Stan, and almost all my code looks ugly because of this… so I am used to that.

Got it. I’m fighting it too and know your pain.

It’d be easier with ragged arrays, so yes, it’s working around language features.

The other problem as you note is that we need to distinguish data from parameters in the ODE system because we need sensitivities (derivatives of output w.r.t. parameters).

And it’s not just pure ragged array problems—it’s that the arguments are really heterogeneous—they might be sizes, arrays, matrices, etc. And we’re forcing them all to be packed into a ragged array.

You see this pattern elsewhere, not just in Stan, when you have to call these higher-level functions. It’s what CVODES and Boost force us to do for the ODE solvers.
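To make that packing pattern concrete, here is a generic sketch (not Stan’s or CVODES’s actual interface) of heterogeneous arguments being flattened into one real array plus boundary indexes:

#include <vector>

// Generic pack/unpack pattern forced by flat-argument interfaces:
// heterogeneous arguments (here, two arrays) get squeezed into one real
// array plus an integer array of boundaries -- the "index war".
struct Packed {
  std::vector<double> reals;  // all real-valued arguments, concatenated
  std::vector<int> ends;      // ends[k] = one-past-the-end of argument k
};

Packed pack(const std::vector<double>& a, const std::vector<double>& b) {
  Packed p;
  p.reals.insert(p.reals.end(), a.begin(), a.end());
  p.ends.push_back(static_cast<int>(p.reals.size()));
  p.reals.insert(p.reals.end(), b.begin(), b.end());
  p.ends.push_back(static_cast<int>(p.reals.size()));
  return p;
}

// Recovering argument k means slicing [ends[k-1], ends[k]) back out.
std::vector<double> unpack(const Packed& p, int k) {
  int start = (k == 0) ? 0 : p.ends[k - 1];
  return {p.reals.begin() + start, p.reals.begin() + p.ends[k]};
}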

Yes, we want to distinguish data from other variables. That’s tied up with MPI and shipping data off.

It took me a while to see what you’re doing, but I take it you’re defining the body of f to be the mapped function; if we wanted to solve an ODE, we’d do that rather than sum. In general, we want to map functions that apply to multiple arguments and return multiple results.

I think we could probably handle the lambda abstraction with closures, but adding the auto stuff to Stan would be much more challenging. As to the map itself, we could take just N as the argument and leave the iteration implicit, or we could take an array of arguments.

A further problem is that within the ODE solvers, we again have to make the data/parameters distinction because some arguments can be only data.

One solution would be to ship each process all the data. The overhead might not even be that high.

Mitzi wants to add tuples, and the big cost will be up front: generalizing type inference beyond (int, real, vector, row-vector, matrix) x array dimensions. Specifying ragged arrays is another challenge in terms of sizing them. Everything in Stan comes sized, so either we figure out some way to relax that, or we get very hairy size declarations for ragged arrays as in my original design on the wiki.

I’m totally OK with reserving map in the hope that we can just implement:

template <typename X, typename Y>
std::vector<Y> map(const std::function<Y(X)>& f,
                   const std::vector<X>& xs);

when X and Y can be appropriate tuple types.
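As a sanity check, a sequential reference version of that signature already compiles today, with std::tuple standing in for the Stan tuples we don’t have yet:

#include <functional>
#include <tuple>
#include <vector>

// Sequential reference implementation of the proposed signature.
template <typename X, typename Y>
std::vector<Y> map(const std::function<Y(X)>& f, const std::vector<X>& xs) {
  std::vector<Y> ys;
  ys.reserve(xs.size());
  for (const auto& x : xs) ys.push_back(f(x));
  return ys;
}

int main() {
  using Arg = std::tuple<int, double>;  // heterogeneous arguments via a tuple
  std::vector<Arg> xs = {{1, 0.5}, {2, 0.25}};
  std::function<double(Arg)> f = [](Arg x) {
    return std::get<0>(x) * std::get<1>(x);
  };
  std::vector<double> ys = map(f, xs);
  return ys.size() == 2 ? 0 : 1;
}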