But wouldn’t my proposal to add a map_rect_lpdf function give us the same speedup and even more? In that case N=1 and we can even do lots of aggregation locally.
Yes, and that’s why I think we should go ahead with it. It’s much simpler and still lets us get the speedup in the main case for which we want it.
I really like what @bbbales2 is suggesting in general, and we should definitely keep it in mind for Phase II.
Hi!
Great! I agree that this should go in quickly. There are two choices to make going forward, which I tried to analyze using benchmarks:
- Choose between
  - map_rect(f, real[] eta, real[,] theta, real[,] x_r, int[,] x_i), which allows a shared parameter vector eta, and
  - map_rect(f, real[,] theta, real[,] x_r, int[,] x_i), which does not allow a shared parameter vector eta.
- Add an additional map_rect_lpdf which directly aggregates the user function output. This version can directly aggregate the function value and the partials of the shared parameter vector.
Now, the example I used is an analytic 2-cmt oral PK model which has 5 parameters in total. Of these, 4 are not subject specific while 1 is subject specific. I run the program with J=64000 subjects on 20 cores. The timing for 1000 transitions using 10 leapfrog steps per transition is:
- 6300s for serial analytic MPI, cpu=4, shared, not summed
- 7100s for serial analytic MPI, cpu=4, not shared, not summed
- 6400s for serial analytic MPI, cpu=4, shared, summed
So either I have made a mistake somewhere, or this simply means that the gradient evaluation is no longer a bottleneck! Only the trick with a shared parameter vector gives some speedup, and considering the huge problem size I had to use in order to see a difference, we could opt not to support it for now. In other words, copying the shared parameter vector J times is not so costly given that the gradient evaluation is made so much faster using MPI. The same seems to be true for the summed version (using the map_rect_lpdf function) as well. This may look different once higher-order AD is used, since then more aggregation can occur.
However, all in all, I would conclude that we should simply go with the original proposed design of
map_rect(f, real[,] theta, real[,] x_r, int[,] x_i)
This is a lot simpler to implement, since there is only a single argument which can be a var… and any further optimization does not seem useful since the gradient evaluation is no longer the speed-limiting operation. At least that would be my take.
Let’s discuss tomorrow.
Best,
Sebastian
Just to make sure: the above timings are for 20 CPUs, as written in the text. My previous tests used 4 CPUs and I forgot to update that in the bullet-point list… sorry for that.
Are you testing that you get the right answer?
Can’t you analyze the operations involved in terms of the size of the communication and the size of the gradient calculations? Once you have the expression graph, it’s a bunch of multiply-add operations, but each one’s a virtual function call (though if they’re all the same vari instance, branch prediction will probably be possible).
You can edit previous posts. Click on the ... control.
Editing never works for me… I click it and then I can edit, but I am not allowed to save.
Anyway… of course, I do check that results are correct using
- non-mpi as reference
- gradient test mode to ensure that things are ok
Now, I must have coded something odd in the direct-accumulation case, which runs a lot slower than not accumulating directly - which is very strange. Maybe I stepped into an MPI pitfall (you have to make sure that the lower-level C functions are used).
I need to check that.
Bump.
I just updated the concept/mpi branch on stan-math to be in line with develop. It turns out that boost 1.64.0 ships a BROKEN MPI library. I bumped boost to 1.65.1, but I think it should work with 1.65.0 as well (which is what is in BH). Annoying.
Apart from that, I also integrated my code properly with the stan-math base; I mean that what is implemented now could be integrated into stan-math. Note that the branch currently implements a version with a shared parameter, which I would suggest we drop for now in favor of a simpler version which only has per-job parameters. This may cost a two-digit percent performance loss, but I would prefer to get this function in faster in a simpler form… given that we are talking about three-digit percent speedups here.
In addition, I would suggest we start coding the more user-friendly version already now. This is urgently needed, because programming the map_rect stuff efficiently will be a pain (data must live in x_r arrays and you cannot even assign it to temporaries without paying the price of a considerably larger AD stack). The main problem to solve is to generate functors which are default constructible, but still have the immutable data available. A first approach could rely on a singleton which holds a map from data identifiers to boost::variant instances which are references to the actual data.
BTW, I think we will be able to get MPI on Linux and macOS running straight away. I haven't looked into Windows at all yet, so that remains to be seen… of course, I can only speak for the compilers I have used so far (clang++, g++, Intel).
Did you have a specific design in mind? I don’t see anything in the Wiki page on MPI.
We’ve talked about all sorts of things, so I’m not exactly sure what you have in mind.
By default constructible, you mean with the nullary default constructor that doesn’t do anything other than default initialize class member variables? The only way that’ll work is if that singleton is a global that’s scoped in such a way as to be available when the functor gets compiled.
Why would we need boost::variant?
Hi!
I have linked the design you wrote from the wiki; it's a few entries above: MPI Design Discussion
My thought was to have a map of
data-item-id -> boost::variant which holds a reference to the data item (which is a data or transformed data object)
This global map would then allow us to default construct the functors and would give us access to references to the data we need to pass into the functor. I think your understanding of what I intend to do is right. However, more clever approaches are certainly welcome.
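To make this concrete, here is a minimal sketch of what such a registry could look like; the names (mpi_data_registry, register_data, get_real, get_int) are invented for illustration and are not what the branch actually implements. Pointers are used inside the variant to keep the sketch simple:
#include <boost/variant.hpp>
#include <map>
#include <string>
#include <vector>

// hypothetical process-global registry: data-item-id -> reference to the data
struct mpi_data_registry {
  typedef boost::variant<const std::vector<double>*,
                         const std::vector<int>*> data_ref;

  static mpi_data_registry& instance() {
    static mpi_data_registry registry;  // the singleton
    return registry;
  }

  void register_data(const std::string& id, const std::vector<double>& x) {
    items_[id] = &x;
  }
  void register_data(const std::string& id, const std::vector<int>& x) {
    items_[id] = &x;
  }

  const std::vector<double>& get_real(const std::string& id) const {
    return *boost::get<const std::vector<double>*>(items_.at(id));
  }
  const std::vector<int>& get_int(const std::string& id) const {
    return *boost::get<const std::vector<int>*>(items_.at(id));
  }

 private:
  std::map<std::string, data_ref> items_;
};
A default-constructed functor running on a worker could then look up the data it needs by id from mpi_data_registry::instance() instead of receiving it through its constructor.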
Hmm… thinking a bit about how to apply this MPI map, I figured the Stan language can possibly be hacked around its shortcoming. My issue with the current design from a user perspective is that it is very hard to handle data efficiently. So if you have the MPI-mapped function, then this is inefficient:
real[] foo(real[] theta, real[] x_r, int[] x_i) {
  real data[10] = x_r[1:10]; // expensive!!!
  ...
}
So imagine x_r contains all the data to use; essentially I am not even allowed to assign subsets of the data to sub-expressions without paying the price of bloating the AD stack (the local variable gets declared internally as an AD variable). This leads to very cumbersome code for users… but I think we can work around it by nesting function calls:
real[] bar(real[] theta, real[] data1, real[] data2) {
  ...
}
real[] foo(real[] theta, real[] x_r, int[] x_i) {
  return bar(theta, x_r[1:10], x_r[11:20]);
}
While one still has to do the indexing, nesting function calls like this makes it possible to get declared data1/data2 variables which are truly data. That is very valuable for coding up problems. @Bob_Carpenter: Am I right here? If so, then the original map_rect is maybe not so bad after all; we just need to educate people to code it right.
Yes, that’s right. If you declare a variable, then its type automatically gets promoted. I’m going to create an issue to fix this.
We should be able to have
real[] foo(real[] theta, data real[] x_r, data int[] x_i) {
  real data[10] = x_r[1:10];
  ...
}
(using the new data qualifier, which forces those arguments to be data only) translate the second line as
std::vector<double> d(10);
for (size_t i__ = 0; i__ < 10; ++i__)
  d[i__] = x_r[i__];
In general, I think we can use typename promote_args<T1, ..., TN>::type for non-integers, where T1, ..., TN are the types found in the right-hand-side expression.
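For illustration only (this is not generated parser code, just a sketch of the idea using the Boost trait mentioned above; the function name add is made up):
#include <boost/math/tools/promotion.hpp>
#include <cstddef>
#include <vector>

template <typename T1, typename T2>
std::vector<typename boost::math::tools::promote_args<T1, T2>::type>
add(const std::vector<T1>& a, const std::vector<T2>& b) {
  // the local scalar type is promoted from the types on the right-hand side:
  // double if T1 and T2 are both double, var as soon as one of them is var
  typedef typename boost::math::tools::promote_args<T1, T2>::type scalar_t;
  std::vector<scalar_t> c(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    c[i] = a[i] + b[i];
  return c;
}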
And yes, if you just have x_r[1:10], that should just be data.
I created an issue on stan-dev/stan:
I like the design you linked with generated functors that have the data variables as members, and a map call that takes a function whose arguments are just the parameters. With C++11, we could use lightweight closures in a similar way (and thus the math library use cases could also be fairly simple with closures). Are you planning to move forward with that design as a replacement for the current one (which I believe is a map_rect that takes some data and data indices as arguments as well)? I would support this, though I'm not sure if that's what you were suggesting.
I also didn’t understand why one needs boost::variant in the following passage:
Hi!
We can discuss the boost::variant thing at some point in a call possibly.
I understood that the plan is to move forward with a map_rect in any case. This implementation will be part of stan-math… and the map_rect thing is straightforward and quick to integrate.
The more user-friendly version will land in stan, as it will depend a lot on stan parser/language features.
So, first the map_rect thing, then the user-friendlier version, which as I understand is harder for Bob to deal with. It's doable, but requires some extensive parser work.
Sebastian
I’m finally diving into this deeply today, and had a couple of questions:
- Did you already try a more naive version that just uses boost::mpi::reduce under the hood? It seems like it would be much simpler, and you might even be able to do most of the optimizations you've already got in here (and even if not, I'd be curious about a first correct, simple "pre-optimization" baseline in numbers and code, recalling that premature optimization is the root of all evil). I could sketch up something if you like - one notable feature it took me a while to realize is that you can use a global data singleton in a very easy way with reduce (something like the sketch after this list). But maybe this doesn't work with the way that we load data and use MPI?
- What exactly are the optimizations you're performing so far? I see two main areas:
  a. avoiding some excess autodiff memory allocations (pretty unsure about this one)
  b. sending data just once (I think maybe I understand how this works, though not the problem space)
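For reference, a minimal sketch of the reduce-with-global-data idea; every name here (local_data_slice, local_lpdf, mpi_sum_lpdf) is hypothetical, and the gradients would still need their own communication, which is part of the discussion that follows:
#include <boost/mpi/communicator.hpp>
#include <boost/mpi/collectives.hpp>
#include <functional>
#include <vector>

// hypothetical: each process has loaded its own slice of the data into a
// process-global before sampling starts, so only parameters go over the wire
extern std::vector<double> local_data_slice;

// hypothetical user function: log density contribution of one data slice
double local_lpdf(const std::vector<double>& theta,
                  const std::vector<double>& data_slice);

// assumes theta has already been broadcast so every rank sees the same values
double mpi_sum_lpdf(const boost::mpi::communicator& world,
                    const std::vector<double>& theta) {
  double partial = local_lpdf(theta, local_data_slice);
  if (world.rank() == 0) {
    double total = 0;
    // sum the per-rank contributions onto rank 0
    boost::mpi::reduce(world, partial, total, std::plus<double>(), 0);
    return total;
  }
  boost::mpi::reduce(world, partial, std::plus<double>(), 0);
  return 0;  // only the root's return value is meaningful
}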
I also came up with some questions that were easier to write on a diff so I made a fake pull request in my own repo to record these types of questions and comments: https://github.com/seantalts/math/pull/2/files
Please forgive me if you’ve already explained these things; feel free to link me to stuff. I’m also happy to hop on a call if you think that’d be faster.
Hi!
As answered, I tried to use mpi::reduce, but it did not give me good speed. I tried to answer your questions as well as possible. Let me know if you have questions about my reasoning. Ah… one thing: a reason to avoid the reduce is that you need another gather call anyway. So having a single gather call may be better after all, given that latency is a problem for MPI while data transfer volume is not.
Sebastian
Sorry, I forgot these: yes, I try to keep the AD stack as small as possible, and I also try to minimize the MPI communication (the number of MPI commands and the amount of data contained in the messages).
BTW, if you see a nice way to code the map_rect for the shared plus per-job parameter version (the version with an eta like in the code you looked at), that would be great. What troubles me are the mixed cases, which either require a lot of copy+paste (yuck) or some clever programming scheme which I haven't figured out yet.
Thanks much for looking into this!
I’ll play around with it some more. I think I see where you were using reduce to implement mpi_rect_lpdf. I’m wondering if you tried the version of reduce that uses statically available data, as in the image posted above, and whether that’s even possible with what we’re working with, given the way we read in data, etc.
Can you give me a little more context on the “per-job” vs. “shared” parameter versions? What do you mean by that, and why do we need both? I’ll play around with things a bit more and see what I can come up with, but I think I’m lacking some context on how these will be used.
Here is an example of shared / per-job parameters:
The example program could be a bit large and difficult to digest, so here is the takeaway: I have written this map_rect function with hierarchical models in mind. Imagine we have N subjects and we model their likelihood using S parameters which are the same for all N subjects and E parameters which are specific to a given subject (S + N*E parameters in total). If we only had a map_rect version which does not allow shared parameters, then we would have to
- generate N copies of the S shared parameters (N*S values) and send out N * (S + E) parameters, and
- implicitly increase the AD stack size by (N-1) * S variables on the root.
With the shared map_rect version one can avoid both of these, which improves performance due to reduced communication cost and a smaller AD stack.
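To put rough numbers on this (my arithmetic, using the benchmark figures from earlier in the thread, J = 64000 subjects, S = 4 shared and E = 1 subject-specific parameters): without sharing, each gradient evaluation ships N * (S + E) = 64000 * 5 = 320,000 parameter values and adds (N-1) * S ≈ 256,000 extra copies of the shared parameters to the root's AD stack; with sharing, it ships S + N * E = 4 + 64000 = 64,004 values and adds no extra copies.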
Now, there is another application case which I had not thought about in the beginning: assume we have a huge data set for which we don't even do a hierarchical model, but evaluating the likelihood would take ages on a single core. With a map_rect which supports shared parameters, this would be straightforward to do efficiently: we would split the data set into a few slices and send out the shared parameter array to each node (and not send any slice-specific parameter array at all).
So shared parameters improve performance wherever you can apply them - and I think just about any problem will have shared parameters. This is just the more general version of the function, and if you see an easy way to handle the mixed cases (vv, vd, dv) with a single, tight code base, we should do that immediately.
I hope this clarifies your understanding… I am happy to have a call.
I see - it seems like the shared parameters can be thought of as effectively another kind of data in these cases (versus a collection we are mapping or reducing over), is that right? It seems like the functor or closure approaches we talked about above would be the natural way to do it once we have language support.
Did you see my question about reduce above?
Shared parameters… to me they are not quite data, but you are right that a closure, where you curry the shared arguments into a functor, would be the concept to use. However, I doubt that this will ever work for the functors we use for MPI. The reason is that we only ever transfer types over the wire to the child processes, never a serialized functor.
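Roughly speaking, the worker side can only ever do something like the following sketch (the name run_job and the assumed call signature of the functor are invented), which is why a serialized closure never enters the picture:
#include <vector>

// the job is identified by the functor type F alone; F is default constructed
// on the worker, so no functor state is ever sent over the wire
template <typename F>
std::vector<double> run_job(const std::vector<double>& theta_job) {
  F f;  // only possible because F is default constructible
  // any data f needs comes from the process-local registry
  // (e.g. the mpi_data_registry sketched earlier), not from serialization
  return f(theta_job);
}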
I haven’t tried any static reduce, no, and I am not sure it's applicable. My general take from the _lpdf thing which does accumulation is that it apparently does not buy you much performance. So the user should really just write a map_rect where each function call returns an array with a single element, the lpdf value. Aggregating that directly does not make a huge difference for the performance of the programs I tested. It could well be that I made a mistake in my programs, though, such that performance was spoiled. However, as of now we should focus on map_rect (ideally directly with a shared parameter interface, but only if that is doable quickly).
After all, we are talking about bringing compute times down from days to a few hours here… and I can finally leverage our 1.5k CPUs.