MPI Design Discussion

More good news: I got interested in the performance of this MPI stuff on analytic programs. So far I had only tested ODE problems, where MPI easily gives great speedups since the numerical cost per evaluation is so heavy that the extra communication overhead is irrelevant, as my previous benchmark has shown.

So I am now calculating the same problem, but instead of an ODE integrator I am using analytical expressions, and I have increased the number of jobs to calculate by a factor of 10, to 3200 in total. Here are the strong scaling results:
[figure: mpi2_scaling_a_strong (strong scaling plot)]
These results are much better than I expected. We go from a 2.4 h run time on 1 core down to just 10 minutes on 20 cores, which is a whopping 14x speedup!

However, thinking about it, I figured that we can probably do better: the problem with analytical expressions is that they are so cheap to calculate that increasing the number of job chunks leads to increased communication overhead (to and from the workers). But we can do a lot better:

  1. We should change the design such that there is a shared 1D parameter array theta and a 2D array eta. The theta will be distributed to all nodes and the 2D array will be chunked as usual. This way we can (depending on the problem) cut down the communication cost to the workers.

  2. (and this is a BIG one): Currently, the function is designed to return all outputs which are generated. That’s fine and very useful, but extremely wasteful in a Stan program where we are only ever interested in the accumulated value of the log probability density and its derivatives. In short, we should introduce a map_rect_mpi_lpdf function which expects each function to return just a single value, the accumulated log prob for a given work chunk. This will dramatically reduce the communication from the workers back to the root. We then no longer need to communicate all results and all partials; instead we can run the reduction directly on the workers, which is to sum up the respective terms (see the sketch below). As a nice by-product this will reduce the size of the AD stack on the root node enormously, down to a single entry!!!
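To make point 2 concrete, here is a minimal sketch of the worker-side reduction in plain MPI. This is not the actual implementation; job_log_prob is just a made-up stand-in for the user functor, and only the values are reduced (partials are left out):

    // Sketch only: each rank accumulates the log prob of its own jobs locally
    // and a single double per rank travels back to the root.
    #include <mpi.h>
    #include <cstdio>

    // Hypothetical per-job log-prob contribution (stand-in for the user functor).
    static double job_log_prob(int job) { return -0.5 * job * job * 1e-6; }

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const int total_jobs = 3200;  // as in the benchmark above

      // Each rank sums the log prob over its chunk of jobs locally ...
      double local_sum = 0.0;
      for (int job = rank; job < total_jobs; job += size)
        local_sum += job_log_prob(job);

      // ... and only one value per rank is communicated back to the root,
      // instead of one result (plus partials) per job.
      double total = 0.0;
      MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
        std::printf("accumulated log prob = %f\n", total);
      MPI_Finalize();
      return 0;
    }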

I am currently running a benchmark where point 2 is approximated by just returning a single value per job chunk. The performance looks much better so far; I will see. Now, this is quite exciting, and this technique of running the reduction immediately will be even more beneficial for higher-order autodiff, simply because the reduction can happen directly on the local workers. I really hope I haven’t made a mistake in my thinking here… it sounds too good to be true.

Your reasoning is sound—that’s a huge deal for the two reasons you mention (cutting down communication in MPI and reducing the autodiff stack size).

Is it not possible to accommodate this case using a single-element vector as the return value? Or is that still wasteful? Sounds like that’s what you did. But then it sounds like you want to add a second signature or change the original signature.

It’d be helpful if you could update the design document or post something more explicit here.

Great.

You could just return a single-element real array, yes… I have tried that and it works nicely. BUT: this is still wasteful. Say you have 10k jobs which you place on 10 CPUs such that each CPU has 1k jobs to calculate. Returning a single value per job leaves me with returning 1k results per CPU back to the root (plus the partials). However… we can do much better, because the reduction can be performed on the node directly! Hence, the right function will sum the 1k results on the node as things are calculated, and then from the 10k jobs we only have to transmit 10 values back to the root, which get combined there (instead of returning 10k results in total)! That’s a huge gain. Now, the user could block things such that on 10 CPUs he only creates 10 different work chunks… but that is very tedious to code.

I will update the design page when I have time for it (possibly next weekend).

Grr… I misled myself here. We can accumulate the log-prob value across all jobs, but we cannot sum up the partials. So there could be a benefit from a reduced AD stack size, but the overall backwards communication will not be a lot better, as the partials make the communication costly.

So the function signature I have in mind now is this:

   real[] f(real[] theta, real[] eta, real[] x_r, int[] x_i) { }

   real[] map_rect(F f, real[] theta, real[,] eta, real[,] x_r, int[,] x_i);

So the theta parameter array would be given to all functions as a shared parameter array, and the 2D eta array would be split and given row-wise to the f function.
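To make the data flow concrete, roughly something like this would happen under the hood (just a sketch in plain MPI, not the actual implementation; the sizes and the flat row-major layout of eta are made up):

    // Sketch only: broadcast the shared theta once, scatter eta row-wise.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const int n_theta = 4;    // shared parameters, same for every job
      const int n_jobs = 3200;  // rows of eta, one per job
      const int n_eta = 2;      // job-specific parameters per row

      std::vector<double> theta(n_theta);
      if (rank == 0) {
        // root fills theta (and eta below) with the actual values ...
      }
      // ... every worker gets the full shared theta once ...
      MPI_Bcast(theta.data(), n_theta, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      // ... while eta is chunked row-wise: each worker only gets its rows.
      const int rows_per_rank = n_jobs / size;  // assume it divides evenly here
      std::vector<double> eta;                  // flat row-major, root only
      if (rank == 0) eta.resize(n_jobs * n_eta);
      std::vector<double> my_eta(rows_per_rank * n_eta);
      MPI_Scatter(eta.data(), rows_per_rank * n_eta, MPI_DOUBLE,
                  my_eta.data(), rows_per_rank * n_eta, MPI_DOUBLE,
                  0, MPI_COMM_WORLD);

      // each worker now calls f(theta, row of my_eta, x_r row, x_i row) per row
      MPI_Finalize();
      return 0;
    }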

Makes sense?

So right now you have a function that takes N inputs and produces M outputs. You’re distributing K of these calls around a cluster.

So distributing input (since you’ve shared the constant data) is pretty easy because it’s KxN. The issue is that each of the K function calls produces an MxN Jacobian which you have to compute and ship back to the head node. So that’s KxMxN communication?

Why not hook the MPI stuff up through the reverse mode autodiff chain calls? It’d mean more individual comm calls, but probably significantly less data. For the forward pass you’d have to send from the head node KxN things and receive back KxM things, but for the reverse mode autodiff you’d only have to send KxM adjoints and then receive KxN adjoints back from the cluster (you’d have to do basically two MPI barriers for the forward and reverse passes, but it’d be less stuff if that’s the problem).
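Just putting the counts above side by side (K jobs, N inputs and M outputs per job):

    \begin{align*}
    \text{ship Jacobians back:} &\quad KN + KM + KMN
      && \text{(inputs out, values and Jacobians back)}\\
    \text{chain on the workers:} &\quad (KN + KM) + (KM + KN) = 2(KN + KM)
      && \text{(forward pass, then adjoints on the reverse pass)}
    \end{align*}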

That sound right?

Why not? If you have two terms a and b contributing to the log density, then

d.(a + b) / d.x  =  d.a / d.x  +  d.b  / d.x

That’s below the level at which the Stan language operates, so there’d be no way to specify that in the language. This is far too hard to automate by code analysis, I think.

Yup, this is what I had in mind as well, but then I got confused while coding things up. But if it works like this, then we have a major gain to make here, as the accumulation can then happen on the local workers.

I just mean that when you make the call, output = mpi_map(function, input) or whatever, the .chain call attached to the output variable on the autodiff stack does something like:

  1. main process sends all the adjoints to the workers
  2. the workers inject those adjoints into the autodiff stacks they built on the forward pass
  3. the workers chain back those adjoints to the inputs of the mapped function
  4. main process collects the adjoints from the inputs of all the workers

Not that the MPI calls themselves are being autodiffed or anything. There’s the issue of the workers having to save their state between the forward computation and the chaining, but that doesn’t seem impossible to solve.
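For concreteness, here is a very rough sketch of what steps 1-4 could look like on the root node, assuming Stan Math’s vari base class and plain MPI. The name mpi_map_vari, the message protocol, and the workers’ listen loop are all made up, and the AD-stack registration details are glossed over:

    // Sketch only, not the actual implementation.
    #include <stan/math/rev/core.hpp>
    #include <mpi.h>
    #include <vector>

    struct mpi_map_vari : public stan::math::vari {
      int n_in_, n_out_;
      stan::math::vari** in_;   // varis of the mapped inputs (root copies)
      stan::math::vari** out_;  // varis of the mapped outputs (root copies)

      mpi_map_vari(int n_in, int n_out,
                   stan::math::vari** in, stan::math::vari** out)
          : vari(0.0), n_in_(n_in), n_out_(n_out), in_(in), out_(out) {}

      void chain() {
        // 1. main process sends all the output adjoints to the workers
        std::vector<double> out_adj(n_out_);
        for (int i = 0; i < n_out_; ++i) out_adj[i] = out_[i]->adj_;
        MPI_Bcast(out_adj.data(), n_out_, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        // 2./3. the workers inject these adjoints into the AD stacks they kept
        //       from the forward pass and chain them back to the inputs
        //       (that code lives in the workers' listen loop, not shown here)

        // 4. main process collects (sums) the input adjoints from all workers
        std::vector<double> in_adj(n_in_, 0.0);
        MPI_Reduce(MPI_IN_PLACE, in_adj.data(), n_in_, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);
        for (int i = 0; i < n_in_; ++i) in_[i]->adj_ += in_adj[i];
      }
    };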

It’s probably not going to be worth either the communication or synchronization cost, much less both, if we already have the partials. Then it’s just a bunch of multiply-adds in the reverse pass, where the main cost is virtual function calls (which would be even more costly under MPI).

Now if these had some regular matrix structure or were easy to parallelize, then it might be possible to get some benefit out of parallelism.

Stan does allow lazy calculation of the gradients, but I doubt that’s what we’ll have in most of the cases we care about parallelizing here, where we might as well just compute the partials as we compute the function values and ship everything back.

Maybe I’m missing something here.

Well the latent point was really:

if we already have the partials

If there are M outputs, then getting all the partials in the forward pass requires KxM reverse mode gradients of the function that is being executed in parallel.

If this happens in the regular flow of the reverse mode autodiff, you only need K reverse mode gradients of this function (because all the adjoints are with respect to the target log probability). There are computation savings too.
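To make the counts explicit (the M = 5 here is just a made-up example):

    \underbrace{K \cdot M}_{\text{reverse sweeps to build the Jacobians}}
    \quad\text{vs.}\quad
    \underbrace{K}_{\text{reverse sweeps when chaining adjoints}},
    \qquad\text{e.g. } 3200 \cdot 5 = 16000 \text{ vs. } 3200
    \text{ for } K = 3200,\ M = 5.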

I think in @wds15’s original use case (for the implicit ODEs), it wouldn’t really make much of a difference because the sensitivities for the implicit ODEs are computed in forward mode anyway, but for more general Stan models, it makes sense to consider the reverse mode stuff again?

In general, they won’t all be w.r.t. the target log density (it’s a density, not a probability—our terminology is all messed up—I should’ve used lpdf instead of log_prob from the get-go, but I was anticipating discrete parameters in the beginning).

But if you have a fat K x M Jacobian, then I still don’t see the savings in being lazy about it. I’m clearly missing something here.

In general, they won’t all be w.r.t. the target log density

Aren’t these the only derivatives NUTS cares about (edit: cares about → needs)? Derivatives of target log density with respect to all the parameters?

I’m clearly missing something here.

Haha internet is not whiteboard. Makes it harder. I could easily be missing something too.

In aggregate, yes. But they’re made up out of blocks of the expression graph. And what we’re trying to do is distribute those blocks. So what Sebastian’s doing is computing the solution to an ODE, which plays the role of the linear predictor x * beta in a linear regression. The Jacobian of x * beta plays a role in the larger calculation.

The issue now is how we can organize things so that the Jacobians getting passed around are minimal. One way to do that is to have a single log-density output of which the ODE derivatives are an encapsulated component.

This is very much like the TensorFlow architecture, by the way, which, as far as I understand it, is a way of distributing and aggregating Jacobian calculations, with the obvious applications to deep belief nets.

Hi Bob!

I should write this up with LaTeX… but I think we can do the accumulation of the partials for the shared parameters only. For those parameters which vary between work chunks we need each term separately. That’s still a big deal if there are a number of shared parameters, which there should be, given that we calculate the full log prob per item over which we parallelize.
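Roughly, the LaTeX version of that argument would look something like this (sketch only; K work chunks, shared theta, job-specific eta_k):

    \log p = \sum_{k=1}^{K} f(\theta, \eta_k, x_k)
    \qquad\Longrightarrow\qquad
    \frac{\partial \log p}{\partial \theta}
      = \sum_{k=1}^{K} \frac{\partial f(\theta, \eta_k, x_k)}{\partial \theta},
    \qquad
    \frac{\partial \log p}{\partial \eta_k}
      = \frac{\partial f(\theta, \eta_k, x_k)}{\partial \eta_k}

So the partials with respect to the shared theta can be summed up on the workers (one vector of length |theta| comes back per worker), while the partials with respect to each job-specific eta_k still have to be returned per job.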

And I guess you are right in that we have rebuilt TensorFlow for the poor.

Sebastian

Not for the poor. For those who want luxury, reliability, and ease of use.

See Neal Stephenson’s essay, In the Beginning…was the Command Line. I think we’re the Mac OS in this picture, i.e., a European luxury car, whereas TensorFlow is the command line (Linux), i.e., a tank.

Most users don’t want to learn how to drive and maintain a tank.

Of course, the analogy is flawed because all of these systems (OK, not NONMEM) are open source, so in this case, both the luxury car and tank are free.

The issue now is how we can organize things so that the Jacobians getting passed around are minimal. One way to do that is to have a single log-density output of which the ODE derivatives are an encapsulated component.

Yup, and one way of doing this is to not send the Jacobians themselves to the host, but to involve the workers in the chaining (whereby the host just transmits the necessary adjoints to the workers, which are much smaller).

In aggregate, yes. But they’re made up out of blocks of the expression graph.

But I think we don’t want to actually be computing the Jacobians of these blocks (as much as possible)? Unless the expression graph has one output or one input, computing the Jacobian of an expression graph during the forward evaluation is going to be more work than passing the adjoints back through that expression graph on the reverse sweep.

With the current ODEs, the Jacobians already exist so this isn’t as much of a deal, but for other calculations that wouldn’t be true.

I’m terrible at all this calc, so I had to write this out a bit more explicitly. Suppose we have a function f: R^M -> R^N. We’d need to pass back an N x M Jacobian.

If we ship the chaining out, then we need to pass the adjoints of the N output variables and then collect the adjoint increments for the M input values.

Yup, that looks like a huge savings. I have no idea how we’d add the extra synch there, but if we get into this position, we’ll certainly already be assuming we have multiple cores, so it’d be great if we can figure out how to use them to speed up the backward pass, which will be the limiting factor for parallelization.

Hmm… I don’t fully understand the details, but what you suggest would lead to KxM + KxN data being sent around instead of KxNxM + KxN, right?

I need to digest that, but it would certainly be a headache to sync all that. It was already a task to get the workers to do what I want. Your suggestion would need more bookkeeping.

Can we go ahead with the simple-minded version and revise it later? What would we need to take into account to ensure that the user-facing function can be replaced with another implementation in the back end?

The thing is that the simple version is quite mature, I think… and the performance gain is dramatic.

Yup yup, and the other thing is that building Jacobians with little reverse mode autodiffs during the forward pass will take Kx(number of outputs). If you can avoid that and do everything reverse mode, it’ll be K work.

There’d actually be 2x(KxM + KxN) data since you’re communicating on the forward and reverse pass.

It was already a task to get the workers such that they do what i want.

Haha yeah, these parallel things are always confusing, but you made it this far :D. Just wanted to get the idea planted.