Possible non-exact reproducibility issue with different AD schemes?

Hi!

I have made a few odd observations when comparing MPI to non-MPI code:

  • MPI code gives exactly the same results regardless of the # of CPUs …
  • comparing non-MPI and MPI runs gives exactly the same results in the diagnose mode of the model
  • running the model non-MPI vs MPI leads to chains that diverge from one another after 100 iterations or so. The differences are very small, but they exist.

The only difference I can think of that could be relevant here is that the in-memory representation of the AD graph differs between the two versions. The MPI version of map_rect calculates all derivatives on the spot and inserts them as precomputed_gradients into the AD graph. The non-MPI version, on the other hand, creates a somewhat larger AD graph, since the results “past” map_rect are still represented as a large graph instead of pre-computed gradients. Keeping things as a big graph and only calling grad later leads to different floating-point rounding behavior when the two different graphs are traversed.
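
To illustrate what I mean with a stand-alone toy (this is not the map_rect code; the function y = a*b + exp(a) and its partials are just made up for the example): the same result can either be left as a regular AD sub-graph, or its value and partials can be computed with plain doubles and attached to the operands as a single flat node via precomputed_gradients. Both carry the same gradient information, but the graph that grad later traverses is different.

  // Two ways to represent y = a*b + exp(a) in the AD graph. Variant 1
  // builds the usual sub-graph; variant 2 computes the value and the
  // partials dy/da, dy/db as plain doubles and inserts one flat
  // precomputed_gradients node. Same gradient information, different layout.
  #include <stan/math.hpp>
  #include <cmath>
  #include <iostream>
  #include <vector>

  int main() {
    using stan::math::var;

    var a = 1.3, b = -0.7;

    // Variant 1: vanilla reverse mode, deep(er) sub-graph below y1.
    var y1 = a * b + stan::math::exp(a);

    // Variant 2: on-the-spot evaluation, single precomputed_gradients node.
    double a_d = a.val(), b_d = b.val();
    double val = a_d * b_d + std::exp(a_d);
    std::vector<var> operands{a, b};
    std::vector<double> partials{b_d + std::exp(a_d), a_d};  // dy/da, dy/db
    var y2 = stan::math::precomputed_gradients(val, operands, partials);

    std::cout << y1.val() << " " << y2.val() << std::endl;  // identical values

    stan::math::recover_memory();
    return 0;
  }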

Would others agree on this argument or is this nonsense?

If that is correct, then we have to provide a non-MPI map_rect implementation which does the same thing as the 1-core MPI version (on-the-spot calculation). Otherwise, exact reproducibility would be lost between MPI and non-MPI runs (although MPI results would still not depend on the # of CPUs used).

Thanks for any thoughts and comments.

Best,
Sebastian

In order to confirm my understanding, I went ahead and implemented a map_rect function which does not use MPI at all, but evaluates the gradients on the spot in the same way as the MPI code (and then inserts the results into the AD graph via precomputed_gradients).

Now the MPI results and non-MPI results are exactly the same. Hence, the AD tree layout matters for exact reproducibility, even if the gradients represented by the different layouts are the same. This is not too surprising after all, but it is relevant and interesting in the context of MPI, which essentially changes the AD tree layout.

BTW, if I remember correctly, the version using precomputed_gradients is even somewhat faster than the version which builds deeper AD graphs (on the order of 10-20%). I think this stems from the fact that (a) the nested AD graphs are recycled and (b) the overall AD graph is less deep. However, this finding needs a bit more testing and it will certainly depend on the specific model being run.

I leave this for the records here. Comments welcome.

Best,
Sebastian

I followed up on the observation that precomputing gradients can lead to speedups… and it looks like they can be significant! I am using an analytic example with 16000 subjects, each using 5 parameters. The analytic function solved for each subject involves many algebraic (but simple) expressions. The runtime on a single core differs quite a bit:

  • vanilla Stan code which builds a deep AD tree, 18.2h
  • evaluation of the map_rect call using precomputed_gradients without MPI, 10.5h
  • MPI enabled map_rect on a single core, 10.62h

Below is a speedup graph of what happens when using MPI in this example with more and more CPUs (strong scaling). The two lines differ only in which of the above runtimes is used as the reference. Because of the single-core speedup when going from vanilla Stan to MPI, the cyan curve shows speedups greater than the # of CPUs.

Now, I have to say that while this is great, it seems to me that there is a slight penalty to pay in terms of overall gradient precision; at least this is my impression when looking at the Rhat values. This could be due to the example, which I have deliberately set up such that the quantities to estimate are not of order unity but grow in magnitude, which is certainly not a wise thing to do in a real model.

To me this looks like pre-computing gradients in map_rect is a desirable thing, certainly for this example.

[Figure: mpi3_scaling_strong, strong-scaling speedup vs. number of CPUs]

The labels on the cyan curve are the same as on the red curve…

… the labels are the runtimes of the MPI runs. These do not change between the two curves; all that changes is the reference. The red curve is relative to the running time of the Stan program using a single core and the precomputed-gradients approach, while the cyan curve is relative to the running time of the Stan program using a single core and vanilla Stan procedures to build the AD graph. So the MPI runtimes are divided by roughly 10.5h and 18.2h, respectively.
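
For example, using the single-core timings from above: at one core the red curve sits at roughly 10.5h / 10.62h ≈ 0.99, while the cyan curve sits at roughly 18.2h / 10.62h ≈ 1.7, which is why the cyan curve can lie above the number of CPUs.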

When I did the adjoint sensitivity stuff for the ODEs, the adjoint method seemed to behave noticeably worse than the forward method (edit: numerically worse). They’re different algorithms, so it makes sense they could behave numerically different.

It’s really disconcerting that something like this would be affecting Rhats and divergences and such. Is it the non-MPI or MPI code that is getting the divergences? Can you post the code to the model that’s doing this?

The speedup I think makes reasonable sense too. With the adjoint stuff, I drop-in replaced the reverse mode autodiff in the Jacobian calcs with forward mode for something like a 50% speedup (or some vague, big number like that), even though it was technically doing more work.

I’m surprised to see it work like this though. I never thought of pulling it out anywhere else haha. I wonder if this is something that @Matthijs could take advantage of? I think he was talking about compile time optimization somewhere.

@wds15, is there a public repo where you have this implemented?

If you’re talking about diagnose from CmdStan, then it’s comparing reverse-mode to finite differences. What exactly are you trying to compare? Have you looked at the elements of the autodiff stack?

I think you highlighted some differences below that statement. Questions I have:

  • are the values identical when they’re put on the stack?
  • is the computation done in the same order in MPI vs non-MPI? If not, why not?
  • I didn’t understand the point of the difference in AD graph. Can you explain that in more depth?
  • can you pinpoint a trajectory where we get divergences in the MPI version? We should be able to kick off the trajectory and make sure there’s not a (subtle) bug somewhere.

Here are the things I would believe (but haven’t verified):

  • floating point results are not guaranteed to be the same in non-MPI and MPI. If that’s the cause, then that makes sense.
  • the autodiff expression graph should be the same.
  • when the head node computes from the autodiff graph, it should produce the same results if the expression graph and all values on the stack are identical.

Let me give a quick answer now which may already address your questions; if not, I will go point by point.

In short: the gradient information represented in MPI vs non-MPI is the same, but it is represented differently since the AD graph is different.

So when I say vanilla Stan map_rect, I mean getting the gradients as if you had defined a Stan function that looks like this:

  // Stan-level picture of what vanilla map_rect computes: loop over the
  // J jobs, call the user function for each one, and copy the M[j]
  // outputs of job j into the stacked result vector.
  vector map_rect_stan(vector eta,      // shared parameters
                       vector[] Theta,  // job-specific parameters
                       int[] M,         // number of outputs per job
                       real[,] x_r, int[,] x_i) {
    int J = size(M);
    vector[sum(M)] res;
    int cj;
    cj = 1;                             // current write position in res
    for(j in 1:J) {
      vector[M[j]] run;

      run = mpi_function(eta, Theta[j], x_r[j], x_i[j]);

      for(m in 1:M[j])
        res[cj + m - 1] = run[m];

      cj = cj + M[j];
    }

    return(res);
  }

This will create a “deep” AD graph, since the graph triggered by calling this function has many leaves “below” the operands which go into it.

The MPI way of calculating this is different in that we pre-compute the gradients; see here.

So the MPI way of pre-computing the gradients leads to a different AD graph: for each output there is a pre-computed gradient associated with the given operands. Hence, there is no deep structure “below” the map_rect-induced AD graph, as there is in the vanilla case.
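
To make the pattern concrete, here is a simplified, stand-alone sketch (job_fun and eval_job are made-up names, and the real map_rect additionally handles the shared parameters and multiple outputs per job): the gradient of each job is computed immediately in a nested reverse-mode sweep, the nested graph is recycled right away, and the result enters the outer AD graph as a single precomputed_gradients node.

  // Sketch of on-the-spot gradient evaluation for one job (not the actual
  // map_rect internals): run a nested reverse-mode sweep over copies of the
  // operands, read off the adjoints, recycle the nested AD graph, and attach
  // value + gradient to the outer operands as one precomputed_gradients node.
  #include <stan/math.hpp>
  #include <vector>

  using stan::math::var;

  // Hypothetical per-job function (stands in for the user's mpi_function;
  // assumes at least two job-specific parameters).
  var job_fun(const std::vector<var>& theta) {
    return theta[0] * theta[0] + stan::math::exp(theta[1]);
  }

  var eval_job(const std::vector<var>& theta) {
    std::vector<double> theta_d(theta.size());
    for (std::size_t i = 0; i < theta.size(); ++i)
      theta_d[i] = theta[i].val();

    double fval = 0;
    std::vector<double> grad_f(theta.size());

    stan::math::start_nested();
    try {
      // Fresh vars local to the nested scope.
      std::vector<var> theta_local(theta_d.begin(), theta_d.end());
      var f = job_fun(theta_local);
      fval = f.val();
      f.grad();  // nested reverse sweep for this job only
      for (std::size_t i = 0; i < theta_local.size(); ++i)
        grad_f[i] = theta_local[i].adj();
    } catch (...) {
      stan::math::recover_memory_nested();
      throw;
    }
    stan::math::recover_memory_nested();  // recycle the nested AD graph

    // Single flat node in the outer graph; no deep structure below it.
    return stan::math::precomputed_gradients(fval, theta, grad_f);
  }

In the MPI case these nested sweeps run on the workers, and only the values and gradients travel back to the root, which inserts them in the same way.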

Makes sense?

I have to confess that the model I am using for benchmarks is ill-posed on purpose, as I was seeking to have NUTS call the grad method a gazillion times. So this model is not set up properly: I deliberately put some of the variables on vastly different scales. If you really want to look, see here. There were no additional divergences, just slightly worse Rhats, but bear in mind that the sampling phase was very short, so this may go away with proper sampling, which is what I would expect.

At this time I can only guess why on-the-spot gradient caching is a good thing for map_rect. My hunch is that it considerably reduces the AD graph size and leads to better memory locality during the final AD sweep.

It obviously made this model a lot faster and I would like to try out a few more cases. But, yes, this could be an interesting opportunity for other functions as well, in particular those where many operands go into the function. So the stuff @Matthijs was doing may benefit from this.

So, in order (first your questions, then the things you would believe):

  1. All values are identical when they go on the stack.
  2. The computation is done in a different order:
  • vanilla Stan: the AD graph only gets built up when map_rect is called; all gradients are computed later, when the grad method is called.
  • MPI code: when map_rect is called (i.e., while the AD graph is being built), the function immediately performs all gradient calculations and inserts them as precomputed_gradients. So before the global grad call is triggered, the gradients of everything “beyond” map_rect are already pre-calculated.
  3. I hope I have explained the difference in the AD graphs sufficiently now.
  4. I do not get any additional divergences! There are only minor differences in the HMC trajectories after a few hundred iterations. I do not think there is an MPI bug or anything like that: I have created code without any MPI, but using the same pre-computing technique, and once I do that, the MPI results and the pre-computed results match exactly.

As to the things you would believe:

  1. The floating point results can be the same, but don’t need to be; it depends on what we do. For map_rect as we are implementing it right now, there is no difference at all.
  2. The gradient information is the same, but the AD graph is different. This causes differences in floating point behavior which are not about MPI vs non-MPI, but about the evaluation order implied by the AD evaluation scheme.
  3. Yes and no. When you do a grad sweep on the workers and on the root (with the same inputs), you will get the very same result. BUT: because we change the order in which things are evaluated, we do end up with differences in floating point rounding behavior.

Let me know if this clears things up or if you have more questions on specific parts.

Owww, the complexityyy haha

Oooh, I misunderstood – this doesn’t bother me then. It’s a pretty interesting observation though.

That clears up a lot.

  • Is there a way to build the system so the AD graph is the same?

I’ll need to do some digging, but I would have expected the AD graph to be constructed identically.

Hi!

Sure, the AD graph can be built in the same way… in fact this is what I have done already, see here.

In that file you see the “usual” way of building the AD graph, which is commented out; the current implementation works the same way as the MPI implementation. This makes the results exactly the same, and we also get a nice speedup, since gradient pre-calculation seems to pay off, at least in the models I have looked at.

Best,
Sebastian

Sorry I missed this.

The answer is that floating-point arithmetic isn’t associative, so any difference in evaluation order will lead to (slightly) different results.
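
As a tiny stand-alone illustration (nothing to do with the MPI code itself): summing the same terms in a different order can already change the result in double precision, and the adjoint accumulation in the two AD graph layouts reorders exactly this kind of sum.

  // Double-precision addition is not associative: the same three terms
  // summed in a different order give different results.
  #include <cstdio>

  int main() {
    double left = (0.1 + 0.2) + 0.3;   // one ulp above the double closest to 0.6
    double right = 0.1 + (0.2 + 0.3);  // exactly the double closest to 0.6
    std::printf("%.17g\n%.17g\nequal: %d\n", left, right, left == right);  // not equal, prints 0
    return 0;
  }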

Interesting. It was the distinction between the latter two I was trying to get at in the meeting. So it’s not multi-processing or hyper-threading. Given that the two versions do the same arithmetic and that the MPI version does more constructing and copying, the only guess I have is that it’s memory locality. If you get much better cache locality when you’re done, that can easily make an 80% (or even 800%) difference in runtime.

This does bring up an interesting issue for Stan in that we can do this ourselves in lots of cases. If we had graphical models, we could even do it automatically in many cases.

That’s certainly possible. It’s a different order of evaluation. The only place I can imagine big differences is with rounding, since there’s a bunch of multiply-add operations to pass adjoints.

I’m pretty sure the divergences aren’t Stan divergences, just the chains behaving slightly differently numerically.

But Sebastian isn’t talking about forward mode here, but about nested reverse mode. That too rearranges terms, so maybe that is why it would be more stable, though I can’t see why that would hold in general.

Thanks! That was on my to-do list, but I didn’t realize (even from looking at the branch) that you’d already done it! Much better to have it this way so that we have some optimization latitude and so that results match. The only reason I was suggesting not doing it this way was complexity of implementation and not wanting to depend on any MPI code. You solved the latter problem by neatly factoring.

Sean’s OK with the overall design, so now I’m on the hook to do the code review. Hopefully Daniel can help me here; he did this the last time. It sounds like we’re getting close, other than the way we’re going to do builds, which I’m pretty agnostic about out of ignorance.

I also think it is very likely that memory locality is better with the nested reverse mode / precomputed_gradients approach in map_rect. There seems to be a good deal of optimization potential here for other stan-math functions, and analyzing the AD graph seems to be a promising approach for greater performance. However, it is still a bit early for me to conclude, since I have not yet really understood how this speedup comes about; good that we seem to agree on the likely reason. I wonder how one can measure memory locality properties with profilers… does anyone have experience with that (I haven’t done my Google searches on this yet)?

Great. Developing things to this point was a big effort, but integrating it will be much easier as a joint effort. I will apply the new formatting magic to make things easier to read. This MPI stuff required a few “creative” solutions, so I would not be surprised if you have questions during the review; feel free to ask.

I’m pretty sure the profilers let you measure cache misses, which is a very good proxy for memory locality.