Hi!

I have made a few odd observations when comparing MPI to non-MPI code:

- MPI code gives *exactly* the same results regardless of the # of CPUs.
- Comparing non-MPI code with MPI results gives exactly the same results for the `diagnose` mode of the model.
- Running models with non-MPI vs. MPI leads to **diverging chains after 100 iterations or so**. The differences are very small, but they exist.

The only difference I can think of that could be relevant here is that the in-memory representation of the AD graph differs between the two versions. The MPI version of `map_rect` calculates all derivatives on the spot and inserts them as `precomputed_gradients` into the AD graph. The non-MPI version, on the other hand, creates a somewhat larger AD graph, since results "past" `map_rect` are still represented as a large sub-graph instead of precomputed gradients. Keeping things as a big graph and calling `grad` later leads to different rounding behavior in the floating-point arithmetic, because the two different graphs are traversed in a different order.

Would others agree on this argument or is this nonsense?

If that is correct, then we would have to provide a non-MPI `map_rect` implementation which does the same thing as the 1-core MPI version (on-the-spot calculation). Otherwise, exact reproducibility would be lost between MPI and non-MPI runs (although MPI results would still not depend on the # of CPUs used).

Thanks for any thoughts and comments.

Best,

Sebastian