I have made a few odd observations when comparing MPI to non-MPI code:
- MPI code gives exactly the same results regardless of the number of CPUs used
- comparing non-MPI code with MPI gives exactly the same results in the model's diagnose mode
- running models with non-MPI vs. MPI leads to diverging chains after roughly 100 iterations. The differences are very small, but they exist.
The only relevant difference I can think of is that the AD graph is represented differently in memory between the two versions. The MPI version of map_rect calculates all derivatives on the spot and inserts them as precomputed_gradients into the AD graph. The non-MPI version, on the other hand, creates a somewhat larger AD graph, since results "past" map_rect are still represented as a large graph instead of as precomputed gradients. Keeping things in one big graph and calling grad later leads to different floating-point rounding, because the two different graphs are traversed (and their terms accumulated) in different orders.
Would others agree on this argument or is this nonsense?
If that is correct, then we would have to provide a non-MPI
map_rect implementation which does the same thing as the 1-core MPI version (on-the-spot calculation). Otherwise, exact reproducibility between MPI and non-MPI runs is lost (although MPI run results would still not depend on the number of CPUs used).
Thanks for any thoughts and comments.