It looks like I am now at the point where MPI is running correctly in my Stan program. I get the same results from the MPI-parallelized program on 4 cores as from the 1-core run. Looking at the running times, there is apparently some overhead (the wall time drops by roughly 3x rather than 4x, and the total user time goes up):
MPI 4 cores, 40 jobs:
real 1m1.516s
user 4m3.943s
sys 0m0.727s
1 core, 40 jobs:
real 3m7.077s
user 3m6.306s
sys 0m0.324s
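To put numbers on the overhead, here is a quick sketch in Python that works out the speedup, parallel efficiency and extra CPU time from the timings above. The helper `to_seconds` and the variable names are just my own for illustration, nothing from Stan or cmdstan, and "efficiency" here simply means speedup divided by the number of cores.

```python
def to_seconds(t: str) -> float:
    """Convert a `time`-style string such as '1m1.516s' to seconds."""
    minutes, seconds = t.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


cores = 4

# Wall-clock ("real") and CPU ("user") times copied from the runs above.
real_mpi = to_seconds("1m1.516s")
real_serial = to_seconds("3m7.077s")
user_mpi = to_seconds("4m3.943s")
user_serial = to_seconds("3m6.306s")

speedup = real_serial / real_mpi          # how much faster the wall time is
efficiency = speedup / cores              # 1.0 would be perfect scaling
cpu_overhead = user_mpi / user_serial - 1 # extra total CPU work under MPI

print(f"speedup: {speedup:.2f}x")
print(f"parallel efficiency: {efficiency:.0%}")
print(f"extra CPU time: {cpu_overhead:.0%}")
```

With the numbers above this comes out at roughly a 3.04x speedup, about 76% parallel efficiency, and about 31% more total CPU time than the single-core run.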
However, this is the very first working prototype, so memory alignment has not yet been given much attention. Should we think about a cluster-stan edition of cmdstan (or a contrib directory in cmdstan)?
I am curious how programs with analytic gradients can benefit from this approach. Let's see.
Sebastian