Parallelization (again) - MPI to the rescue!

It looks like I am now at the point where MPI is running correctly in my Stan program. I get the same results from the MPI-parallelized program on 4 cores as on 1 core. Looking at the running times, there is apparently some overhead: wall-clock time drops by about 3x rather than 4x, and total user CPU time goes up by roughly 30%:

MPI 4 cores, 40 jobs:
real	1m1.516s
user	4m3.943s
sys	0m0.727s

1 core, 40 jobs:
real	3m7.077s
user	3m6.306s
sys	0m0.324s
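
For reference, the speedup and parallel efficiency implied by these timings can be worked out with a small Python sketch. The helper name `to_seconds` and the variable names are made up for illustration and are not part of Stan or cmdstan; the only inputs are the `real` and `user` values quoted above.

```python
def to_seconds(t):
    """Convert a `time`-style string such as '1m1.516s' to seconds."""
    minutes, rest = t.split("m")
    return 60.0 * float(minutes) + float(rest.rstrip("s"))

cores = 4

# Wall-clock ("real") and CPU ("user") times copied from the two runs above.
real_mpi = to_seconds("1m1.516s")
user_mpi = to_seconds("4m3.943s")
real_serial = to_seconds("3m7.077s")
user_serial = to_seconds("3m6.306s")

# Speedup: how much faster the 4-core run finishes in wall-clock time.
speedup = real_serial / real_mpi

# Parallel efficiency: fraction of the ideal 4x speedup actually achieved.
efficiency = speedup / cores

# Extra CPU time spent by the MPI run relative to the serial run.
cpu_overhead = user_mpi / user_serial - 1.0

print(f"speedup:      {speedup:.2f}x")
print(f"efficiency:   {efficiency:.0%}")
print(f"CPU overhead: {cpu_overhead:.0%}")
```

This only restates the numbers already reported; it says nothing about where the overhead comes from.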

However, this is only the very first working prototype, so details such as memory alignment have not yet been given much attention. Should we think about a cluster-stan edition of cmdstan (or a contrib directory in cmdstan)?

I am curious how programs with analytic gradients can benefit from this approach. Let's see.

Sebastian