Hi!

Damn…took me forever to get this idea: the bottleneck to parallelizing Stan programs is autodiff, which cannot be used in a threaded program because its design relies on singletons.

Now, MPI runs a program in separate processes, so AD can run within each MPI process without a problem. Attached is a toy example program which demonstrates that the gradient can be calculated in each sub-process. Running the program with `mpirun -np 4 ./hello`

gives me this output:

```
1
./hello
Hello World! from process 1
Worker 1: Gradient of f for theta: 1, 2; gradient: 1, 4
1
./hello
Hello World! from process 2
Worker 2: Gradient of f for theta: 1, 4; gradient: 1, 8
1
./hello
Hello World! from process 0
1
./hello
Hello World! from process 3
Worker 0: Gradient of f for theta: 1, 0; gradient: 1, 0
Worker 3: Gradient of f for theta: 1, 6; gradient: 1, 12
```

The output is garbled due to concurrent writes to std::cout, but the important points are:

- It does not crash
- AD runs
- The gradients per process are correct

Voila!

I think there is now a very clear path forward to get within-chain parallelization going:

- We need to distribute the transformed data to all children once
- The really expensive functions (ODEs, whatever) are handled by the MPI children
- The MPI children only receive parameters, for which function values and their gradients get calculated; this works for any function, since AD can simply be run. Results get passed around in vector-only format, which is then fed into the main-process AD stack via pre-computed gradients.

That should solve any performance problems given enough machines. The only thing to watch out for is to cut the Stan programs into clever blocks such that the communication cost stays low.

And to integrate this with the Stan language, we could probably think about an `_mpi` suffix for functions which can be used with this approach.

Sebastian

`hello.cpp`:
```
#include <boost/mpi.hpp>
#include <boost/math/tools/promotion.hpp>
#include <iostream>
#include <stan/math.hpp>

// Functor f(a, b) = a + b^2, templated so that Stan's autodiff can
// instantiate it with stan::math::var.
struct fun {
  template <typename T>
  T operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& theta) const {
    using std::pow;
    T a = theta[0];
    T b = theta[1];
    return a + pow(b, 2);
  }
};

int main(int argc, char* argv[]) {
  boost::mpi::environment env(argc, argv);
  boost::mpi::communicator world;

  std::cout << argc << std::endl;
  std::cout << argv[0] << std::endl;
  std::cout << "Hello World! from process " << world.rank() << std::endl;

  fun f;

  // Each process evaluates the gradient at its own theta.
  Eigen::VectorXd theta(2);
  theta(0) = 1;
  theta(1) = 2 * world.rank();

  double fx;
  Eigen::VectorXd grad_fx;
  stan::math::gradient(f, theta, fx, grad_fx);

  std::cout << "Worker " << world.rank()
            << ": Gradient of f for theta: " << theta(0) << ", " << theta(1)
            << "; gradient: " << grad_fx(0) << ", " << grad_fx(1) << std::endl;
  return 0;
}
```

Note: To get the above running, you need OpenMPI as well as the Boost MPI and Boost Serialization libraries installed and linked.
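For reference, a build along these lines should work (the include paths and library names are assumptions; adjust them for your installation of Stan Math, Boost, and OpenMPI):

```shell
# Compile with the MPI wrapper compiler; paths are placeholders.
mpic++ -O2 hello.cpp -o hello \
  -I /path/to/stan-math \
  -I /path/to/stan-math/lib/eigen \
  -lboost_mpi -lboost_serialization

# Run with 4 processes:
mpirun -np 4 ./hello
```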