More good news: I got interested in the performance of this MPI stuff on analytic programs. So far I had only tested ODE problems, where MPI easily gives great speedups since the numerical cost per evaluation is so heavy that the extra communication overhead is negligible, as my previous benchmark showed.
So I am now calculating the same problem, but instead of an ODE integrator I am using analytical expressions, and I have increased the number of jobs by a factor of 10, to 3200 in total. Here are the strong scaling results:
These results are much better than I expected: we go from a 2.4 h run time on 1 core down to just 10 minutes on 20 cores, which is a whopping 14x speedup!
However, thinking about it, I figured that we can probably do better. The problem with analytical expressions is that they are so cheap to evaluate that increasing the number of job chunks leads to increased communication overhead (to and from the workers). Two changes should help:

We should change the design such that there is a shared 1D parameter array theta and a 2D array eta. The theta array will be distributed to all nodes once, and the 2D array will be chunked as usual. This way we can (depending on the problem) cut down the communication cost to the workers.
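To make the savings concrete, here is a minimal sketch of the idea in Python (the real implementation would be C++/MPI inside Stan; the names `chunk_work`, `theta`, and `eta` are illustrative, not an actual API):

```python
import numpy as np

def chunk_work(theta, eta, num_workers):
    """Split only eta across workers; theta is shared and sent once per worker."""
    eta_chunks = np.array_split(eta, num_workers, axis=0)
    # Each worker receives (theta, its eta chunk) instead of a full
    # per-job parameter pack, so the per-job payload shrinks.
    return [(theta, chunk) for chunk in eta_chunks]

theta = np.zeros(5)          # 5 shared parameters
eta = np.zeros((3200, 2))    # 3200 jobs, 2 job-specific parameters each
work = chunk_work(theta, eta, num_workers=20)

# Doubles shipped to workers: the old design sends theta with every job,
# the new design sends it once per worker.
old_cost = eta.size + theta.size * eta.shape[0]   # theta repeated 3200 times
new_cost = eta.size + theta.size * len(work)      # theta repeated 20 times
```

With these (made-up) sizes, the payload to the workers drops from 22400 to 6500 doubles; the larger theta is relative to eta, the bigger the win.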

(and this is a BIG one): Currently, the function is designed to return all outputs which are generated. That's fine and very useful, but extremely wasteful in a Stan program, where we are only ever interested in the accumulated value of the log prob density and its derivatives. In short, we should introduce a
map_rect_mpi_lpdf
function which would expect each function to return a single value: the accumulated log prob for a given work chunk. This will dramatically reduce the communication from the workers back to the root. We no longer need to communicate all results and all partials; instead, we can run the reduction directly on the workers, summing up all terms there. As a nice byproduct, this will enormously reduce the size of the AD stack on the root node, down to a single entry!!!
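Here is a toy sketch of that reduction, again in plain Python rather than the actual Stan C++ backend, with a made-up per-term lpdf just to show the mechanics:

```python
import numpy as np

def worker_lpdf(theta, eta_chunk):
    """Each worker accumulates its chunk's log-prob terms and the gradient
    w.r.t. the shared theta, returning one scalar and one small gradient
    vector instead of all outputs and all partials."""
    # Toy model: lpdf contribution -0.5 * (eta - theta)^2 per term.
    resid = eta_chunk - theta            # broadcast theta over the chunk
    lp = -0.5 * np.sum(resid ** 2)       # local reduction: a single scalar
    grad = np.sum(resid, axis=0)         # d lp / d theta, summed locally
    return lp, grad

theta = np.array([1.0, -1.0])
eta = np.random.default_rng(0).normal(size=(3200, 2))
chunks = np.array_split(eta, 20, axis=0)

results = [worker_lpdf(theta, c) for c in chunks]
total_lp = sum(lp for lp, _ in results)    # root only sums 20 scalars
total_grad = sum(g for _, g in results)    # ...and 20 small gradient vectors
```

Only 20 scalars and 20 short gradient vectors travel back to the root instead of 3200 results with all their partials, and the root can record the summed result as a single AD stack entry.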
I am currently running a benchmark where point 2 is approximated by just returning a single value per job chunk. The performance looks much better so far; I will see. This is quite exciting, and this technique of running the reduction immediately will be even more beneficial for higher-order autodiff, simply because the reduction can happen directly on the local workers. I really hope I haven't made a mistake in my thinking here… it sounds too good to be true.