Now that map_rect is out the door, I think there is one more critical case to cover for Stan to scale well. With map_rect one usually has to break vectorized code into non-vectorized code, which is not good for performance. Thus, we should probably add facilities to Stan which parallelize the vectorized lpdf/lpmf function calls. This should not be hard, since it usually comes down to a double-only for loop (in the case of rev-only programs). For big models I do think we can gain quite a bit of speed if we run those big for loops using multi-threading (basically what is now on the OpenMP branch from Ben G).
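To make the "double-only for loop" idea concrete, here is a minimal sketch that splits a normal log-density sum into contiguous slices and runs one task per slice with C++11 `std::async`. The names `normal_lpdf_slice` and `parallel_normal_lpdf` are hypothetical and are not part of stan-math; the sketch assumes double-only arguments and ignores gradients entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>
#include <cmath>

// Hypothetical helper (not stan-math API): sums the normal log density
// of y[begin, end) for double-only mu and sigma.
double normal_lpdf_slice(const std::vector<double>& y, std::size_t begin,
                         std::size_t end, double mu, double sigma) {
  const double half_log_two_pi = 0.9189385332046727;  // 0.5 * log(2 * pi)
  const double log_sigma = std::log(sigma);
  double lp = 0.0;
  for (std::size_t i = begin; i < end; ++i) {
    const double z = (y[i] - mu) / sigma;
    lp += -0.5 * z * z - log_sigma - half_log_two_pi;
  }
  return lp;
}

// Hypothetical driver: cuts the loop into up to n_threads contiguous slices,
// launches one asynchronous task per slice, and adds the partial sums in
// slice order.
double parallel_normal_lpdf(const std::vector<double>& y, double mu,
                            double sigma, std::size_t n_threads) {
  assert(n_threads > 0);
  const std::size_t n = y.size();
  const std::size_t chunk = (n + n_threads - 1) / n_threads;  // ceil(n / n_threads)
  std::vector<std::future<double>> parts;
  for (std::size_t begin = 0; begin < n; begin += chunk) {
    const std::size_t end = std::min(begin + chunk, n);
    parts.push_back(std::async(std::launch::async, normal_lpdf_slice,
                               std::cref(y), begin, end, mu, sigma));
  }
  double lp = 0.0;
  for (auto& part : parts) {
    lp += part.get();  // blocks until that slice has finished
  }
  return lp;
}
```

Calling `parallel_normal_lpdf(y, mu, sigma, 4)` should agree with the serial `normal_lpdf_slice(y, 0, y.size(), mu, sigma)` up to floating-point rounding. Spawning a fresh task per slice on every call is the simplest thing that works; a real implementation would likely want to reuse threads.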
That means we need to turn for loops into some parallel execution scheme. The question I now have is whether people would insist on exact reproducibility no matter what (so independent of the number of threads), or whether it would be enough to make results exactly reproducible only when the same number of threads is used. The thing is that we often accumulate lp over the iterations, and since floating-point arithmetic is not associative, the order in which we add things up really matters.
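The two options can be illustrated with a small sketch. It is only one possible scheme, and all function names here are hypothetical. If the chunk boundaries depend only on a fixed chunk size and the partial sums are added in chunk order, the result does not depend on how many threads did the work. If each thread simply gets one chunk, the boundaries (and so the rounding) move with the thread count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Serial left-to-right sum of x[begin, end).
double sum_range(const std::vector<double>& x, std::size_t begin,
                 std::size_t end) {
  double s = 0.0;
  for (std::size_t i = begin; i < end; ++i) s += x[i];
  return s;
}

// Chunk boundaries depend only on chunk_size. Threads pick chunks
// round-robin, each partial sum goes into its own slot, and the slots are
// added in chunk order, so the result is the same for any n_threads.
double sum_fixed_chunks(const std::vector<double>& x, std::size_t chunk_size,
                        std::size_t n_threads) {
  assert(chunk_size > 0 && n_threads > 0);
  const std::size_t n = x.size();
  const std::size_t n_chunks = (n + chunk_size - 1) / chunk_size;
  std::vector<double> partial(n_chunks, 0.0);
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < n_threads; ++t) {
    workers.emplace_back([&x, &partial, t, n, n_chunks, chunk_size, n_threads] {
      for (std::size_t c = t; c < n_chunks; c += n_threads) {
        const std::size_t begin = c * chunk_size;
        partial[c] = sum_range(x, begin, std::min(begin + chunk_size, n));
      }
    });
  }
  for (auto& w : workers) w.join();
  return sum_range(partial, 0, n_chunks);
}

// One chunk per thread: the boundaries, and therefore the rounding,
// change with n_threads.
double sum_chunk_per_thread(const std::vector<double>& x,
                            std::size_t n_threads) {
  assert(n_threads > 0);
  if (x.empty()) return 0.0;
  const std::size_t chunk_size = (x.size() + n_threads - 1) / n_threads;
  return sum_fixed_chunks(x, chunk_size, n_threads);
}
```

For `x = {1e16, 0.0, 1.0, 1.0}`, `sum_chunk_per_thread` gives `1e16` with one thread but `1e16 + 2` with two, because `1e16 + 1.0` rounds back to `1e16` while `1.0 + 1.0` is added first in the two-chunk case. `sum_fixed_chunks(x, 2, n_threads)` gives `1e16 + 2` for any thread count. The price of the fixed-chunk variant is that the chunk size becomes part of the definition of the result.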
My current plan is to get a proof of concept (POC) up and running to demonstrate the benefits of this, and then implement it in stan-math in a way that hopefully needs only C++11 features.
(Should this work, we should probably think about a thread-pool implementation soon.)
Ah, and anyone who wants to help or has comments is very welcome, of course!