After having map_rect out of the door I think there is one more critical case to get Stan scale well. With map_rect one usually breaks vectorized code into non-vectorized code which is not so good for the performance. Thus, we should probably add facilities to Stan which parallelize the vectorized lpdf/lpmf function calls. This should not be hard as usually this comes down to a double-only for-loop (in the case of rev only programs). For big models I do think that we can gain quite a bit of speed if we run those big for loops using multi-threading (basically what is now on the OpenMP branch from Ben G).

That means that we need to turn for loops into some parallel execution scheme. The question I now have is if people would insist on exact reproducibility no matter what (so independent of the number of threads) or if we could make things only exactly reproducible if the same number of threads are being used. The thing is, that we often accumulate lp over the iterations and since floating point arithmetic is not associative it really matters how we do things in this regard.

My current plan is to get a POC up and running to demonstrate the benefits of this and then implement it in stan-math in a way which hopefully only needs C++11 features.

(should this work, then we should probably think about a thread-pool implementation soon).

Ah, and anyone interested in helping/having comments is very welcome, of course!

Isn’t that the whole point—we break something relatively homogeneous down into parts? Each of the parts will itself be vectorized, so there’s probably some redundant calculation.

I agree that vectorizing the lpdf and lpmf itself would also be possible. We have to figure out how to specify what’s going to happen with multiple layers of concurrency.

Another approach to reproducibility is to require a different operating config for reproduciblity.

Yup. I think we should implement a global thread pool to gain some more control over this.

I think this is what I suggesting here, no? I am suggesting same # of threads = same result. That gives us already a lot of freedom and people would get exactly the same numbers when running with exactly the same number of threads. That would be fine for me.

I have completed a small POC for this:

10^7 terms

Poisson lpmf

lambda parameter is a var

What I am doing is to compute the lpdf and it’s gradient, not more.

Note that the 8 core run is using hyperthreading (my MacBook has 4 cores). See the attached results.

Is that convincing to continue? Thoughts?

The code is on the stan-math branch parallel-lpdf in case you want to look (this is really only a POC, not more).

Hopefully I find the time to apply this to a real problem to see how it performs there.

You are reading it right, but I refined my benchmark a bit and now it is more promising. The example above did parallelize one gradient calculation for 10^7 terms. Now I turned this into 100 repeated evaluations of 10^5 terms. With this scheme I am seeing a doubling in speed for 4 cores, which sounds good to me.

Ultimately, I would like to see the gains for a model + data set I recently came accross. For that I need to code through the poisson_log_lpmf and the binomial_logit_lpmf.

For now I still think that this approach has merits - mostly by the fact that users should ideally only need to switch it on and get a speedup. No hassle to recode anything. Doubling in speed with 4 cores would be nice, but let’s see a real case.

I totally forgot that I do not even need to turn on thread_local for the AD stack when I parallelize the lpdf+lpmf functions. This is only about parallelizing double only loops which is safe regardless. Now I get on my toy example a doubling in speed with 3 cores. Very motivating.

Hmm… playing a little more with things, we may actually not really need to parallelize the lpdfs. Instead we can just use map_rect and get very similar performance. What I tried is to calculate from 5x10^4 terms the poisson log-lik and it’s gradient. This was chunked into 5 chunks for the map_rect version whereas the threaded version evaluated everything in a single go. I then varied the MPI cores and the threads a little and got these runtimes (the mr thing is map_rect):

1 core
[ RUN ] large.mr_poisson_lpmf
[ OK ] large.mr_poisson_lpmf (5089 ms)
[ RUN ] large.poisson_lpmf
[ OK ] large.poisson_lpmf (4353 ms)
3 MPI cores, 1 thread
[ RUN ] large.mr_poisson_lpmf
[ OK ] large.mr_poisson_lpmf (2535 ms)
[ RUN ] large.poisson_lpmf
[ OK ] large.poisson_lpmf (4457 ms)
1 MPI core, 3 threads
[ RUN ] large.mr_poisson_lpmf
[ OK ] large.mr_poisson_lpmf (2954 ms)
[ RUN ] large.poisson_lpmf
[ OK ] large.poisson_lpmf (2019 ms)
1 MPI core, 5 threads
[ RUN ] large.mr_poisson_lpmf
[ OK ] large.mr_poisson_lpmf (2896 ms)
[ RUN ] large.poisson_lpmf
[ OK ] large.poisson_lpmf (1604 ms)
5 MPI cores, 1 thread
[ RUN ] large.mr_poisson_lpmf
[ OK ] large.mr_poisson_lpmf (1632 ms)
[ RUN ] large.poisson_lpmf
[ OK ] large.poisson_lpmf (4484 ms)

So we see, we get almost the same speedup with MPI map_rect in comparison to the thread thing. Note that it is important to use the MPI map_rect backend which avoids the need for a thread_local AD stack.

To me this means we should invest into some data reshaping utilities first in order to get map_rect being used easily. Parallelizing the lpdf looks to me as is if this would be a convenience, but we can get the speedups with map_rect already (most of it).

This is why I’ve been so excited about MPI all along. It’s so wonderfully general for the kinds of problems we have. Multi-threading in these data-only cases is appealing, though.

Maybe I start with a case study or example model. From there we can see what is doable to push into Stan. One thing which I could image is usable is to map 1D storage via (ideally nested) slice indices to higher order structures. That way one can flatten the data, but one would still be able to address the data in a convenient form.