Parallelization of large vectorized expressions

Yup. I think we should implement a global thread pool to gain some more control over this.

I think this is what I am suggesting here, no? I am suggesting that the same number of threads gives the same result. That already gives us a lot of freedom, and people would get exactly the same numbers when running with exactly the same number of threads. That would be fine for me.
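To make the "same number of threads = same result" idea concrete, here is a minimal sketch. It is not the code on the branch; the function name `chunked_sum` and the chunking scheme are assumptions for illustration. The point is that the chunk boundaries depend only on the input size and the thread count, and the partial sums are combined in a fixed order, so the floating-point rounding is the same on every run with the same thread count.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: sum `x` using `num_threads` worker threads.
// Chunk boundaries depend only on x.size() and num_threads, and the
// partial sums are added in chunk order, so repeated calls with the
// same num_threads give bit-identical results.
double chunked_sum(const std::vector<double>& x, std::size_t num_threads) {
  if (num_threads == 0) num_threads = 1;
  const std::size_t n = x.size();
  std::vector<double> partial(num_threads, 0.0);
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = n * t / num_threads;
    const std::size_t end = n * (t + 1) / num_threads;
    workers.emplace_back([&x, &partial, t, begin, end] {
      double acc = 0.0;
      for (std::size_t i = begin; i < end; ++i) acc += x[i];
      partial[t] = acc;  // each thread writes only its own slot
    });
  }
  for (auto& w : workers) w.join();
  // Fixed-order reduction over the per-thread partial sums.
  double total = 0.0;
  for (double p : partial) total += p;
  return total;
}
```

With a different thread count the chunks change, so the last few bits of the result may differ; that is the freedom this rule deliberately leaves open.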

I have completed a small POC for this:

  • 10^7 terms
  • Poisson lpmf
  • lambda parameter is a var

What I am doing is computing the lpmf and its gradient, nothing more.
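For readers who want to see the shape of such a computation, here is a rough standalone sketch, not the stan-math code from the branch. It uses plain doubles instead of `var`, and the names `LpmfResult`, `log_factorial`, and `poisson_lpmf_parallel` are made up for illustration. Each term is n_i * log(lambda) - lambda - log(n_i!), and the derivative with respect to the shared lambda is n_i / lambda - 1.

```cpp
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Summed log-pmf and its derivative with respect to lambda.
struct LpmfResult {
  double value;
  double dlambda;
};

// log(k!) by direct summation: a simple, thread-safe stand-in
// for lgamma(k + 1). Assumes k >= 0.
inline double log_factorial(int k) {
  double acc = 0.0;
  for (int i = 2; i <= k; ++i) acc += std::log(static_cast<double>(i));
  return acc;
}

// Sum of Poisson log-pmf terms over `counts` for one shared rate
// `lambda` (assumed > 0), plus the analytic gradient
// d/dlambda = sum_i (n_i / lambda - 1).
LpmfResult poisson_lpmf_parallel(const std::vector<int>& counts,
                                 double lambda, std::size_t num_threads) {
  if (num_threads == 0) num_threads = 1;
  const std::size_t n = counts.size();
  const double log_lambda = std::log(lambda);
  std::vector<LpmfResult> partial(num_threads, LpmfResult{0.0, 0.0});
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = n * t / num_threads;
    const std::size_t end = n * (t + 1) / num_threads;
    workers.emplace_back([&counts, &partial, lambda, log_lambda, t, begin, end] {
      LpmfResult acc{0.0, 0.0};
      for (std::size_t i = begin; i < end; ++i) {
        const int k = counts[i];
        acc.value += k * log_lambda - lambda - log_factorial(k);
        acc.dlambda += k / lambda - 1.0;
      }
      partial[t] = acc;  // each thread writes only its own slot
    });
  }
  for (auto& w : workers) w.join();
  // Combine the per-thread value and gradient in chunk order.
  LpmfResult total{0.0, 0.0};
  for (const LpmfResult& p : partial) {
    total.value += p.value;
    total.dlambda += p.dlambda;
  }
  return total;
}
```

In an autodiff setting the per-chunk gradient would be attached to the lambda operand rather than returned in a struct; that part is left out here.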

Note that the 8-core run uses hyperthreading (my MacBook has 4 cores). See the attached results.

Is that convincing enough to continue? Thoughts?

The code is on the stan-math branch parallel-lpdf in case you want to look (this is really only a POC, nothing more).

Hopefully I will find the time to apply this to a real problem to see how it performs there.

Best,
Sebastian

lpmf-multicore.pdf (5.6 KB)