Yup. I think we should implement a global thread pool to gain some more control over this.
I think this is what I was suggesting here, no? I am suggesting that the same number of threads should give the same result. That already gives us a lot of freedom, and people would get exactly the same numbers when running with exactly the same number of threads. That would be fine for me.
I have completed a small POC for this:
- 10^7 terms
- Poisson lpmf
- lambda parameter is a var
All I am doing is computing the lpdf and its gradient, nothing more.
Note that the 8 core run is using hyperthreading (my MacBook has 4 cores). See the attached results.
Is that convincing to continue? Thoughts?
The code is on the stan-math branch parallel-lpdf in case you want to take a look (this is really only a POC, nothing more).
Hopefully I find the time to apply this to a real problem to see how it performs there.
Best,
Sebastian
lpmf-multicore.pdf (5.6 KB)