If any other C++ devs aren’t following the GitHub discussion, here’s what we’re talking about. If Steve’s not available, input from anyone is appreciated.
Again, the benchmarks only covered executing stan/math/prim/fun/exp.hpp. The text file with the results is available, and I’m happy to give a walkthrough. It may also have had a positive effect on final Stan run times, but that’s hard to tell because the runs are stochastic. I’m posting here for exposure to what I’m investigating, and any answers to my questions are appreciated.
If multithreading prim improves performance, sure, there’s no reason not to thread it. Especially if it’s something used repeatedly in the library, the speed gains could compound, but this needs to be evaluated more thoroughly.
But I think the devs are focused on multithreading AD, maybe sending a different thread per expression tree? If someone wants to elaborate a bit, that’s cool.
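To make the prim-threading idea concrete, here is a minimal sketch, not the code on the branch. It splits an element-wise `exp` over a `std::vector<double>` into contiguous chunks and runs one `std::thread` per chunk. The branch name suggests TBB; plain `std::thread` is used here only to keep the example self-contained, and `parallel_exp` and `n_threads` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not part of Stan Math: element-wise exp computed over
// contiguous chunks, one std::thread per chunk.
std::vector<double> parallel_exp(const std::vector<double>& x,
                                 std::size_t n_threads) {
  std::vector<double> out(x.size());
  if (x.empty()) return out;

  // Never use more threads than elements, and always use at least one.
  n_threads = std::max<std::size_t>(1, std::min(n_threads, x.size()));
  const std::size_t chunk = (x.size() + n_threads - 1) / n_threads;

  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < n_threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(begin + chunk, x.size());
    if (begin >= end) break;
    // Each worker writes a disjoint slice of out, so no locking is needed.
    workers.emplace_back([&x, &out, begin, end] {
      for (std::size_t i = begin; i < end; ++i) out[i] = std::exp(x[i]);
    });
  }
  for (auto& w : workers) w.join();
  return out;
}
```

Spawning and joining threads has a fixed cost, so for small inputs a version like this can easily be slower than the serial loop. That is why the thread count and input size both need to be part of any benchmark.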
develop ← drezap:feature/issue-3311-test-thread-tbb-exp
> Looking at the jenkins it seems
Ok, let me check Travis CI. I have to change … local software to match so I can reproduce.
> Do you mean the results from the stan build-bot running the performance regression tests?
No, I mean the local benchmarks, C++ only: the threaded version was objectively faster in the cases where I suspected it would be, measured with internal print statements. I attached a file here, but it's not fun to sift through. There's only one iteration for the non-threaded case, because there's nothing to scale, but initially I tried scaling by threads, blocks, dataset size, etc. It's a repeatable experiment.
[benchmarking_multhreading.txt](https://github.com/user-attachments/files/27549456/benchmarking_multhreading.txt). The version without perfect forwarding is faster. I can do more robust tests.
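For the more robust tests, one common approach is to repeat each measurement and keep the fastest run, which damps scheduler noise. This is only a sketch of that idea, not the script that produced the attached file; `best_time_ms` and `reps` are hypothetical names.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper: run fn `reps` times and return the fastest wall-clock
// time in milliseconds. Returns infinity if reps is zero or negative.
template <typename F>
double best_time_ms(F&& fn, int reps) {
  double best = std::numeric_limits<double>::infinity();
  for (int r = 0; r < reps; ++r) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    best = std::min(best, elapsed.count());
  }
  return best;
}
```

The timed callable should do something the compiler cannot optimize away, for example writing its result into a variable that is read afterwards. Otherwise the serial and threaded timings are not comparable.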
**EDIT:**
And then the benchmarks were run with these scripts, on this branch: `test/unit/math/prim/fun/exp_test.cpp`. I was modifying the typing as I went, but you can reproduce each variant via the pushes. If not, I'm down to do a screen share and I can just show you. It may take 10-15 minutes.
Reading some literature today (C++ Concurrency in Action: Practical Multithreading, Williams 2012), it agreed with some of my thoughts about the cost of distributing and collecting threads. But when I removed `const`, passed by reference, and added an lvalue instantiation `type &blah = a;`, it got way slower. You guys would probably know why. In the last commit, I did a non-`const` lvalue instantiation with `my_a`, and then another lvalue instantiation inside a function within the class, and that made it slower. I think I'm making extra copies somewhere in memory? But commit [6837a62](https://github.com/stan-dev/math/commit/6837a52915f5859e7532dc614a45d27f4eef2426) was the fastest and agreed with the literature (i.e. too many threads caused a slowdown, but with the right number of threads it was faster, and it also passed all Jenkins tests).
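One way to check the extra-copies guess is to count copy constructions directly. The sketch below uses a hypothetical instrumented type, `Counted`, not anything from Stan Math. In standard C++, binding a reference (`Counted& my_a = a;`) does not copy, while declaring a new object (`Counted my_a = a;`) or passing by value does.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical instrumented type: counts how often it is copy-constructed.
struct Counted {
  static int copies;
  std::vector<double> data;
  explicit Counted(std::size_t n) : data(n, 1.0) {}
  Counted(const Counted& other) : data(other.data) { ++copies; }
};
int Counted::copies = 0;

double total(const Counted& c) {
  return std::accumulate(c.data.begin(), c.data.end(), 0.0);
}

// No copy: the parameter is a const reference.
double sum_by_const_ref(const Counted& a) { return total(a); }

// One copy: the argument is copied into the by-value parameter.
double sum_by_value(Counted a) { return total(a); }

// No copy: my_a is just another name for a.
double sum_via_local_ref(Counted& a) {
  Counted& my_a = a;
  return total(my_a);
}

// One copy: my_a is a new object initialized from a.
double sum_via_local_copy(Counted& a) {
  Counted my_a = a;
  return total(my_a);
}
```

If a counter like this stays at zero on the slow path, the slowdown is probably coming from something other than copies, and that would need profiling to pin down.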
> If you are interested in this I would focus on seeing if there are ways you can do parallelism on the reverse mode code. That is a pretty hard problem though that I have not found a nice answer to yet.
Ok, but if multithreading the simple stuff adds performance gains, it's worth adding. If not, I'm not offended.
For reverse mode autodiff, can you give me a more formal project spec in an issue? Then I'll look into it. Are you asking whether, given functions f_i(.), we can send a different thread through each function to build its expression tree in parallel? So suppose we need to compute derivatives for f(.) and g(.): do we want to build the two expression trees at once using concurrency? I'm trying to specify the problem more clearly. Perhaps I'm not understanding.
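To pin down what I mean by building two expression trees at once, here is a toy sketch. It is my reading of the question, not Stan's `var` machinery: `Tape`, `grad_f`, `grad_g`, and `grads_in_parallel` are all hypothetical. Each task records its own tiny tape and runs its own reverse sweep, so the two derivatives share no mutable state.

```cpp
#include <cmath>
#include <future>
#include <vector>

// Hypothetical minimal reverse-mode tape: each node stores its parents and
// the local partial derivatives with respect to them.
struct Tape {
  struct Node { int lhs, rhs; double dlhs, drhs; };
  std::vector<Node> nodes;
  std::vector<double> values;

  int push(double v, int lhs, int rhs, double dlhs, double drhs) {
    nodes.push_back(Node{lhs, rhs, dlhs, drhs});
    values.push_back(v);
    return static_cast<int>(nodes.size()) - 1;
  }
  int input(double v) { return push(v, -1, -1, 0.0, 0.0); }
  int mul(int a, int b) {
    return push(values[a] * values[b], a, b, values[b], values[a]);
  }
  int exp(int a) {
    const double e = std::exp(values[a]);
    return push(e, a, -1, e, 0.0);
  }
  int sin(int a) {
    return push(std::sin(values[a]), a, -1, std::cos(values[a]), 0.0);
  }

  // Reverse sweep: adjoint of every node with respect to node `out`.
  std::vector<double> grad(int out) const {
    std::vector<double> adj(nodes.size(), 0.0);
    adj[out] = 1.0;
    for (int i = out; i >= 0; --i) {
      const Node& n = nodes[i];
      if (n.lhs >= 0) adj[n.lhs] += n.dlhs * adj[i];
      if (n.rhs >= 0) adj[n.rhs] += n.drhs * adj[i];
    }
    return adj;
  }
};

// f(x) = x * exp(x), so df/dx = (1 + x) * exp(x).
double grad_f(double x) {
  Tape t;
  const int xi = t.input(x);
  const int y = t.mul(xi, t.exp(xi));
  return t.grad(y)[xi];
}

// g(x) = sin(x) * x, so dg/dx = x * cos(x) + sin(x).
double grad_g(double x) {
  Tape t;
  const int xi = t.input(x);
  const int y = t.mul(t.sin(xi), xi);
  return t.grad(y)[xi];
}

struct Grads { double df; double dg; };

// Each async task owns its tape, so the two trees are built concurrently.
Grads grads_in_parallel(double x) {
  auto f = std::async(std::launch::async, grad_f, x);
  auto g = std::async(std::launch::async, grad_g, x);
  return {f.get(), g.get()};
}
```

This only covers the easy case where f and g are independent. Parallelism inside a single expression tree, where nodes share parents, is a different and harder problem.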
**EDIT: WRT Travis CI:**
https://app.travis-ci.com/github/stan-dev/stan-dev.github.io/builds/259743285
The last updates I'm seeing are from 3 years ago? Am I missing something?