As discussed recently in our meetings, I started a wiki on parallelism in Stan here:
The motivating example is hierarchical ODE models, but I am sure that other areas of Stan can benefit from the proposed approach, so I see the page as a general point of reference for this discussion around parallelism. I think we should get our heads around these questions:
Is this a good design principle which we want to introduce to the Stan language?
Any improvements for the design?
Comments are very welcome!
Let me know if sections are not clear and need some more explanations from my end.
@wds15 Are you still around? The wiki doesn’t point to anything anymore, but if there’s a consensus, I’d do some grunt work and build it. I’ve done some parallelism before, in C. Seems like a good time.
I’m digging up something from a long time ago, not sure what updates to NUTS have been done yet.
Sounds like years ago the devs just talked about it and no one said yes or no.
This is an open-source project, so I’d be down to hit it.
Alright, anyone? There was something on @andrewgelman’s blog about another way to parallelize HMC, I don’t remember exactly. I’d be down for that. @Bob_Carpenter Thoughts?
I’m in the mood for some multi-threading in C++. I think @stevebronder has the biggest NUTS on the dev team right now, I’m down to implement this? @wds15 has disappeared.
I need a yes or no before I proceed.
I have successfully merged parallelized C code before, although admittedly, I’m not the best communicator.
A cool project would be a variadic map function with parallelism and/or a reduce sum with an MPI backend (you could reformat the variadic things to the map_rect-style implementation and just write an adapter).
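To make the reduce-sum idea concrete, here is a minimal sketch, assuming plain std::async rather than TBB or MPI. The name parallel_reduce_sum and its signature are hypothetical and not Stan's actual API; the point is only to show how variadic shared arguments can be forwarded to every chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper, not part of Stan: splits `items` into contiguous
// chunks of at most `grainsize` elements, sums f(item, shared...) over each
// chunk in its own task, then adds up the partial sums.
template <typename T, typename F, typename... Shared>
double parallel_reduce_sum(const std::vector<T>& items, std::size_t grainsize,
                           F f, const Shared&... shared) {
  assert(grainsize > 0);
  std::vector<std::future<double>> partials;
  for (std::size_t begin = 0; begin < items.size(); begin += grainsize) {
    const std::size_t end = std::min(begin + grainsize, items.size());
    // Each task only reads `items` and the shared arguments, so no locking
    // is needed as long as `f` itself is safe to call concurrently.
    partials.push_back(std::async(
        std::launch::async, [&items, &f, &shared..., begin, end] {
          double sum = 0.0;
          for (std::size_t i = begin; i < end; ++i) {
            sum += f(items[i], shared...);
          }
          return sum;
        }));
  }
  // Reduce step: combine the per-chunk partial sums on the calling thread.
  double total = 0.0;
  for (auto& p : partials) {
    total += p.get();
  }
  return total;
}
```

A call would look like parallel_reduce_sum(xs, 2, f, a, b), where f(x, a, b) returns a double. This sketch launches one task per chunk, so a real implementation would want a thread pool or a TBB/MPI backend, plus support for autodiff types.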
The parallel NUTS thing I suggested a while ago lets users double the CPU use while giving you a 40% speedup or so… I'd say it's worth it, given it works on any model almost always. The price to pay is greater code complexity as I recall (or you neatly refactor things).
Great, thanks so much. I just needed a green light to see if this is something we wanted. Adding more features can increase maintenance costs, and since it's a small open-source project, that makes it harder to maintain. Thanks everyone.
Just a warning: while I proposed the parallel NUTS thing… I was not able to convince key stakeholders to pick it up. The assessment at the time was that a 30-40% speedup for doubling CPU resource use is not worth it, in particular in view of the added complexity of the code. I personally have a different view, as I use computers with lots of CPUs all the time, and being 30-40% faster for ANY model without changing the model at all… screams for me to implement it. There is a test implementation based on the Intel TBB graph parallelism stuff, which was fun to play with.
Alright, stakeholders meaning benefactors? I’m not getting paid, I would be doing this for practice/fun. It could be put on a different branch if someone wants to pull and use it for a specific purpose. I’m independent, I don’t have funding, really. But thanks for the reply. So you’re saying you’re in favor? It doesn’t have to be a merged main branch, just a separate feature if it’s helpful to people with less computational resources. Can you point me to the prototype repo? Thanks.
Yeah, people say integrating C/C++ isn’t a big deal, but when you’re actually doing it it’s like walking on hot coals. I’ll go with oneTBB I guess, it’s more modern and probably better maintained in the future
So I'm just seeing tbb::concurrent_vector; does this abstract away the mapping and reducing? I had only done this once, and the multithreading library was internally developed, so I don't have access to documentation anymore.
I'm talking to myself, but I think I'm going to crank up the number of threads at that line of code and open a PR, not for the purpose of merging, but because we have a database of models that evaluates efficiency and I want to see how it scales. Again, I'm not as familiar with TBB, so I have some learning to do. Maybe I'm wasting Columbia's computational resources…
And then we can also multithread expensive algorithms like Cholesky decomposition, independently, not just NUTS, right? Are we already doing this or no?