Parallelism in Stan

Hi!

As discussed recently in our meetings, I started a wiki on parallelism in Stan here:

The motivating example is hierarchical ODEs, but I am sure that other areas of Stan can benefit from the proposed approach, so I see the page as a general point of reference for the discussion around parallelism. I think we should get our heads around these questions:

  • Is this a good design principle which we want to introduce to the Stan language?
  • Any improvements for the design?

Comments are very welcome!

Let me know if any sections are unclear and need more explanation from my end.

Best,
Sebastian

@wds15 Are you still around? The wiki doesn’t point to anything anymore, but if there’s a consensus, I’d do some grunt work and build it. I’ve done some parallelism before, in C. Seems like a good time.

I’m digging up something from a long time ago, not sure what updates to NUTS have been done yet.

Sounds like years ago the devs just talked about it and no one said yes or no.

This is an open-source project, so I’d be down to hit it.

I think we’re talking about this issue: https://github.com/stan-dev/stan/issues/2818

Was there a yes or no consensus on this? Is this outdated?

I’m looking for something to do.

I’m happy to build it as long as we have some good reviewers that understand parallelization (you).

And here’s another resource: Parallel dynamic HMC merits

Thoughts?

Alright, anyone? There was something on @andrewgelman’s blog about another way to parallelize HMC, I don’t remember exactly. I’d be down for that. @Bob_Carpenter Thoughts?

I’m in the mood for some multi-threading in C++. I think @stevebronder has the biggest NUTS on the dev team right now, I’m down to implement this? @wds15 has disappeared.

I need a yes or no before I proceed.

I have successfully merged parallelized C code before, although admittedly, I’m not the best communicator.

Still around! Just busy…happy to look at stuff.

A cool project would be a variadic map function with parallelism and/or a reduce_sum with an MPI backend (you could reformat the variadic things to a map_rect-style implementation and just write an adapter).
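To make the variadic map idea concrete, here is a minimal sketch in plain C++14 using std::async. The name parallel_map and its signature are hypothetical, not an existing Stan function. It spawns one task per item rather than using a TBB or MPI backend, so it only illustrates the calling convention: a function, a vector of items, and any number of shared arguments passed to every call.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper (not part of Stan): apply f(item, shared...) to every
// item, one std::async task per item, and collect the results in input order.
// Assumes f is safe to call concurrently and returns a non-void value.
template <typename F, typename T, typename... Shared>
auto parallel_map(F f, const std::vector<T>& items, const Shared&... shared) {
  using R = decltype(f(items[0], shared...));  // unevaluated, safe if empty
  std::vector<std::future<R>> futures;
  futures.reserve(items.size());
  for (const T& item : items) {
    // Each task reads the item and the shared arguments by reference; all
    // tasks are joined below, before those references go out of scope.
    futures.push_back(std::async(
        std::launch::async,
        [&f, &item, &shared...] { return f(item, shared...); }));
  }
  std::vector<R> out;
  out.reserve(futures.size());
  for (auto& fut : futures) {
    out.push_back(fut.get());  // blocks until that task has finished
  }
  assert(out.size() == items.size());
  return out;
}
```

For example, calling parallel_map([](int x, int scale, int offset) { return x * scale + offset; }, items, 10, 1) on items {1, 2, 3} gives {11, 21, 31}. A real implementation would chunk the work and handle autodiff, both of which this sketch ignores.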

The parallel NUTS thing I suggested a while ago lets users double the CPU use while giving a speedup of 40% or so… I’d say it’s worth it, given that it works on almost any model. The price to pay is greater code complexity, as I recall (unless you neatly refactor things).

Great, thanks so much. I just needed a green light to see if this is something we wanted. Adding more features can increase maintenance costs, and since it’s a small open-source project, that makes everything harder to maintain. Thanks everyone.

Just a warning: while I proposed the parallel NUTS thing… I was not able to convince key stakeholders to pick it up. The assessment at the time was that a 30-40% speedup for doubling CPU resource use is not worth it, in particular in view of the added code complexity. I personally have a different view, as I use computers with lots of CPUs all the time, and being 30-40% faster for ANY model without changing the model at all… screams for me to implement it. There is a test implementation based on the Intel TBB graph parallelism stuff, which was fun to play with.
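As a rough back-of-the-envelope reading of those numbers, assume "doubling CPU use" means two cores per chain instead of one ($p = 2$) and a speedup $S$ between 1.3 and 1.4. Parallel efficiency, relative wall time and relative total CPU time are then:

$$
E = \frac{S}{p} \approx 0.65 \text{ to } 0.70, \qquad
\frac{T_{\text{wall,new}}}{T_{\text{wall,old}}} = \frac{1}{S} \approx 0.77 \text{ to } 0.71, \qquad
\frac{\text{CPU}_{\text{new}}}{\text{CPU}_{\text{old}}} = \frac{p}{S} \approx 1.54 \text{ to } 1.43
$$

So under that assumption wall time drops by roughly a quarter while total CPU time goes up by roughly half, which is the trade-off being argued about.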

Alright, stakeholders meaning benefactors? I’m not getting paid, I would be doing this for practice/fun. It could be put on a different branch if someone wants to pull and use it for a specific purpose. I’m independent, I don’t have funding, really. But thanks for the reply. So you’re saying you’re in favor? It doesn’t have to be a merged main branch, just a separate feature if it’s helpful to people with less computational resources. Can you point me to the prototype repo? Thanks.

This should be the one: the feature/speculative-nuts branch of stan-dev/stan on GitHub. It did run with the Stan code from back then. TBB or oneTBB… should not matter.

Yeah, people say integrating C/C++ isn’t a big deal, but when you’re actually doing it, it’s like walking on hot coals. I’ll go with oneTBB, I guess; it’s more modern and will probably be better maintained in the future.

So I’m just seeing tbb::concurrent_vector. Does this abstract away the mapping and reducing? I had only done this once, and the multithreading library was internally developed, so I don’t have access to the documentation anymore.

I’m looking here: const bool run_serial = stan::math::internal::get_num_threads() == 1;

That is line 248 of src/stan/mcmc/hmc/nuts/base_nuts.hpp on the feature/speculative-nuts branch of stan-dev/stan.

Is this SMP or MPP? I haven’t used TBB. Usually you have to write a mapping function and a reducing function, but I’m not sure; I don’t remember.
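For reference, this is roughly what the hand-written map-and-reduce pattern looks like with standard C++ threads on shared memory. It is only a sketch: map_reduce_sum is a hypothetical name, it is not taken from the Stan branch, and it does not use TBB. TBB is a shared-memory threading library and offers tbb::parallel_reduce for this kind of job, so much of the chunking below would be handled for you.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: sum map_fn(x) over all x. The input is split into
// contiguous chunks, one thread per chunk, all sharing the same memory.
double map_reduce_sum(const std::vector<double>& xs,
                      const std::function<double(double)>& map_fn,
                      std::size_t num_threads) {
  assert(num_threads > 0);
  // One result slot per thread, so no locking is needed.
  std::vector<double> partial(num_threads, 0.0);
  std::vector<std::thread> workers;
  const std::size_t chunk = (xs.size() + num_threads - 1) / num_threads;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = std::min(xs.size(), t * chunk);
    const std::size_t end = std::min(xs.size(), begin + chunk);
    workers.emplace_back([&xs, &map_fn, &partial, t, begin, end] {
      double acc = 0.0;
      for (std::size_t i = begin; i < end; ++i) {
        acc += map_fn(xs[i]);  // map step on this thread's chunk
      }
      partial[t] = acc;
    });
  }
  for (auto& w : workers) {
    w.join();  // wait for every chunk before combining
  }
  double total = 0.0;
  for (double p : partial) {
    total += p;  // reduce step, done serially on the calling thread
  }
  return total;
}
```

With xs = {1, 2, 3, 4}, a squaring map_fn and 3 threads, the chunks are {1, 2}, {3, 4} and an empty one, and the result is 30.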

And how far did you push it? How many threads and on what models?

It looks like on TBB a lot of the programming is abstracted away from you, am I wrong?

I had some help, sure, but it has been 3 years since I tried to multithread anything.

I’m talking to myself, but I think I’m going to crank up the number of threads at that line of code and open a PR, not for the purpose of merging, but because we have a database of models that evaluates efficiency and I want to see how it scales. Again, I’m not as familiar with TBB, so I have some learning to do. Maybe I’m wasting Columbia’s computational resources…

And then we can also multithread expensive algorithms like Cholesky decomposition, independently, not just NUTS, right? Are we already doing this or no?