Extending reduce_sum to use MPI

Hi Stan Dev team -

I am reaching out because I have found reduce_sum to be incredibly helpful for fitting big models. Its only limitation is that the threaded parallelization means I can’t use more than one machine on my uni’s cluster.

I understand that extending reduce_sum to use a more general interface like MPI would be no mean feat, but I wanted to get an idea of just how hard it would be, and whether I should consider applying for grant funding for a programmer (post doc?) to tackle this task. map_rect is difficult to use for the latent-variable problems I primarily work on, which involve lots of irregular arrays.

Thanks much for any general ideas… and if it’s technically infeasible, that’s great to know too :).


I’d love that feature as well. My personal approach would be to piggy-back the arguments given to an MPI reduce_sum onto the MPI map_rect. That would mean the variadic arguments have to be packed into the format map_rect offers, which should be doable. Then one could just leverage the existing MPI map_rect facility. As I recall, there has been some progress toward making map_rect work with closures, which I think would remove the need for a reduce_sum with an MPI backend (you could just pack the complicated arguments into the closure). @nhuurre has worked on this… and maybe we should pick this up.

I’d assume that getting someone to work on it is the right thing. When doing so, one should ideally also keep the bigger picture in view. The MPI facilities were designed to run within a program that runs only a single chain per process. With the upcoming threaded Stan API that assumption will be broken, and some engineering is needed to map MPI workers to chains running in threads.

Where can I find more info on this?

@saudiwin I agree that an MPI reduce_sum would be great to have.

To the best of my knowledge, no one from the Stan dev team is currently working on this; at least, no one has said so publicly. The idea has been circulating and has been discussed a few times.

Getting a post doc to work on this would be great. It will definitely land faster than if we wait for it to come onto the radar of one of the Stan devs :) I would suggest opening an issue in Stan Math so others know it’s something users want.

I think the assumption here should be that with MPI, a single executable only runs a single chain. Otherwise things could get messy real fast.

Here is the design doc: https://github.com/stan-dev/design-docs/pull/40 There are links to the open PRs in the thread.

Agree.

This limit is due to map_rect’s design, not MPI’s, and it is actually what prompted me to design another MPI implementation. By design, MPI is natural for parallel chains, which is why @bbbales2 and I were able to quickly implement cross-chain, and now I can apply it to a cluster. I think now is a good time to rethink Stan’s parallelism framework instead of scaffolding on existing things.


It doesn’t solve your immediate needs, but an alternative that deserves a grant is to implement a subset of the MPI-2 standard for Stan. It offers better flexibility and would allow users to parallelize beyond the current map-reduce paradigm.
