Extending reduce_sum to use MPI

Hi Stan Dev team -

I am reaching out because I have found reduce_sum to be incredibly helpful for fitting big models. Its only limitation is that the threaded parallelization means I can’t use more than one machine on my uni’s cluster.

I understand that extending reduce_sum to use a more general interface like MPI would be no mean feat, but I wanted to get an idea of just how hard it would be, and whether I should consider applying for grant funding for a programmer (post doc?) to tackle this task. map_rect is difficult to use for the latent-variable problems I primarily work on, which involve lots of irregular arrays.

Thanks much for any general ideas… and if it’s technically infeasible, that’s great to know too :).


I’d love that feature as well. My personal approach would be to piggy-back the arguments given to an MPI reduce_sum onto the MPI map_rect. That would mean the variadic arguments have to be packed into the format map_rect offers, which should be doable. Then one could just leverage the existing MPI map_rect facility. As I recall, there has been some progress toward making map_rect work with closures, which I think would remove the need for a reduce_sum with an MPI backend (you could just pack the complicated arguments into the closure). @nhuurre has worked on this… and maybe we should pick this up.

I’d assume that getting someone to work on it is the right thing. When doing so, one should ideally also keep the bigger picture in view. The MPI facilities were designed to run within a program that runs only a single chain per process. With the upcoming threaded Stan API that assumption will be broken, and some engineering is needed to map MPI workers to chains running in threads.

Where can I find more info on this?

@saudiwin I agree that an MPI reduce_sum would be great to have.

To the best of my knowledge, no one from the Stan dev team is currently working on this; at least, no one has said so publicly. The idea has been circulating and has been discussed a few times.

Getting a post doc to work on this would be great. It will definitely land faster than if we wait for it to come onto the radar of one of the Stan devs :) I would suggest opening an issue in Stan Math so others know it’s something users want.

I think the assumption here should be that with MPI, a single executable only runs a single chain. Otherwise things could get messy real fast.

Here is the design doc: https://github.com/stan-dev/design-docs/pull/40 There are links to the open PRs in the thread.

Agree.

This limit is due to map_rect’s design, not MPI’s, and it is actually what prompted me to design another MPI implementation. By design, MPI is natural for parallel chains, which is why @bbbales2 and I were able to quickly implement cross-chain, and now I can apply it to a cluster. I think now is a good time to rethink Stan’s parallelism framework instead of scaffolding on existing things.


It doesn’t solve your immediate needs, but an alternative that deserves a grant is to implement a subset of the MPI-2 standard for Stan. It offers better flexibility and would allow users to parallelize beyond the current map-reduce paradigm.
