I see that Bob Carpenter has just posted (at Andrew Gelman’s blog) about support in the current version of Stan for GPU and MPI parallelism. I haven’t upgraded from 2.17, because that is the version the current documentation covers, but if Stan is actually at 2.19 or so and supports all these nice things, then where can I get some docs (I’ll settle for a draft) that would allow me to make use of them? Because I really would like to.
Is it possible to say when Stan parallelism via MPI or threading will land, along with documentation that makes implementing it straightforward?
As far as I can tell, the only Stan interface that currently allows MPI parallelism is CmdStan 2.18 (though the most recent version linked to on mc-stan.org is 2.17).
Here is some documentation for MPI parallelism, but it does not tell the user how to turn parallelism on.
MPI and threading are in the released 2.18, which you can download from GitHub at the moment. We are going to announce 2.18 broadly once RStan and PyStan have upgraded to 2.18.
The documentation, which is linked along with the releases (look on GitHub for the cmdstan and stan releases), contains examples of how to enable parallelism. Please note that only very computation-heavy models will benefit from MPI or threading.
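In case it saves someone a trip through the release notes, here is a rough sketch of how enabling this looks in CmdStan 2.18. The model and data file names are placeholders; check the release documentation for the exact setup on your platform:

```
# In CmdStan's make/local, set one of these before compiling the model:
STAN_THREADS=true          # enable threading for map_rect
# STAN_MPI=true            # or MPI (requires an MPI install, e.g. OpenMPI)

# Threaded run: the thread count is read from the environment
export STAN_NUM_THREADS=4
./my_model sample data file=my_data.R

# MPI run: launch the compiled model through mpirun instead
# mpirun -np 4 ./my_model sample data file=my_data.R
```

Either way, the parallelism only kicks in for the parts of the model you have written with map_rect.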
GPU support is not yet in any released Stan. For that we will have to wait for 2.19, or possibly 2.20.
It is heavy, yes, but it’s not guaranteed that map_rect will give you speedups. Given the hierarchical model structure it makes sense to try this out. The applications I had in mind are hierarchical ODE models, which are crazily expensive per unit of a given model. For your case it will be interesting to see how things perform: the vectorized expressions in Stan are super fast, and when you break them into bigger chunks with map_rect you pay a price for that, which is hopefully offset by the fact that you can scale the performance across cores. In short, this will be an interesting case to try out.

As a rule of thumb, you should chunk your 150k observations into just a few blocks (hopefully your hierarchical model already gives you a “natural” chunking size which makes sense). I am certainly interested in hearing the results.
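To make the chunking concrete, here is a minimal sketch of the map_rect pattern in 2.18 syntax. The shard layout and all names are illustrative only, and the plain normal likelihood is a stand-in for your actual model:

```stan
functions {
  // One shard's contribution to the log likelihood.
  // phi = shared parameters (here: mu and log sigma),
  // theta = shard-specific parameters (unused in this sketch),
  // x_r = the shard's observations, x_i = unused integer data.
  vector shard_loglik(vector phi, vector theta, real[] x_r, int[] x_i) {
    real lp = normal_lpdf(x_r | phi[1], exp(phi[2]));
    return [lp]';
  }
}
data {
  int<lower=1> S;            // number of shards (keep this small)
  int<lower=1> N;            // observations per shard
  real y[S, N];              // observations, pre-chunked into shards
}
transformed data {
  // map_rect needs per-shard parameter and integer-data arrays,
  // even if they are empty
  vector[0] theta[S];
  int x_i[S, 0];
}
parameters {
  real mu;
  real log_sigma;
}
model {
  mu ~ normal(0, 5);
  log_sigma ~ normal(0, 1);
  // sum of per-shard log likelihoods; the shards are what get
  // farmed out to threads or MPI workers
  target += sum(map_rect(shard_loglik, [mu, log_sigma]', theta, y, x_i));
}
```

Note that this model runs serially unless CmdStan was built with STAN_THREADS or STAN_MPI as above; map_rect itself works either way, which makes it easy to check that the chunked model matches the vectorized one before parallelizing.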