My collaborator Alex and I are research scientists at Petuum, Inc. (and graduate students at CMU/Imperial). Petuum is working to build a data center operating system that supports all major programming languages and frameworks used in the modern statistics and machine learning community.
We are interested in including Stan in our efforts, because of its ease of use, simple modeling syntax, attention to detail in the HMC implementation, and robust design suitable for a wide variety of users and platforms.
One of the things we are interested in is supporting probabilistic programming languages on a variety of compute architectures, such as GPUs and compute clusters. This could be achieved in Stan by introducing a computational graph framework backend (TensorFlow, DyNet, …) that could perform automatic parallelization and deployment to a variety of systems.
Compared to the current design, this would offer the following advantages:
Maintainability: this would allow Stan to define the mathematical operations it needs, without worrying about optimizing or executing them, simplifying debugging and saving developer time.
Performance: through automatic parallelization and deployment, gradient operations can be transparently executed on GPUs and compute clusters (for applicable models).
We wanted to ask for your input on this idea. If practical and realistic, Petuum may be willing to invest time and engineering resources into its development. We would contribute our changes back as open source to the community.
This seems great to me. I'm not the most knowledgeable about what it would take to do, but people who know what they're talking about will respond soon enough.
The one thing to keep an eye on is that the HMC algorithms in Stan probably won't work in single precision, so some architectures (i.e., ones that either don't have doubles or have slow doubles) won't be appropriate.
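To make the precision concern concrete: HMC accumulates many small increments (leapfrog steps, log-density terms), and in single precision an increment below half the ULP of the running sum is lost entirely. This toy sketch (not Stan code) shows the failure mode:

```cpp
#include <cassert>
#include <cmath>

// Accumulate n increments of 1e-8 onto 1.0 in each precision.
// 1e-8 is below half the ULP of 1.0f (~5.96e-8), so in single
// precision every addition rounds away and the sum never moves.
float sum_float(int n) {
    float s = 1.0f;
    for (int i = 0; i < n; ++i) s += 1e-8f;
    return s;  // stays exactly 1.0f for any n
}

double sum_double(int n) {
    double s = 1.0;
    for (int i = 0; i < n; ++i) s += 1e-8;
    return s;  // ~1.0 + n * 1e-8
}
```

Ten million such increments move the double sum to roughly 1.1 while the float sum is still exactly 1.0, which is the kind of silent error that would corrupt an HMC trajectory on single-precision-only hardware.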
For Stan, that’d mean rewriting pretty much the entire math library and probably the sampling algorithms. Same issue as if you wanted to change R’s backend. At that point, you’d probably be better off just starting from scratch. Maybe you could use the AST generated by our parser and change the code generator.
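To make "reuse the AST, swap the code generator" concrete, here is a toy sketch. None of these names are Stan's actual classes; Stan's real AST is far richer. The point is only that one tree can feed two different backends:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy AST node: either a leaf naming a parameter, or a binary '+'.
struct node {
    std::string name;                 // leaf: parameter name
    std::unique_ptr<node> lhs, rhs;   // internal node: operands of '+'
    bool is_leaf() const { return !lhs; }
};

// Backend 1: emit C++-style arithmetic (roughly what Stan's
// generator does today).
std::string gen_cpp(const node& n) {
    if (n.is_leaf()) return n.name;
    return "(" + gen_cpp(*n.lhs) + " + " + gen_cpp(*n.rhs) + ")";
}

// Backend 2: emit calls into a hypothetical graph-building API
// (loosely TensorFlow-flavored; not a real API).
std::string gen_graph(const node& n) {
    if (n.is_leaf()) return "graph.placeholder(\"" + n.name + "\")";
    return "graph.add(" + gen_graph(*n.lhs) + ", " + gen_graph(*n.rhs) + ")";
}
```

A graph backend would only need to replace the generator half; the parser and AST would be untouched.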
What we’re doing instead is adding MPI for multi-core and GPU support (double precision through OpenCL) for some matrix operations. The MPI is the big win, as most likelihoods are embarrassingly parallelizable. The GPU operations will help with operations like Cholesky factorization, where the data size is quadratic but the required matrix arithmetic is cubic.
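The "embarrassingly parallel likelihood" structure is just a map-reduce over independent log-density terms. A minimal single-machine sketch using std::async (MPI distributes the same shards across machines, with only one double per worker crossing the wire):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <future>
#include <vector>

// Log-density of one normal observation.
double normal_lpdf(double y, double mu, double sigma) {
    const double pi = 3.141592653589793;
    double z = (y - mu) / sigma;
    return -0.5 * z * z - std::log(sigma) - 0.5 * std::log(2.0 * pi);
}

// Each worker sums the log-likelihood of its own shard of the data;
// the per-shard partial sums are then reduced into one total.
double parallel_log_lik(const std::vector<double>& y, double mu,
                        double sigma, int n_workers) {
    std::vector<std::future<double>> parts;
    std::size_t chunk = (y.size() + n_workers - 1) / n_workers;
    for (int w = 0; w < n_workers; ++w) {
        std::size_t lo = w * chunk;
        std::size_t hi = std::min(y.size(), lo + chunk);
        parts.push_back(std::async(std::launch::async, [&y, mu, sigma, lo, hi] {
            double s = 0.0;
            for (std::size_t i = lo; i < hi; ++i)
                s += normal_lpdf(y[i], mu, sigma);
            return s;
        }));
    }
    double total = 0.0;
    for (auto& p : parts) total += p.get();
    return total;
}
```

The reduction is associative up to floating-point rounding, which is why the per-observation terms can be sharded freely.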
Nope, not really any doc other than what’s in the code. The root of the code tree for the AST is here:
It’s an object-oriented, templated C++ structure representing the parse tree for a Stan program. The most complicated part is the variant types used where there are disjunctions in the language. These require fairly complex callbacks to deal with: to write a function over a variant type, you need a function defined for each of its member types, and Boost organizes those into a visitor class.
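For anyone unfamiliar with the pattern: Stan uses Boost variants (the code predates C++17), but a minimal sketch with std::variant shows the same visitor idea. The node and type names here are illustrative, not Stan's actual AST classes:

```cpp
#include <cassert>
#include <string>
#include <variant>

// A stand-in for an AST node that can be one of several types,
// the way the AST uses variants for disjunctions in the grammar.
struct int_literal    { int value; };
struct double_literal { double value; };
struct variable       { std::string name; };

using expr = std::variant<int_literal, double_literal, variable>;

// The visitor: one overload per member type of the variant,
// bundled into a single class, which is the pattern Boost's
// static_visitor requires as well.
struct type_name_visitor {
    std::string operator()(const int_literal&) const    { return "int"; }
    std::string operator()(const double_literal&) const { return "real"; }
    std::string operator()(const variable&) const       { return "var"; }
};

std::string type_of(const expr& e) {
    return std::visit(type_name_visitor{}, e);
}
```

Adding a new node type to the variant forces every visitor to grow a matching overload, which is why these callbacks get verbose but also why the compiler catches unhandled cases.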
The AST output is then plugged into the generator. You could go up a level and trace down from compiler.hpp.