We’ve been talking about the possibilities of specialized hardware for Stan. There’s parallel processing and GPUs. Are there other examples? There’s specialized hardware for deep learning that, for example, makes use of known patterns of data reuse. The idea is that there could be chips for Bayes/HMC/NUTS/etc. that would enable efficient computation. Maybe focused on hierarchical models, or maybe focused particularly on autodiff and HMC, or maybe focused on probabilistic programming and the handling of uncertainty?
Elaborating on Andrew’s post a bit, we are looking at getting research funds to create hardware that addresses speed/scaling issues in Stan. In no particular order, it would help to have info about the following, with an eye to what is amenable to a hardware solution, perhaps with algorithm changes as well:
Current processing bottlenecks for all models/data.
Processing bottlenecks for idiosyncratic models/data.
Scale-sensitive issues that grow superlinearly.
Right now we are focused on HMC/NUTS as our inference algorithm but this need not be the case.
I realize this is vague but we are looking to identify where we might make progress with custom hardware.
You have to be clear on what you want to speed up. MCMC is embarrassingly parallelizable if the goal is high ESS per unit of wall time. It’s not so parallelizable if the goal is to get to ESS = 100 quickly.
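To make that distinction concrete, here is a toy wall-time model in Python. It is only a sketch under stated assumptions: every chain pays its own full warmup, ESS adds up across chains, and all the numbers (`warmup_iters`, `ess_per_iter`, `sec_per_iter`) are made up for illustration, not measured from Stan.

```python
import math

def wall_time_to_target_ess(target_ess, n_chains, warmup_iters=1000,
                            ess_per_iter=0.5, sec_per_iter=0.01):
    """Toy model: chains run fully in parallel, each pays its own warmup,
    and effective sample size is assumed to add across chains."""
    sampling_iters = math.ceil(target_ess / (n_chains * ess_per_iter))
    return (warmup_iters + sampling_iters) * sec_per_iter

# Small target: warmup dominates, so 100 chains barely help.
small_speedup = (wall_time_to_target_ess(100, 1)
                 / wall_time_to_target_ess(100, 100))

# Large target: sampling dominates, so 100 chains help a lot.
large_speedup = (wall_time_to_target_ess(100_000, 1)
                 / wall_time_to_target_ess(100_000, 100))

print(f"speedup at ESS=100:     {small_speedup:.2f}x")
print(f"speedup at ESS=100000:  {large_speedup:.2f}x")
```

Under these made-up numbers the per-chain warmup is the serial part, which is why adding chains does little for a small ESS target. Hardware that shortens each gradient evaluation would shrink `sec_per_iter` instead, which helps in both regimes.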
Why custom hardware rather than existing hardware like TPUs or FPGAs?
What makes GPUs and TPUs possible is that the tensor operations to which they’re applied are massively SIMD.
Ask a hardware person what’s possible without SIMD or with a different kind of SIMD than available on GPU or TPU, and have someone on hand who can understand the answer. Don’t you and Tamara have such a person on your grantwriting team?
On the receiving end, you’ll need someone who understands current GPU and CPU architecture, Stan’s autodiff, and Stan’s sampling and optimization algorithms. I’d ask the folks who worked on GPUs like @stevebronder, @seantalts, and @rok_cesnovar.
It sort of does if we want to do full Bayes in high dimensions. There isn’t anything else competitive. It’s not like we can make Gibbs or Metropolis or ensemble methods faster and succeed.
Ask a hardware person what’s possible without SIMD or with a different kind of SIMD than available on GPU or TPU, and have someone on hand who can understand the answer. Don’t you and Tamara have such a person on your grantwriting team?
We don’t have a grantwriting team, but two of our co-PIs, Michael Carbin and Vivienne Sze, know a lot about hardware. We were just posting on the Stan forums to get some other perspectives.
On the receiving end, you’ll need someone who understands current GPU and CPU architecture, Stan’s autodiff, and Stan’s sampling and optimization algorithms. I’d ask the folks who worked on GPUs like @stevebronder, @seantalts, and @rok_cesnovar.
breckbaldwin:
Right now we are focused on HMC/NUTS as our inference algorithm but this need not be the case.
It sort of does if we want to do full Bayes in high dimensions. There isn’t anything else competitive. It’s not like we can make Gibbs or Metropolis or ensemble methods faster and succeed.
We have some ideas that can use HMC/NUTS but are not themselves HMC/NUTS. For example, EP. Conversely, there’s the idea that HMC/NUTS can be helpful in improving approximate algorithms.
There are some FPGA toolchains which take OpenCL as input. That provides a short-term proof of concept, since Stan is already using OpenCL and FPGAs are good for prototyping. In the longer term, being able to unroll the whole gradient computation onto an FPGA would be “game changing” fast, since there’s no more pointer chasing.
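For anyone unsure what "pointer chasing" versus "unrolled" means here, a toy Python sketch follows. It is not Stan's autodiff, just an illustration: `grad_tape` builds a small reverse-mode tape and follows parent references at run time, while `grad_unrolled` is the same gradient written as fixed straight-line arithmetic, which is roughly the form one would hope to lay out in hardware for a fixed model. The function f(x, y) = x*y + sin(x) and all names are hypothetical.

```python
import math

class Tape:
    """Records nodes in creation order so the reverse sweep is a reversed walk."""
    def __init__(self):
        self.nodes = []

    def var(self, value, parents=()):
        # parents: tuple of (parent_node, local_partial_derivative)
        node = {"value": value, "adjoint": 0.0, "parents": parents}
        self.nodes.append(node)
        return node

def grad_tape(x, y):
    """Gradient of f(x, y) = x*y + sin(x) via a run-time tape."""
    t = Tape()
    vx, vy = t.var(x), t.var(y)
    prod = t.var(vx["value"] * vy["value"],
                 ((vx, vy["value"]), (vy, vx["value"])))
    s = t.var(math.sin(vx["value"]), ((vx, math.cos(vx["value"])),))
    out = t.var(prod["value"] + s["value"], ((prod, 1.0), (s, 1.0)))

    out["adjoint"] = 1.0
    # Reverse sweep: each step dereferences parent nodes (the pointer chasing).
    for node in reversed(t.nodes):
        for parent, partial in node["parents"]:
            parent["adjoint"] += node["adjoint"] * partial
    return vx["adjoint"], vy["adjoint"]

def grad_unrolled(x, y):
    """Same gradient as fixed arithmetic: no tape, no references to follow."""
    return y + math.cos(x), x

print(grad_tape(0.3, 2.0))
print(grad_unrolled(0.3, 2.0))
```

The two functions return the same gradient; the difference is that the second has a fixed dataflow known ahead of time, whereas the first discovers it by walking the tape on every evaluation.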