Saw this paper
An amount of that is in the MAGMA project (from some of the folks who brought us LAPACK) eg http://icl.cs.utk.edu/projectsfiles/magma/doxygen/group__magma__potrf.html
It’s SO COOL people are starting to develop these sorts of things in open source. (Most of the really really cool scalable linear algebra stuff has traditionally been under awful licenses)
This sort of batched cholesky (or any kind of batch linear algebra) is super friendly for GPUs for two reasons in terms of performance:
- its embarrassingly parallel
- non-trivial work for each thread
Its where you get those 100x+ speedups almost for free. Most of the latest research in linear algebra & GPUs is on the topic of batched linear algebra. Not sure if the reason in the rise of deep learning or something else.
Its also friendly in terms of how easy it is to write it if you have the native cholesky decomposition (or any algorithm) written. Just add another dimension to the thread grid and proclaim that dimension to be the batch ID and add an offset to all memory accesses based on the batch ID. There is some gray area on when do you chose to just switch to 1 thread -> 1 cholesky as apposed to N threads -> 1 cholesky but that is just fine tuning.
That is essentially what they do in MAGMA (paper: https://www.sciencedirect.com/science/article/pii/S1877750316305154)
We can extend our kernels fairly easy. More effort would go in to creating tests than extending the kernels. If there is a use case for this, I am all for it.
The problem you have to solve here is how you pack these batches together in a unified memory chunk for easier transfers. With
std::vector<Eigen::MatrixXd> I imagine you have to do a lot of small copies, unless you allocate a large chunk of doubles, make a vector of Eigen::Maps and then just map those doubles in smaller chunks. Eigen does have support for 3D matrices (khm tensors khm) but that is in the unsupported part.
In Magma’s case they accept an array of pointers to the matrices in the batch, which is fine once you have those matrices on the GPU. How you get them there is where it gets tricky. In MAGMA’s case they assume you made those batches in their API and they are present on the device. Which probably in most cases where people use MAGMA its true.
But in our case we at-least have to do the HMC/NUTS/(not sure if there is a new name for the improved sampler) part on the CPU and those transfer have to be efficient. A concatenation of 1 million 10x10 matrices before sending them to the GPU would hurt. But its solvable.
I think Rok and I discussed something like this in the past. The other issue we had was how to expose this in higher level Stan. Would it just be
// data: array of 1000 10x10 matrices matrix[10, 10] foo; // wherever B = cholesky_decompose(foo)
@rok_cesnovar how would reverse mode look for this?
With async this is not that bad, your just paying the overhead transfer cost of making multiple move calls.
Similar to a for loop over these matrices, I guess.
Yeah, if you have anything else to keep the GPU busy at that time its not that bad.
Bottom line is that its doable. If there is a use case for this in Stan, we should think about it.
In general, I think a unary function of a matrix should be valid syntax when given an array of matrices, whether the computation is done on a CPU or a GPU. But since we currently need moderately large matrices to benefit from GPUs, I thought that we could widen the net by using a GPU to handle many smaller matrices as well.
May be a good thing to chat about during Thursdays meeting. I like the idea!