Linear algebra for large arrays of small matrices

bgoodri · September 30, 2019, 11:21pm

Saw this paper

which talks about things like how to compute Cholesky factors of a million 10x10 matrices. @stevebronder and @rok_cesnovar, how GPU friendly would something like that be?

anon75146577 · October 1, 2019, 5:39am

An amount of that is in the MAGMA project (from some of the folks who brought us LAPACK) eg http://icl.cs.utk.edu/projectsfiles/magma/doxygen/group__magma__potrf.html

It’s SO COOL people are starting to develop these sorts of things in open source. (Most of the really really cool scalable linear algebra stuff has traditionally been under awful licenses)

rok_cesnovar · October 1, 2019, 9:53am

This sort of batched cholesky (or any kind of batch linear algebra) is super friendly for GPUs for two reasons in terms of performance:

its embarrassingly parallel
non-trivial work for each thread

Its where you get those 100x+ speedups almost for free. Most of the latest research in linear algebra & GPUs is on the topic of batched linear algebra. Not sure if the reason in the rise of deep learning or something else.

Its also friendly in terms of how easy it is to write it if you have the native cholesky decomposition (or any algorithm) written. Just add another dimension to the thread grid and proclaim that dimension to be the batch ID and add an offset to all memory accesses based on the batch ID. There is some gray area on when do you chose to just switch to 1 thread -> 1 cholesky as apposed to N threads -> 1 cholesky but that is just fine tuning.

That is essentially what they do in MAGMA (paper: https://www.sciencedirect.com/science/article/pii/S1877750316305154)

We can extend our kernels fairly easy. More effort would go in to creating tests than extending the kernels. If there is a use case for this, I am all for it.

The problem you have to solve here is how you pack these batches together in a unified memory chunk for easier transfers. With std::vector<Eigen::MatrixXd> I imagine you have to do a lot of small copies, unless you allocate a large chunk of doubles, make a vector of Eigen::Maps and then just map those doubles in smaller chunks. Eigen does have support for 3D matrices (khm tensors khm) but that is in the unsupported part.

In Magma’s case they accept an array of pointers to the matrices in the batch, which is fine once you have those matrices on the GPU. How you get them there is where it gets tricky. In MAGMA’s case they assume you made those batches in their API and they are present on the device. Which probably in most cases where people use MAGMA its true.

But in our case we at-least have to do the HMC/NUTS/(not sure if there is a new name for the improved sampler) part on the CPU and those transfer have to be efficient. A concatenation of 1 million 10x10 matrices before sending them to the GPU would hurt. But its solvable.

stevebronder · October 1, 2019, 11:53am

I think Rok and I discussed something like this in the past. The other issue we had was how to expose this in higher level Stan. Would it just be

// data: array of 1000 10x10 matrices
matrix[10, 10] foo[1000];
// wherever
B = cholesky_decompose(foo)

@rok_cesnovar how would reverse mode look for this?

With async this is not that bad, your just paying the overhead transfer cost of making multiple move calls.

rok_cesnovar · October 1, 2019, 4:39pm

Similar to a for loop over these matrices, I guess.

Yeah, if you have anything else to keep the GPU busy at that time its not that bad.

Bottom line is that its doable. If there is a use case for this in Stan, we should think about it.

bgoodri · October 1, 2019, 7:03pm

In general, I think a unary function of a matrix should be valid syntax when given an array of matrices, whether the computation is done on a CPU or a GPU. But since we currently need moderately large matrices to benefit from GPUs, I thought that we could widen the net by using a GPU to handle many smaller matrices as well.

stevebronder · October 1, 2019, 7:39pm

May be a good thing to chat about during Thursdays meeting. I like the idea!

Topic		Replies	Views
ViennaCL with stan: Cholesky Benchmark Project Proposals stan-math	56	6231	July 23, 2018
Making big kernels work General	34	2040	August 10, 2018
Taking advantage of both sparse matrices and GPUs Algorithms	7	829	September 19, 2024
Building the GPU Cholesky branch General	2	862	August 8, 2017
Stanc3 Math lib opencl integration Developers	29	1242	September 23, 2019

Linear algebra for large arrays of small matrices

Related topics