For MPI we now have a proposal to manage immutable data in a distributed way. From Bob’s blog post I gathered that this could also be of interest to GPU computing, if I understood his comments correctly. Do we need to account for this in the design now?
I can well imagine that for GPs (for example) it would be quite attractive to copy the immutable data to the GPU just once and then reuse it on each iteration. That could give another performance bump.
Yes, we are absolutely going to want to do this for the GPU code.
Can’t we just do this the same way you’ve proposed handling MPI?
I think we can apply the same design principle, yes. So in essence:
- the gpu/mpi function gets called with the immutable data and a uid
- if the uid has not yet been seen, then the immutable data is distributed for MPI/sent to the GPU
- the distribution ensures that the uid will be recognized next time the function is called
- after the first call of the function, we assume that the same uid implies the same data
We will just keep accumulating singletons in our code base, though. I think the GPU calls can even be nested inside MPI calls when done like that.
A GPU version of cov_exp_quad_cholesky should be super handy for GPs, I think.
That’s exactly why they’re starting where they’re starting—with Cholesky factorization. It’s O(N^2) data but O(N^3) computations. Hopefully we’ll be able to get to the point where we can pass a data matrix (N^2) once and then calculate a matrix-vector product efficiently using the GPU.
Yup. In fact, for cov_exp_quad_cholesky there is no urgent need to cache the data transfer to the GPU, since only N data items are needed to define the N^2 matrix. Of course, transferring those N data items just once is even better.
I am really looking forward to that.
I’m going to dance a jig on a table when we can distribute jobs over multiple GPU-enabled cores!
I am going to remind you of that at a Stan meeting once I’ve merged and run those two branches… I agree that each of those two techniques has huge potential on its own, and taken together they will make Stan a new beast. However, you need serious hardware to get this going, and I guess a lot of time to get it to compile.
Sure, nobody’s going to get faster models on their notebooks. We designed Stan to solve hard problems, and this will really push the frontier of what we can solve!