Immutable data [GPU and MPI]



For MPI we now have a proposal to manage immutable data in a distributed way. From Bob’s blog post I concluded that this could also be of interest for GPU computing, if I understood his comments correctly. Do we need to consider this in the design now?

I can well imagine that for GPs (for example) it would be quite attractive to copy the immutable data to the GPU just once and then reuse it in each iteration. That may give another performance bump.



Yes, we are absolutely going to want to do this for the GPU code.

Can’t we just do this the same way you’ve proposed handling MPI?



I think we can follow the same design principle, yes. So in essence:

  • the GPU/MPI function gets called with the immutable data and a uid
  • if the uid has not been seen yet, the immutable data is distributed to the workers (MPI) or sent to the GPU
  • the distribution step ensures that the uid will be recognized the next time the function is called
  • after the first call of the function we assume that the same uid means the same data
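
To make the steps above concrete, here is a minimal C++ sketch of the uid lookup. Everything in it is an assumption for illustration: `RemoteCopy`, `ImmutableDataCache` and `get_or_transfer` are hypothetical names, and the "transfer" is just a host-side copy standing in for an MPI broadcast or a host-to-device copy. It is not the actual Stan design.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a device buffer or a per-worker copy of the data.
struct RemoteCopy {
  std::vector<double> values;
};

// Singleton-style cache keyed by uid (all names are illustrative).
class ImmutableDataCache {
 public:
  static ImmutableDataCache& instance() {
    static ImmutableDataCache cache;
    return cache;
  }

  // Returns the cached copy for uid; the data is "transferred" only the
  // first time a uid is seen.
  const RemoteCopy& get_or_transfer(int uid, const std::vector<double>& data) {
    auto it = copies_.find(uid);
    if (it == copies_.end()) {
      // First call with this uid: do the (simulated) transfer and remember it.
      ++transfers_;
      it = copies_.emplace(uid, RemoteCopy{data}).first;
    } else {
      // Later calls trust that the same uid means the same data; only a
      // cheap size check is done here, not a full comparison.
      assert(it->second.values.size() == data.size());
    }
    return it->second;
  }

  // Number of transfers performed so far.
  std::size_t transfers() const { return transfers_; }

 private:
  ImmutableDataCache() = default;
  std::unordered_map<int, RemoteCopy> copies_;
  std::size_t transfers_ = 0;
};
```

Calling `get_or_transfer` twice with the same uid increments the transfer counter only once, which is the whole point: every later iteration reuses the copy that is already in place.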

The only downside is that we end up with more and more singletons floating around in our code base. I think this GPU stuff can even be nested inside MPI calls when done this way.

A GPU version of cov_exp_quad_cholesky should be super handy for GPs, I think.



That’s exactly why they’re starting where they’re starting—with Cholesky factorization. It’s O(N^2) data but O(N^3) computations. Hopefully we’ll be able to get to the point where we can pass a data matrix (N^2) once and then calculate a matrix-vector product efficiently using the GPU.


Yup. In fact, for cov_exp_quad_cholesky there is no urgent need to transfer the data to the GPU, as you only need to process N data items to define the N^2 matrix. Of course, transferring the N data items just once is even better.
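
As a rough illustration of why only N values have to cross the bus, here is a plain C++ sketch that builds the N^2 covariance entries from N scalar inputs. I am assuming the usual squared exponential kernel, k(x_i, x_j) = alpha^2 * exp(-(x_i - x_j)^2 / (2 * rho^2)); `cov_exp_quad_sketch` is a hypothetical helper written for this post, not Stan's implementation, and on a GPU the double loop would become a kernel running on the device.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative helper (not Stan's code): builds the N x N covariance matrix
// in row-major order from N scalar inputs.
std::vector<double> cov_exp_quad_sketch(const std::vector<double>& x,
                                        double alpha, double rho) {
  const std::size_t n = x.size();
  std::vector<double> k(n * n);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      const double d = x[i] - x[j];
      // Squared exponential kernel entry for the pair (i, j).
      k[i * n + j] = alpha * alpha * std::exp(-d * d / (2.0 * rho * rho));
    }
  }
  return k;
}
```

The input is N doubles and the output is N^2 doubles, so if the matrix is assembled where the Cholesky factorization runs, only the N inputs (plus alpha and rho) need to be sent.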

I am really looking forward to that.


I’m going to dance a jig on a table when we can distribute jobs over multiple GPU-enabled cores!


I am going to remind you of that during a Stan meeting once I have merged and run those two branches… I agree that those two techniques each have huge potential on their own, and taken together they make Stan a new beast. However, you need serious hardware to get this going and, I guess, a lot of time to get it to compile.


Sure, nobody’s going to get faster models on their notebooks. We designed Stan to solve hard problems, and this will really push the frontier of what we can solve!