Hello! Curious as to when rstan 2.19 is likely to be out, which I presume will have GPU support :)

Thanks!

It can be done with the GitHub versions of StanHeaders and rstan currently. I’m trying to sort out some remaining issues today so that it can be uploaded to CRAN.

Hopefully users should be able to follow the install instructions for OpenCL available here. Once OpenCL is installed, it should just be a matter of adding the flags `-DSTAN_OPENCL`, `OPENCL_DEVICE_ID`, and `OPENCL_PLATFORM_ID` to `Makevars`, following the info in the doc above on how to set them.
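For example, a `~/.R/Makevars` along these lines should work. This is a sketch: the device and platform IDs below are illustrative placeholders, and the exact flag variables may differ by platform, so check the OpenCL install doc for the values on your machine.

```make
# Enable Stan's OpenCL (GPU) support; the IDs here are illustrative --
# use the platform/device reported for your GPU (e.g. by clinfo).
CXX14FLAGS += -DSTAN_OPENCL -DOPENCL_DEVICE_ID=0 -DOPENCL_PLATFORM_ID=0
# Link against the OpenCL runtime (library name/path varies by OS).
LDFLAGS += -lOpenCL
```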

Is there by any chance any documentation on what operations will be sped up?

I see these:

- `cholesky_decompose`
- `inverse`
- `diagonal_multiply`

Others, for example (?):

- simple matrix multiplications
- transposing a matrix
- simple matrix manipulation (e.g., matrix to vector)
- simple element-wise operations (e.g., `matrix .* matrix`, `square(matrix)`)
- …

Thanks

As of 2.19 (or 2.19.1), only the `cholesky_decompose` speedup is exposed to Stan users.

There are some other functions that are sped up under the hood (lower/upper triangular inverses, various forms of matrix multiplication, etc.), but those are currently only used inside the `cholesky_decompose` implementation. They are now being integrated into the user-exposed `mdivide_left_tri`, `multiply`, etc. functions.
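For reference, this is the kind of model where the exposed speedup kicks in: any call to `cholesky_decompose` on a reasonably large matrix, e.g. a dense multivariate-normal covariance. A minimal sketch (names and priors are illustrative):

```stan
data {
  int<lower=1> N;
  vector[N] y;
  matrix[N, N] Sigma;  // dense covariance supplied as data
}
parameters {
  vector[N] mu;
}
model {
  // With -DSTAN_OPENCL this decomposition runs on the GPU
  // (for large enough N) with no change to the model code.
  matrix[N, N] L = cholesky_decompose(Sigma);
  y ~ multi_normal_cholesky(mu, L);
  mu ~ normal(0, 1);
}
```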

We should see speedups for matrix multiplication, `mdivide_left_tri`, and some GLMs exposed to the user in 2.20. We are working on two larger OpenCL backend features (caching and async/out-of-order execution), and then we should be able to roll those out.

Transposing, element-wise operations and such are a different story. We are currently only looking to speed up individual Stan functions where the input and output both reside in the CPU's main memory on each iteration. The speedup from using a GPU to transpose a matrix is not large enough (the operations are too simple) to overcome the added overhead of transferring data to and from the GPU.

We will be able to provide speedups even for these simple functions for constant data (matrices of doubles in the Stan Math backend), but not for variables (matrices of `stan::math::var`). For variables we might be able to do this with Stan 3, but that is still some time away.
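A sketch of that distinction in model code: the same operation involves only doubles when all its inputs are data, but autodiff `var`s once a parameter enters it (the example below is illustrative):

```stan
data {
  matrix[100, 100] X;
}
transformed data {
  // Pure data: matrices of doubles in the backend, so even simple
  // operations like this could eventually be offloaded profitably.
  matrix[100, 100] Xt = X';
}
parameters {
  matrix[100, 100] B;
}
model {
  // Involves a parameter: matrices of stan::math::var, where the
  // autodiff bookkeeping makes GPU offload of simple ops harder.
  target += sum(Xt * B);
}
```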

Those are both on the short list to be evaluated next, yes. Once caching and async (out-of-order execution) are finished, we might do another post to get some user feedback on their bottlenecks.

I forgot to mention in the previous post that `gp_exp_quad_cov` (the Stan function is `cov_exp_quad`, I think) is also in the works and will be ready for 2.20.

Shouldn’t we also go for a `gp_exp_quad_cholesky`?

I mean, we usually need the Cholesky of the GP kernel. With a `gp_exp_quad_cholesky` the communication cost is reduced and all expensive steps are done in a single go on the GPU. Or is this planned to be handled by another approach (caching/async)?
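To make the proposal concrete: today the two steps are written separately, so the intermediate kernel matrix travels between the two functions. A hypothetical fused `gp_exp_quad_cholesky` would replace the pair of calls below (sketch only; the fused name is the proposal here, not an existing function, and the jitter value is illustrative):

```stan
data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}
parameters {
  real<lower=0> alpha;
  real<lower=0> rho;
}
model {
  // Today: two separate calls, each a GPU round trip at best.
  // A fused gp_exp_quad_cholesky(x, alpha, rho) could do both on
  // the GPU in one go, keeping K in GPU memory throughout.
  matrix[N, N] K = cov_exp_quad(x, alpha, rho)
                   + diag_matrix(rep_vector(1e-9, N));
  matrix[N, N] L = cholesky_decompose(K);
  y ~ multi_normal_cholesky(rep_vector(0, N), L);
  alpha ~ normal(0, 1);
  rho ~ inv_gamma(5, 5);
}
```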

Yeah, that seems like a prime candidate for huge speedups, since the input is basically a vector: you do a `cov_exp_quad` and a Cholesky on the GPU and return the matrix. But we need to add that C++ function to Stan Math first, or is that already happening?

Caching won’t help us there.

No, not there yet… but I think it is obvious that we want this… unless you find a good way that keeps things modular but does the magic in a single go anyway (expression templates?).
