I was wondering what types of models would be expected to gain a speedup from GPU support. For example, would models like this be expected to get a boost?

My summary would be that if you have fairly large matrices (especially data matrices) involved anywhere in the code, then the GPU may be something to look into. This includes `multi_normal`'s `\Sigma` matrix in your link.

The exact definition of “large” will continue to evolve as these systems change, but there are a few different places where people have published some performance results on them. One is here on page 12: https://arxiv.org/pdf/1907.01063.pdf. These graphs do not take into account the matrix transfer between CPU and GPU, so they are more indicative of the performance you’d see with a data matrix, not a parameter matrix. The `var` version will probably be the limiting factor, and that suggests you start seeing only a ~5x speedup near matrices of size 5000x5000.

@rok_cesnovar do you know of other places where benchmark results for GPUs vs CPUs are published?

Currently, you get the most benefit when using one of the 6 GLM functions: `normal_id_glm_lpdf`, `bernoulli_logit_glm_lpmf`, `categorical_logit_glm_lpmf`, `neg_binomial_2_log_glm_lpmf`, `ordered_logistic_glm_lpmf`, `poisson_log_glm_lpmf`

For these, you can expect speedups even for modestly sized inputs.

Besides that, you can expect speedups with larger matrices (1k x 1k or larger) for models that use:

- multiply
- mdivide_left_tri
- mdivide_right_tri
- cholesky_decompose

It will obviously depend on the exact CPU/GPU you have.

Hmm, would a hypothetical `multi_normal_id_glm_lpdf` get the same gains? Then that would help @CerulloE’s class of model as well (I believe). I could also use such a thing; maybe I’ll take a shot at that this weekend…

Yes, right now the OpenCL back-end basically only parallelizes a single node in the AD computational graph. And the more computationally complex the node is, the more you can gain. Plus the GLMs use the `operands_and_partials` struct, which also makes the reverse-mode steps a bit more optimal.
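To make the operands-and-partials idea concrete, here is a toy sketch in plain C++. None of these types are Stan Math's real API (`LogSumNode` is made up for illustration); the point is just that the forward pass precomputes each partial derivative, so the reverse sweep is a single multiply-add per operand.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy reverse-mode node in the spirit of Stan Math's operands_and_partials:
// the forward pass computes the value AND stores d(output)/d(input_i), so
// the reverse sweep never recomputes any transcendental functions.
struct LogSumNode {
  std::vector<double> partials;  // d(sum log x_i)/d(x_i) = 1/x_i
  double value = 0.0;

  explicit LogSumNode(const std::vector<double>& x) {
    partials.reserve(x.size());
    for (double xi : x) {
      value += std::log(xi);
      partials.push_back(1.0 / xi);  // partial precomputed in forward pass
    }
  }

  // Reverse pass: one multiply-add per operand.
  void chain(double out_adj, std::vector<double>& x_adj) const {
    for (std::size_t i = 0; i < x_adj.size(); ++i)
      x_adj[i] += partials[i] * out_adj;
  }
};
```

The more work a single node like this bundles (a whole GLM lpdf rather than separate multiplies and log-densities), the more of the computation a GPU kernel can absorb in one go.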

The only problem for the GLMs is that the backend currently requires `x` and `y` to be data for all GLMs.

But wait, there is more… this is all just about to change. This time for real :)

Tadej, Steve and Ben are working on the Math backend for `var_value`, which will enable us to move entire subgraphs of the AD computational graph to the GPU. I am slowly working on supporting that for stanc3.

The first step will come once https://github.com/stan-dev/math/pull/2034 and the stanc3 PR (https://github.com/stan-dev/stanc3/pull/689) are merged.

This will enable OpenCL GLMs even when `x` or `y` are parameters. But more importantly, if the input to a function is produced by something that also has OpenCL support, both functions will run on the GPU without intermediate transfers. And that will open a whole new world: models will be faster on a GPU even without these big nodes.
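As a rough illustration of why avoiding intermediate transfers matters, here is a toy C++ sketch. `DeviceVec`, `to_device`, `to_host`, `scale`, and `shift` are all made-up stand-ins (not real Stan Math or OpenCL API); `transfers` just counts host/device copies, which are the expensive part on a real PCIe bus.

```cpp
#include <cassert>
#include <vector>

// Toy model of host<->device traffic; DeviceVec is a stand-in for something
// like matrix_cl, and `transfers` counts copies across the bus.
static int transfers = 0;

struct DeviceVec {
  std::vector<double> data;  // pretend this lives on the GPU
};

DeviceVec to_device(const std::vector<double>& host) {
  ++transfers;
  return DeviceVec{host};
}

std::vector<double> to_host(const DeviceVec& dev) {
  ++transfers;
  return dev.data;
}

// Two device-side ops (stand-ins for, e.g., a multiply feeding a GLM lpdf).
DeviceVec scale(const DeviceVec& v, double a) {
  DeviceVec out = v;
  for (double& x : out.data) x *= a;
  return out;
}
DeviceVec shift(const DeviceVec& v, double b) {
  DeviceVec out = v;
  for (double& x : out.data) x += b;
  return out;
}

// Unfused: round-trip to the host between the two ops (4 transfers).
std::vector<double> unfused(const std::vector<double>& x) {
  std::vector<double> mid = to_host(scale(to_device(x), 2.0));
  return to_host(shift(to_device(mid), 1.0));
}

// Fused: the intermediate stays on the device (2 transfers).
std::vector<double> fused(const std::vector<double>& x) {
  return to_host(shift(scale(to_device(x), 2.0), 1.0));
}
```

Chaining OpenCL-capable nodes cuts the per-op round trips down to one upload and one download for the whole subgraph.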

I would recommend looking at Tadej’s PR (https://github.com/stan-dev/math/pull/2034), but it has a bunch of changes, so probably just take a look at one GLM: https://github.com/stan-dev/math/blob/15eb6a289033345dc8c2fe28003e4dbf70b73e51/stan/math/opencl/prim/normal_id_glm_lpdf.hpp

It’s all done with the kernel generator, so you would most likely not have to write any OpenCL kernel code.

Whoa. Is that written up somewhere? That sounds very cool, curious how it works.

I am not sure there is a central doc for this (I think there is an idea floating around to do another Stan Math paper after this is fully in). Just search for “var_value” PRs/issues in math to get an idea.

The basic idea is that instead of an `Eigen::Matrix<var>` you have `var_value<Eigen::Matrix<double>>` (a var that holds two Eigen matrices). This alone will make Stan a lot faster on the CPU, with better memory efficiency (see https://bbbales2.shinyapps.io/math-shiny-perf/ for some preliminary performance figures; some are fantastic, some need some more work). Code gen will have to be reworked a bit to enable this, but I have the basic stuff ready.

Having all this means we can do `var_value<matrix_cl<double>>`. And two contiguous matrices (values and adjoints) are simple to handle for OpenCL. This was not true for matrices of vars.
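The layout difference can be sketched in plain C++ (using `std::vector` stand-ins rather than Eigen or Stan Math types; `MatrixOfVars` and `VarMatrix` are illustrative names, not the real classes): `Eigen::Matrix<var>` is an array of structs, while `var_value<Eigen::Matrix<double>>` is a struct of arrays with values and adjoints each contiguous.

```cpp
#include <cassert>
#include <vector>

// Array-of-structs layout, like Eigen::Matrix<var>: every element carries
// its own value/adjoint pair, so values are interleaved with adjoints.
struct Var {
  double val;
  double adj;
};
using MatrixOfVars = std::vector<Var>;  // stand-in for Eigen::Matrix<var>

// Struct-of-arrays layout, like var_value<Eigen::Matrix<double>>: ONE node
// holding two contiguous buffers. A contiguous block is what you can hand
// to a BLAS routine or copy to a GPU in a single transfer.
struct VarMatrix {
  std::vector<double> val;  // all values, contiguous
  std::vector<double> adj;  // all adjoints, contiguous
};

// With AoS, "give me all the values" needs a strided gather...
std::vector<double> gather_values(const MatrixOfVars& m) {
  std::vector<double> out;
  out.reserve(m.size());
  for (const Var& v : m) out.push_back(v.val);
  return out;
}

// ...with SoA it is free: the buffer already exists.
const std::vector<double>& values(const VarMatrix& m) { return m.val; }
```

That gather is exactly the copy that made `matrix_cl` awkward to combine with matrices of vars.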

For now we just have implementations for copying to `var_value<matrix_cl>`, but even that will give you an idea: https://github.com/stan-dev/math/blob/develop/stan/math/opencl/rev/copy.hpp

The GLMs will follow next week; then I think Tadej has a multiply implementation ready, and then we can finally proceed with the rest we promised 2 years ago in Helsinki. Code gen for this will have to be reworked a bit as well, but that overlaps with the core var_value changes, so it should not be much additional work.

p.s.: there is also another development that simplifies all this: the reworked `adj_jac_apply` (see https://github.com/stan-dev/math/pull/2011 for details), which introduces `reverse_pass_callback`. Have a look at this neat `dot_product` rev implementation: https://github.com/stan-dev/math/blob/develop/stan/math/rev/fun/dot_product.hpp
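The `reverse_pass_callback` idea can be sketched in a few lines of plain C++ (this is a toy tape, not Stan Math's real implementation; `VarVec` and the signatures here are made up): the forward pass computes the value eagerly and pushes a lambda, and the reverse sweep runs the lambdas in reverse order.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Minimal tape: reverse_pass_callback(f) in Stan Math registers a callable
// that the reverse sweep later runs; here the tape is a vector of lambdas.
static std::vector<std::function<void()>> tape;

void reverse_pass_callback(std::function<void()> f) {
  tape.push_back(std::move(f));
}

void grad() {  // run callbacks in reverse order, like the reverse sweep
  for (auto it = tape.rbegin(); it != tape.rend(); ++it) (*it)();
}

struct VarVec {  // toy var_value-like vector: contiguous values + adjoints
  std::vector<double> val, adj;
};

// dot_product in the reverse_pass_callback style: compute the value now,
// capture references to the operands, and defer the adjoint updates.
double dot_product(VarVec& a, VarVec& b, double out_adj = 1.0) {
  double v = 0.0;
  for (std::size_t i = 0; i < a.val.size(); ++i) v += a.val[i] * b.val[i];
  reverse_pass_callback([&a, &b, out_adj]() {
    for (std::size_t i = 0; i < a.val.size(); ++i) {
      a.adj[i] += out_adj * b.val[i];  // d(a.b)/da_i = b_i
      b.adj[i] += out_adj * a.val[i];  // d(a.b)/db_i = a_i
    }
  });
  return v;
}
```

No custom vari subclass is needed; the lambda captures whatever state the reverse pass requires, which is what makes the real `dot_product` implementation so compact.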

I am really hyped for all of this. Just wish I had more time to work on this right now.

Currently, what level of OpenCL support do you expect will be required from a user’s GPU? Is that minimum level of support something that would be expected to change (in terms of requiring a higher level / version) anytime in the near future? Thanks!

Looks like the minimum level of support is OpenCL 1.2, per the paper above.

There is a design doc here. As Rok said, the long and short of it is just moving from an array of structs to a struct of arrays. As long as subsets of a matrix are not assigned to, the matrix can be represented as `var_value<Eigen::Matrix<double, R, C>>` or `var_value<matrix_cl<double>>`. There’s some really finicky stuff with getting nice slices out, but besides that it’s going pretty smoothly!