MPI has the bottleneck of costly inter process communication. So a hierarchical model where the computation cost per unit is big are a natural candidate to speed things up seriously. The link I posted was the best benchmark I ever saw! But it is a real model for one of our projects. The thing is that an ODE has to be solved per subject - and ODEs are amazingly costly to solve. Hence, the communcation involved is not a lot in comparison and gives an incredible speedup which takes >2.5days walltime down to 1h walltime on 80 cores distributed on 8 machines.

# Stan on GPU: looking for model+dataset examples for empirical evaluation of speedups

**Erik_Strumbelj**#23

Ah, of course, you are right. I keep forgetting about the gradient and auto-diff.

**fabio**#24

@Erik_Strumbelj these my 2 cents that will follow here are not a priority at all. Put it at the last place on the list: I am waiting patiently for the time when I can compare the spatio-temporal models (both areal and continuous) as the one that can be (rather quickly) be fitted by INLA with the same model fitted by STAN.

If you have time, take a look at chapter 7 (page 235- 253) of the *Spatial and Spatio-temporal Bayesian Models with R-INLA* by Blangiardo & Cameletti (Wiley, 2015). The datasets to work with are standard packaged in R.

**Erik_Strumbelj**#25

Quick update: Weāve tested what is currently parallelized in stanmathcl (with some additions that speedup Gaussian density calculation with Cholesky factor & derivative) on a Gaussian process regression model.

The model used was Betancourtās gp3.stan model from https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html and we ran it on similar dataset with n observed points and n points to predict.

Here are the speedups (and times) as n increases:

So, we can get 12+ speedups with large enough n, possibly more, but we didnāt check, because CPU version takes more than a day to run. GPU suffers on small n, due to some overhead and data transfer cost, which we hope to address in the future.

(Edit: The setup was an Intel Xeon CPU running at 2.3GHz and a NVIDIA GTX1070)

With log10(n) on x-axis:

**avehtari**#26

Can you make another plot with log(n) in x axis? If both axes are in log scale then itās easier to see, e.g., if scaling differs from theoretical O(n^3)

**Bob_Carpenter**#28

Just responding to the first post that cuaght my eye.

This all looks great!

GPs were the main motivating example for Cholesky.

Log determinant comes in with any multivariate normal. I donāt know if itās specialized for the triangular (post-Cholesky transform) case.

For matrix-vector multiply, thereās basically any GLM.

For matrix-matrix multiply, Iād look to simple factor models (though those are hard to validate given label switching without ordering constraints).

**wds15**#29

This looks very promising. A few questions:

- Is the GPU running in double of single precision mode?
- Is that GPU any good for double precision?
- How do results compare against the CPU version? How does this comparison fare under single and double precision for the GPU calculations?

ā¦ I am just curiousā¦

What speedups for the multi normal have you done? We have just committed a version of the multi_normal_cholesky which uses analytic gradients to stan-math (and this version gives for d~250 roughly a 2x speed increase of the gradient evaluation time vs the code in 2.17.1; with higher dimensions things should get even better).

Thanks for more info!

**Erik_Strumbelj**#30

Answers to @wds15ās questions:

Is the GPU running in double of single precision mode?

There was some concern that single precision might not be enough for Stan/HMC, so, for now, everything is double precision.

Is that GPU any good for double precision?

The GPU is high-end. However, its double precision performance is not that great. Theoretically, the single precision performance is ~32x better than double precision performance. However, NVIDIA uses some techniques where the single precision floating point units āhelpā the double precision floating point units. The full details on this are not known. We are investigating whether we can influence this in any way. We are also investigating what happens on low-end GPUs, pure scientific GPUs and GPUs of other vendors. (thanks to @rok_cesnovar for this answer)

How do results compare against the CPU version?

We didnāt see any practical difference (visual summary of posterior was the same for GPU and CPU), but we didnāt dig deeper (like compare each sample). They should be āexactlyā the same. Weāll have a closer look.

(edit: We also check each individual component by comparing its output to its stan-math counterpart. Errors in the order of E-14, E-15, but this can be expected due to differences in GPU/CPU number representation standards).

How does this comparison fare under single and double precision for the GPU calculations?

We donāt know (yet).

(with some additions that speedup Gaussian density calculation with Cholesky factor & derivative) What speedups for the multi normal have you done? We have just committed a version of the multi_normal_cholesky which uses analytic gradients to stan-math (and this version gives for d~250 roughly a 2x speed increase of the gradient evaluation time vs the code in 2.17.1; with higher dimensions things should get even better).

We focused on speeding up solving Ax=b, where A is a lower triangular matrix (lines 276 to 281 in https://github.com/stan-dev/math/blob/develop/stan/math/rev/mat/fun/mdivide_left_tri.hpp), which was by far biggest bottleneck for large n. In order to be faster than the CPU for smaller n, we are now focusing on rest of the stan-math that is called in /prim and /rev from this model. Could you point us to the changes that were made? We would love to check what can be gained by parallelizing the new version. We used the 2.17 version. (thanks to @rok_cesnovar for this answer)

**wds15**#31

The old implementation relied on AD for the gradients of the multi variate normal. The new version here calculates the gradients analytically.

The changes I have done there are tailored for GP problems, I think. However, the CPU version got a lot faster with that, so the GPU may have a harder time to outperform. In any case I am quite curious on benchmarks involving GPs and the new multi normal cholesky. For the exact changes to that file, just look at the history of that file.

**ahartikainen**#32

Wikipedia has nice tables for single vs double.

GTX Titan X Black / Titan V could be great for Stan.

**wds15**#33

Uhmā¦ reviewing the multi normal code once more lead to one more improvement such that the code is an additional 40% faster now. So the speed gain was 2.3x vs vanilla 2.17.1 and now I measured around 3.2x speedup. This commit is 15 minutes old.

Now the function does a single mdivide left tri to get the inverse of the cholesky factor and then its only matrix vector products. The inverse of the cholesky factor is needed for the gradient calculation anyways.

**jburton**#34

I would like to cast a vote, so to speak, for hierarchical models based on nonlinear (logistic) regression. I have been in contact with Rok intermittently and have provided a model to him, although I canāt provide the data.

**Bob_Carpenter**#35

Thatās pretty much our first go-to example for testing. Michael Betancourt put together a bunch of solid test cases in a testing repo on stan-dev. You still have to be careful about priors and/or data sizes.

**jburton**#36

Hi Bob,

I should clarify that Iām fitting data to a generalized logistic function. I think logistic regression typically refers to fitting data with binary outcomes, which Iām not doing here.

**maedoc**#37

Quick comment: it seems youāre benchmarking with a GTX 1070 which is fairly low end for GPU work. Itād be good to benchmark on HPC standard GPUs, e.g. P100, as well which benefit from better memory bandwidth as well

**Bob_Carpenter**#38

Thanks for letting me know. I found the Wikipedia article on the generalized logistic function.