Speedup by using external BLAS/LAPACK with CmdStan and CmdStanR/Py

LAPACK support is included, but not LAPACKE, which is the C interface needed between Eigen and LAPACK. I will clarify this in the original post later today.

On my Ubuntu system, the package went to the usual place, and I didn’t need to provide anything else. Without seeing what you included in make/local, and the specific error, it’s difficult to help more. Check that you have the correct number of l’s in the linker flags (each flag is -l followed by the library name, e.g. -llapacke).
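For reference, a minimal make/local sketch for linking Eigen against the system BLAS/LAPACK on Ubuntu (not necessarily identical to the instructions in the original post; EIGEN_USE_BLAS and EIGEN_USE_LAPACKE are Eigen’s documented switches, and the -l names are the standard Debian library names — adjust to your system):

CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
LDFLAGS += -lblas -llapack -llapacke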

If it helps, here are a few screenshots of the cpp_options I use, plus the results of the “checks” you mentioned in your post.

Ah, I found the problem: I also needed to install liblapacke-dev (and potentially libopenblas-base, but I suspect the first one was the culprit). Thanks again for your help (I’m pretty new to Linux overall).
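For reference, on Ubuntu those packages can be installed with apt (package names may vary between releases):

sudo apt install liblapacke-dev libopenblas-dev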

Gonna try to see if model runtime is better now.

1 Like

I’ve used Linux for a long time, but that helps only a little when there are so many different packages that may need to be installed. I’ll add more details about what I installed.

1 Like

Please tell us more. Which models gave 40% speedups?

Where do we need 40% speedups the most?

2 Likes

I should have been doing something else, but here goes

1. Introduction

  • Different BLAS/LAPACK libraries introduce another non-reproducibility issue, as the computations are not bit-for-bit identical.
  • To get MKL to work with, e.g., Cholesky, liblapacke needs to be replaced with MKL. Instead of selecting the library at link time, Linux systems provide an “alternatives” mechanism that makes it easy to switch which implementation backs a common library. To get MKL and OpenBLAS to work better you need to set them as alternatives for liblapacke, which takes a bit of work (see the sketch after this list). Alternatively, you can replace -llapacke with -lopenblas or -lmkl_rt at link time.
  • To get a significant speed improvement, there need to be big matrix operations, e.g. matrix-matrix products (handled by BLAS) or Cholesky, QR, SVD, etc. (handled by LAPACK).
  • It is likely that the profiler can be used to find the cases where big matrix operations dominate the target computation.
  • These are easy to test, as they don’t require changes to the Stan code (unlike reduce_sum parallelization).
  • I tested with several models, none of them really slow (taking hours or days), but these already give a good idea of when this is likely to be useful and when not.
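As a rough sketch of the alternatives mechanism mentioned above (the alternative names are the Debian/Ubuntu ones and may differ on other distributions; LAPACKE itself may not have its own alternatives group):

sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
sudo update-alternatives --config liblapack.so.3-x86_64-linux-gnu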

2. normal linear regression with N=1e4, p=100

2.1. Stan code with y ~ normal(alpha + x*beta, sigma);

  • Eigen without march/mtune: 24s
  • Eigen: 20s
  • OpenBLAS: 17s
  • OpenBLAS w. 2 threads: 12s
  • MKL: 20s
  • MKL w. 2 threads: 16s

OpenBLAS gives a small 15% time reduction, and with two threads a 40% reduction. MKL provides no time reduction with one thread and only a 20% time reduction with two threads.
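(For anyone reproducing the multi-threaded BLAS runs: OpenBLAS and MKL take their thread counts from environment variables. These are the libraries’ standard variables, not something spelled out above; set them before running the model.)

export OPENBLAS_NUM_THREADS=2
export MKL_NUM_THREADS=2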

2.2. Stan code with y ~ normal_id_glm(x, alpha, beta, sigma);

  • Eigen without march/mtune: 20s
  • Eigen: 13s
  • OpenBLAS: 15s
  • OpenBLAS w. 2 threads: 7s
  • MKL: 12s
  • MKL w. 2 threads: 7s

OpenBLAS is slower than Eigen, but with two threads it provides a 46% time reduction. MKL provides a small 8% time reduction, and with two threads a 46% time reduction. Using parallel threads doesn’t require reduce_sum or any change in the code. The matrix is quite big (10000 x 100).

The great two-thread performance was not observed with y ~ normal(alpha + x*beta, sigma);, so it seems the biggest benefits come when autodiff is not involved (what will happen with varmat?)

2.3. Stan code generated with brm(y ~ ., ...)

  • Eigen without march/mtune: 26s
  • Eigen: 14s
  • Eigen + reduce_sum w. 2 threads: 16s
  • OpenBLAS: 15s
  • OpenBLAS w. 2 threads: 6s
  • OpenBLAS + reduce_sum w. 2 threads: 12s
  • MKL: 16s
  • MKL w. 2 threads: 7s

brms speed is on par with the hand-coded model. Running with more reduce_sum threads doesn’t help, while using more threads in BLAS provides a 50-60% time reduction.

3. GP with covariance matrix N=266

3.1. Stan code with L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

  • Eigen without march/mtune: 95s
  • Eigen: 53s
  • OpenBLAS: 58s
  • OpenBLAS w. 2 threads: 59s
  • MKL: 51s
  • MKL w. 2 threads: 45s
  • MKL w. 3 threads: 43s

OpenBLAS gives a 9% time increase. MKL provides a negligible time reduction unless more threads are used, which gives up to a 19% time reduction.

3.2. Stan code with yn ~ multi_normal(zeros, K_f);

  • Eigen without march/mtune: 110s
  • Eigen: 59s
  • OpenBLAS: 56s
  • OpenBLAS w. 2 threads: 50s
  • MKL: 48s
  • MKL w. 2 threads: 44s
  • MKL w. 3 threads: 40s

First it is notable that when using Eigen, yn ~ multi_normal(zeros, K_f); is 11% slower than L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

OpenBLAS provides a 5% time reduction and MKL a 19% time reduction. With more threads we can get up to a 32% time reduction. The matrix is still quite small, and using more threads gives only a small additional time reduction.

A bigger time reduction was obtained when using a compound function (multi_normal vs cholesky_decompose + multi_normal_cholesky).

4. GP with covariance matrix N=2128

This is using BFGS, which needs only 23 target evaluations to converge.

4.1. Stan code with L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

  • Eigen 19s
  • OpenBLAS 20s
  • OpenBLAS w. 2 threads 15s
  • OpenBLAS w. 3 threads 13s
  • MKL 19s
  • MKL w. 2 threads 14s
  • MKL w. 3 threads 12s

There is no difference between Eigen, OpenBLAS, and MKL with one thread, but OpenBLAS and MKL can provide a 21-37% time reduction when using more threads.

4.2. Stan code with yn ~ multi_normal(zeros, K_f);

  • Eigen 32s
  • OpenBLAS 30s
  • OpenBLAS w. 2 threads 21s
  • OpenBLAS w. 3 threads 18s
  • MKL 30s
  • MKL w. 2 threads 21s
  • MKL w. 3 threads 18s

5. brms hierarchical heteroscedastic linear regression with N=11245, p=1, grouping x | ID

This brms model was suggested by Donald Williams

bf(rt ~ congruency + (congruency | x | ID), sigma ~ congruency + (congruency | x | ID))

  • Eigen without march/mtune: 260s
  • Eigen: 250s
  • Eigen + reduce_sum w. 2 threads: 123s
  • OpenBLAS: 236s
  • OpenBLAS w. 2 threads: 234s
  • OpenBLAS + reduce_sum w. 2 threads: 120s
  • OpenBLAS w. 2 threads + reduce_sum w. 2 threads: 125s
  • MKL: 237s
  • MKL w. 2 threads: 234s
  • MKL + reduce_sum w. 2 threads: 124s

OpenBLAS and MKL provide a 5% time reduction. Additional threads for OpenBLAS or MKL did not help. reduce_sum gives a 50% improvement. This hierarchical model has only a vector (length 11245) times scalar product and a couple of for loops from 1 to N. It is natural that external BLAS/LAPACK don’t help here, but reduce_sum works well.

6. Conclusion

Models with big matrix-vector or matrix-matrix products, or matrix decompositions, can get significant time reductions. With one thread, -march=native -mtune=native should always be used when compiling CmdStan and models. For a single thread, small time reductions can be obtained with an external BLAS/LAPACK, but for most people it’s not worth the trouble of installing and configuring. For those who have big matrix operations, the external packages can provide speedups of up to 60% by using more threads (probably more with better CPUs than the one in my laptop). The Eigen documentation also mentions the possibility of multithreading within Eigen, but that uses OpenMP and I couldn’t get it working (while OpenBLAS and MKL support pthreads, which worked without any additional tricks). Bigger time reductions are observed with compound functions, which hints that varmat could also get bigger time reductions than matvar.
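(As a side check, not part of the benchmarks above: to confirm that a compiled model really picked up the external library rather than Eigen’s built-in routines, listing the binary’s shared-library dependencies is a quick test; mymodel is a placeholder for your compiled model binary.)

ldd ./mymodel | grep -iE 'blas|lapack|mkl'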

EDIT1: typo fixes
EDIT2: added info on what MKL is missing
EDIT3: updated the results with -march=native -mtune=native

12 Likes

Well, this explains why I’ve been seeing speed gains of around 30% under Linux using OpenBLAS compared to run time under Windows on the same system. I had assumed the discrepancy had more to do with the fact that support for parallel computation in R under Windows is generally awful, but I guess there’s a bit more to it than that.

How can we verify that Eigen uses at least the 256-bit AVX instruction set?

FAQ Eigen vectorization

On the x86-64 architecture, SSE2 is generally enabled by default, but you can enable AVX and FMA for better performance

My make/local

CXXFLAGS+= -O3 -march=native -mtune=native
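(One generic way to see which vector instruction sets -march=native enables for your compiler is to dump its predefined macros and look for the AVX/FMA ones; this is a plain GCC/Clang trick, not anything Stan-specific:)

echo | gcc -march=native -dM -E - | grep -E '__(AVX|AVX2|AVX512F|FMA)__'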

@avehtari What was the content of your make/local file when you did the tests with Eigen?

Maybe the question to ask is “how do I sanity check that my Stan model compiled on my machine with my compiler and my version of Eigen, Stan, BLAS, etc, has access to AVX instructions”?

One way to check could be to disassemble your compiled model program, then search the assembly code to see if you can find any AVX instructions. Suppose your compiled model program is named mymodel. On Linux you could run objdump -D mymodel > mymodel.s to write a file mymodel.s containing the assembly code, then search mymodel.s for AVX instructions.

Andriy Makukha on Stack Overflow kindly offers this huge awk one-liner that can be used to search:

Find most common AVX instructions (including scalar, including AVX2, AVX-512 family and some FMA like vfmadd132pd ):

awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' mymodel.s

– source: c++ - How to check if compiled code uses SSE and AVX instructions? - Stack Overflow

However, if you are really unlucky, there is also the possibility that your model program contains AVX instructions but the code doesn’t spend much or any time executing those instructions at runtime.

You could also measure which instructions in your program actually execute at runtime, especially inside hot loops. One way to confirm that AVX instructions are actually executing on Linux is to use the perf profiling tool. You could do something like:

perf record ./mymodel put your usual args here
# wait for mymodel to finish running
perf report

Using perf to analyse what the CPU is doing is invasive and requires elevated permissions for security reasons. On Ubuntu I have to run the following commands to temporarily grant my user elevated permissions before I am allowed to run perf:

sudo sysctl -w kernel.perf_event_paranoid=-1
sudo sysctl -w kernel.kptr_restrict=0

Once profiling information has been collected, perf report will show a summary of how much time is spent inside each function. You can zoom in to view each function call and perf will show a listing of assembly code (also annotated with the names of functions from the C / C++ code if your program was compiled with -g debugging information) and the time spent executing each assembly instruction along the margin. By default when you zoom in to view assembly code of a function, perf will focus the viewport on the section of assembly code where the most time is spent. Then you could see which operations are executing and if any of them match up with the huge list of AVX instructions above.

More generally, see Brendan Gregg’s reference: Linux perf Examples

4 Likes

Empty. I don’t remember it being reported that any CXXFLAGS would give a 30-50% time reduction, so I didn’t try to optimize that part.

I ran the tests with an Intel i7-10510U, which seems to have AVX2 (based on Intel’s website).

Thanks for reminding me about these. I had again forgotten that -march=native and -mtune=native are not on by default, and they have a huge effect. The difference between Eigen’s internal routines and external BLAS/LAPACK is now much smaller. I’ll add the new results later. I think the cmdstanr installer should ask if someone doesn’t want to use them, and use them by default, as the speed difference is huge.

2 Likes

I’d suggest STAN_CPP_OPTIMS=true if you want to turn on most of the magic that @stevebronder once figured out to work OK. I think that includes the m* stuff.
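For reference, in CmdStan’s make/local that is a single line (a sketch; keep any other flags you already have there):

STAN_CPP_OPTIMS=true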

Quick question: I see people adding those options in CmdStanR either as R code (e.g. list(STAN_CPP_OPTIMS = TRUE)) or as strings in make syntax (e.g. list("STAN_CPP_OPTIMS=true")). Do both work?

  • Where is this documented, and how much speedup does it give? I couldn’t find it in the CmdStan User Guide’s chapter on installation: CmdStan Installation | CmdStan User’s Guide
  • I looked in the Makefile, and march and mtune are not mentioned there. Most of the options mentioned are included in -O2, or the gcc manual mentions that they may also make the program slower.
  • Why is it CPP_ and not CXX_?
  • Does STAN_CPP_OPTIMS work on different architectures?
  • It seems STAN_CPP_OPTIMS fails on Apple Silicon: -fwhole-program-vtables in STAN_CPP_OPTIMS breaks linkage on Apple Silicon · Issue #1041 · stan-dev/cmdstan · GitHub
  • Anyway, we often tell people to install Stan using CmdStanR, and those people don’t read any other installation guide, so a suggestion to use any additional options would be good. I think march/mtune, which enable the use of modern CPU features, are more important than the compilation optimization tweaks (most useful optimizations are already included in -O2). We’re talking about 30-50% reductions in sampling time just from enabling march/mtune.

EDIT: fixed compile time → sampling time

1 Like

Should not matter. Either way works…just check the make/local file.

It’s key to get people going… and to start with, people solve smaller problems where this stuff does not matter.

Optimizations have many pitfalls… e.g. the mtune/march stuff is not a good idea on Windows. We barely get the bulk of users running smoothly; optimizations are, to me, secondary.

Secondary in the sense that once you are hooked on Stan, you will find the optimizations… after all, I drove reduce_sum…

1 Like

With the GitHub development version, cmdstanr will warn you that you had flags in your make/local and that you should probably copy them from your previous CmdStan install. But that obviously only works if you had added them at some point.

I am not sure about this one. I specifically remember a ton of rstan issues that were caused by the march and mtune flags, because the rstan install instructions included them and everyone just copy-pasted that. So I am not sure about making them the default option; they should be more prominently suggested somewhere (I don’t currently have a great idea where).

1 Like

Still, the information shouldn’t be hidden if the speed differences are that big. I suggested that install_cmdstan() could ask whether to use march/mtune, with a note that using them can double the speed (drop the time by 50%); if users run into any issues, they can reinstall and choose a “safe mode” when asked about that option.

Can you tell us more about why Windows has problems with march?

reduce_sum is not a good comparison. It was advertised a lot in a certain release, and brms uses it by default. No one had to do any implementation work for march/mtune; it has been available for a long time, it keeps getting more important as CPUs gain new instruction sets, and it’s not mentioned anywhere in the CmdStan installation instructions.

1 Like

I have updated the first post and the post with the timing results to reflect the use of -march=native -mtune=native.

3 Likes

Specifically for CmdStanR, install_cmdstan() can easily be interactive and ask. And the CmdStan User’s Guide installation section should definitely discuss possible options.

1 Like