Speedup by using external BLAS/LAPACK with CmdStan and CmdStanR/Py

I should have been doing something else, but here goes

1. Introduction

  • different BLAS/LAPACK libraries introduce another non-reproducibility issue, as the computations are not bit-by-bit identical
  • To get MKL to work with, e.g., Cholesky, liblapacke needs to be replaced with MKL. Instead of selecting the library at link time, Linux systems provide an alternatives mechanism that makes it easy to switch which library backs a common interface. To get MKL and OpenBLAS to work better you need to set them as alternatives for liblapacke, which needs a bit of work. Or you can replace -llapacke with -lopenblas or -lmkl_rt at link time (see the make/local sketch after this list).
  • to get a significant speed improvement, there need to be big matrix operations, e.g. matrix-matrix products (handled by BLAS) or Cholesky (or QR, SVD, etc., handled by LAPACK)
  • the profiler can likely be used to find the cases where big matrix operations dominate the target computation (see the profiling sketch after this list)
  • these are easy to test, as this doesn't require changes to the code (unlike reduce_sum parallelization, which does)
  • I tested with several models, but nothing really slow that would take hours or days; these already give a good idea of when this is likely to be useful and when not.
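
Concretely, both the native-arch flags and the link-time switch can go in CmdStan's make/local. This is only a sketch under my assumptions: Eigen's redirection macros are EIGEN_USE_BLAS/EIGEN_USE_LAPACKE, variable names may differ slightly across CmdStan versions, and MKL's include/library search paths are not shown:

```make
# CmdStan make/local -- a sketch; adjust library names and search
# paths for your distribution and MKL installation
CXXFLAGS += -march=native -mtune=native
# route Eigen's matrix ops to an external BLAS/LAPACKE backend
CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
# option 1: OpenBLAS (most distro builds include LAPACKE symbols)
LDLIBS += -lopenblas
# option 2: MKL's single dynamic library (instead of the above)
# LDLIBS += -lmkl_rt
```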
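
For the profiling, Stan's profile blocks are enough to check whether a big matrix operation dominates. A minimal sketch, assuming K_f, yn, zeros, and N are defined elsewhere in the model:

```stan
model {
  matrix[N, N] L_f;
  // per-block forward and reverse pass times end up in the
  // profiling CSV, showing whether the Cholesky dominates
  profile("cholesky") {
    L_f = cholesky_decompose(K_f);
  }
  profile("likelihood") {
    yn ~ multi_normal_cholesky(zeros, L_f);
  }
}
```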

2. normal linear regression with N=1e4, p=100

2.1. Stan code with y ~ normal(alpha + x*beta, sigma);
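
For reference, a minimal sketch of this kind of model (variable names and bounds are my assumptions, not necessarily the exact benchmarked code):

```stan
data {
  int<lower=0> N;        // 1e4 observations
  int<lower=0> p;        // 100 predictors
  matrix[N, p] x;
  vector[N] y;
}
parameters {
  real alpha;
  vector[p] beta;
  real<lower=0> sigma;
}
model {
  // the N x p matrix-vector product x * beta is the BLAS-friendly part
  y ~ normal(alpha + x * beta, sigma);
}
```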

  • Eigen without march/mtune: 24s
  • Eigen: 20s
  • OpenBLAS: 17s
  • OpenBLAS w. 2 threads: 12s
  • MKL: 20s
  • MKL w. 2 threads: 16s

OpenBLAS gives a small 15% time reduction, and with two threads a 40% time reduction. MKL doesn't provide any time reduction with one thread, and only a 20% time reduction with two threads.

2.2. Stan code with y ~ normal_id_glm(x, alpha, beta, sigma);
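
With the same data and parameters blocks as in the sketch above, only the likelihood line changes:

```stan
model {
  // compound GLM: the linear predictor and the lpdf are fused into a
  // single function with analytic gradients, so less autodiff overhead
  y ~ normal_id_glm(x, alpha, beta, sigma);
}
```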

  • Eigen without march/mtune: 20s
  • Eigen: 13s
  • OpenBLAS: 15s
  • OpenBLAS w. 2 threads: 7s
  • MKL: 12s
  • MKL w. 2 threads: 7s

OpenBLAS is slower than Eigen, but with two threads provides a 46% time reduction. MKL provides a small 8% time reduction, but with two threads provides a 46% time reduction. Using parallel BLAS threads doesn't need reduce_sum or any change in the code. The matrix is quite big, 10000 x 100.

The great two-thread performance was not observed with y ~ normal(alpha + x*beta, sigma);, so it seems that the biggest benefits come when the autodiff is not involved (what will happen with varmat?).

2.3. Stan code generated with brm(y ~ ., ...)

  • Eigen without march/mtune: 26s
  • Eigen: 14s
  • Eigen + reduce_sum w. 2 threads: 16s
  • OpenBLAS: 15s
  • OpenBLAS w. 2 threads: 6s
  • OpenBLAS + reduce_sum w. 2 threads: 12s
  • MKL: 16s
  • MKL w. 2 threads: 7s

brms speed is on par with the hand-coded model. Running with more reduce_sum threads doesn't help, while using more threads in BLAS provides a 50-60% time reduction.

3. GP with covariance matrix N=266

3.1. Stan code with L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);
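
A sketch of what such a GP model might look like (the kernel, the jitter, the priors, and the variable names are my assumptions, not necessarily the exact benchmarked code):

```stan
data {
  int<lower=1> N;          // 266 in this benchmark
  array[N] real x;
  vector[N] yn;
}
transformed data {
  vector[N] zeros = rep_vector(0, N);
}
parameters {
  real<lower=0> sigma_f;
  real<lower=0> lengthscale_f;
}
model {
  // forming K_f is O(N^2); the Cholesky is the O(N^3) LAPACK part
  matrix[N, N] K_f = gp_exp_quad_cov(x, sigma_f, lengthscale_f)
                     + diag_matrix(rep_vector(1e-9, N));
  matrix[N, N] L_f = cholesky_decompose(K_f);
  sigma_f ~ normal(0, 1);
  lengthscale_f ~ normal(0, 1);
  yn ~ multi_normal_cholesky(zeros, L_f);
}
```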

  • Eigen without march/mtune: 95s
  • Eigen: 53s
  • OpenBLAS: 58s
  • OpenBLAS w. 2 threads: 59s
  • MKL: 51s
  • MKL w. 2 threads: 45s
  • MKL w. 3 threads: 43s

OpenBLAS causes a 9% time increase. MKL provides a negligible time reduction, unless more threads are used, giving up to a 19% time reduction.

3.2. Stan code with yn ~ multi_normal(zeros, K_f);

  • Eigen without march/mtune: 110s
  • Eigen: 59s
  • OpenBLAS: 56s
  • OpenBLAS w. 2 threads: 50s
  • MKL : 48s
  • MKL w. 2 threads: 44s
  • MKL w. 3 threads: 40s

First, it is notable that when using Eigen, yn ~ multi_normal(zeros, K_f); is 11% slower than L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);.

OpenBLAS provides a 5% time reduction and MKL a 19% time reduction. With more threads we can get up to a 32% time reduction. The matrix is still quite small, and using more threads gives only a small additional time reduction.

A bigger time reduction from the external libraries was obtained with the compound function (multi_normal) than with cholesky_decompose + multi_normal_cholesky.

4. GP with covariance matrix N=2128

This is using BFGS, which needs only 23 target evaluations to converge.

4.1. Stan code with L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

  • Eigen: 19s
  • OpenBLAS: 20s
  • OpenBLAS w. 2 threads: 15s
  • OpenBLAS w. 3 threads: 13s
  • MKL: 19s
  • MKL w. 2 threads: 14s
  • MKL w. 3 threads: 12s

With one thread there is no difference between Eigen, OpenBLAS, and MKL, but OpenBLAS and MKL can provide a 21-37% time reduction when using more threads.

4.2. Stan code with yn ~ multi_normal(zeros, K_f);

  • Eigen: 32s
  • OpenBLAS: 30s
  • OpenBLAS w. 2 threads: 21s
  • OpenBLAS w. 3 threads: 18s
  • MKL: 30s
  • MKL w. 2 threads: 21s
  • MKL w. 3 threads: 18s

5. brms hierarchical heteroscedastic linear regression with N=11245, p=1, grouping x | ID

This brms model was suggested by Donald Williams:

bf(rt ~ congruency + (congruency | x | ID), sigma ~ congruency + (congruency | x | ID))

  • Eigen without march/mtune: 260s
  • Eigen: 250s
  • Eigen + reduce_sum w. 2 threads: 123s
  • OpenBLAS: 236s
  • OpenBLAS w. 2 threads: 234s
  • OpenBLAS + reduce_sum w. 2 threads: 120s
  • OpenBLAS w. 2 threads + reduce_sum w. 2 threads: 125s
  • MKL: 237s
  • MKL w. 2 threads: 234s
  • MKL + reduce_sum w. 2 threads: 124s

OpenBLAS and MKL provide a 5% time reduction. Additional BLAS threads for OpenBLAS or MKL did not help. reduce_sum gives a 50% improvement. This hierarchical model has only a vector (length 11245) times scalar product and a couple of for loops from 1 to N. It is natural that external BLAS/LAPACK don't help here, but reduce_sum works well (see the sketch below).
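
For contrast with the BLAS route, this is roughly what reduce_sum parallelization requires; a generic sketch for a plain normal likelihood (not the brms-generated code, and the names are placeholders):

```stan
functions {
  // partial log-likelihood over one slice of the observations
  real partial_sum(array[] real y_slice, int start, int end,
                   real mu, real sigma) {
    return normal_lpdf(y_slice | mu, sigma);
  }
}
data {
  int<lower=1> N;
  array[N] real y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // grainsize 1 lets the scheduler pick slice sizes; needs CmdStan
  // compiled with STAN_THREADS=true and num_threads set at runtime
  target += reduce_sum(partial_sum, y, 1, mu, sigma);
}
```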

6. Conclusion

Models with big matrix-vector or matrix-matrix products, or matrix decompositions, can get significant time reductions. With one thread, -march=native -mtune=native should always be used when compiling CmdStan and models. For a single thread, small time reductions can be obtained with an external BLAS/LAPACK, but for most people it's not worth the trouble of installing and configuring.

For those who have big matrix operations, the external libraries can provide speedups of at least up to 60% by using more threads (probably more with better CPUs than the one in my laptop). The Eigen documentation also mentions the possibility of multithreading within Eigen, but that uses OpenMP, and I couldn't get it working (while OpenBLAS and MKL support pthreads, which worked without any additional tricks). Bigger time reductions are observed with compound functions, which hints that varmat could also get bigger time reductions than matvar.
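
For completeness, the BLAS thread counts in the experiments above can be controlled at run time with the libraries' standard environment variables; a sketch with a placeholder binary and data file name (reduce_sum threading is separate and goes through CmdStan's num_threads argument):

```bash
# OpenBLAS: set the BLAS thread count for this run
OPENBLAS_NUM_THREADS=2 ./mymodel sample data file=mymodel_data.json

# MKL: same idea with MKL's environment variable
MKL_NUM_THREADS=2 ./mymodel sample data file=mymodel_data.json
```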

EDIT1: typo fixes
EDIT2: added info on what MKL is missing
EDIT3: updated the results with -march=native -mtune=native
