Speedup by using external BLAS/LAPACK with CmdStan and CmdStanR/Py

Stan uses Eigen for many matrix computations. Eigen has its own internal BLAS and LAPACK routines, and there is an opportunity to get some speedup by using external libraries that provide BLAS and LAPACK. This speedup can be obtained without any changes to Stan code, but it requires recompilation.

I tested with a few examples and got 0%-40% speedups using OpenBLAS.

How to enable external BLAS and LAPACK:

  • check that you have some BLAS and LAPACK installed (Linux systems are likely to have Netlib's BLAS and LAPACK installed by default, but in my experiments Netlib BLAS was slower than Eigen's internal BLAS)
  • of several free options, OpenBLAS was the fastest in my experiments
  • add the following lines to CmdStan's make/local:
CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
LDLIBS += -lblas -llapack -llapacke
  • compile
  • on at least some Linux systems you can easily switch which libraries provide BLAS and LAPACK (e.g. I compared libraries by Netlib, OpenBLAS, BLIS, ATLAS, and Intel MKL, although MKL failed as a replacement for liblapack):
sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
sudo update-alternatives --config liblapack.so.3-x86_64-linux-gnu
  • OpenBLAS and Intel MKL can also use more than one thread for BLAS/LAPACK, but I got the best results running with one OpenBLAS or MKL thread. You can set the number of threads with the environment variables OPENBLAS_NUM_THREADS and MKL_NUM_THREADS. If one thread is all you need, you can instead choose the serial OpenBLAS library.
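The steps above can be sketched as a shell session; the CmdStan location and the thread settings below are assumptions to adapt to your setup:

```shell
# Sketch of the steps above; $CMDSTAN is an assumed CmdStan location.
CMDSTAN="${CMDSTAN:-$HOME/cmdstan}"
mkdir -p "$CMDSTAN/make"

# Tell Eigen to call external BLAS/LAPACK and link against them.
cat >> "$CMDSTAN/make/local" <<'EOF'
CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
LDLIBS += -lblas -llapack -llapacke
EOF

# Rebuild CmdStan so the new flags take effect (uncomment to run).
# make -C "$CMDSTAN" build -j4

# One BLAS/LAPACK thread worked best in my experiments.
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
```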

If you are using cmdstanr, you can modify the file make/local and rebuild CmdStan from R:

cpp_options = list("CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE", "LDLIBS += -lblas -llapack -llapacke")
cmdstanr::cmdstan_make_local(cpp_options = cpp_options, append = TRUE)
cmdstanr::rebuild_cmdstan(cores = 4)
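For cmdstanpy there is, to my knowledge, no rebuild helper like cmdstanr's, but the same effect can be had from the shell. This build-configuration sketch assumes cmdstanpy is installed and that `cmdstanpy.cmdstan_path()` points at the CmdStan installation you want to modify:

```shell
# Locate CmdStan via cmdstanpy, append the flags, and rebuild.
CMDSTAN_DIR="$(python3 -c 'import cmdstanpy; print(cmdstanpy.cmdstan_path())')"
printf '%s\n' \
  'CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE' \
  'LDLIBS += -lblas -llapack -llapacke' >> "$CMDSTAN_DIR/make/local"
make -C "$CMDSTAN_DIR" build -j4
```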

You can use Intel MKL with the above approach, but you might get a bit more speedup using MKL's direct calls and vectorized functions, although I didn't see much difference compared to OpenBLAS. To get the additional MKL features, add for example the following (where I intentionally chose the sequential, i.e. non-threaded, variant). With this approach you lose the easy way to switch libraries (but you could keep different CmdStan builds in different directories).

CXXFLAGS += -DEIGEN_USE_MKL_ALL -I"/usr/include/mkl"
LDLIBS += -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

Note: Intel MKL is proprietary software, and it is the users' responsibility to buy or register for a community (free) Intel MKL license for their products. Moreover, the license of the user's product has to allow linking to proprietary software, which excludes any unmodified versions of the GPL.

More Linux (Ubuntu) specific information: I installed the packages using the Synaptic package manager (a graphical APT interface):

  • liblapacke-dev, liblapacke (LAPACKE is required by Eigen to support external LAPACK)
  • libopenblas-dev, libopenblas-pthread-dev, libopenblas-serial-dev, libopenblas0, libopenblas0-pthread, libopenblas0-serial (I installed both the pthread-parallel and serial versions to be able to compare them, and in some comparisons the serial version was faster than pthread with one thread)
  • intel-mkl, libmkl-dev, libmkl-threading-dev, libmkl-sequential, and everything Synaptic recommended (20+ packages)
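If you prefer the command line over Synaptic, the same packages can be installed with apt; the package names are the Ubuntu ones listed above and may differ between releases:

```shell
# LAPACKE headers/libraries needed by Eigen's external-LAPACK support
sudo apt-get install liblapacke-dev liblapacke
# OpenBLAS, both pthread and serial variants, for comparison
sudo apt-get install libopenblas-dev libopenblas0-pthread libopenblas0-serial
# Intel MKL (proprietary; see the license note above)
# sudo apt-get install intel-mkl libmkl-dev
```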

*-dev packages are somewhat misleadingly named: it sounds as if you would need them only if you develop those libraries themselves, but they are always needed when compiling a program that calls the corresponding library, and since Stan models are compiled we need them, too.

I also installed BLIS and ATLAS, but they performed worse than OpenBLAS.

More information

If you try this, please report results here.

EDIT: added information about which APT packages I installed to get this working


Thanks a lot for this tutorial!

I’ve tried it with OpenBLAS (through CmdStanR), but I’m having some issues:

  1. Compilation fails due to not finding -llapacke (error: /bin/ld: cannot find -llapacke)
  2. If I remove this argument, it compiles successfully and my test model runs, but very slowly: I get ~4x slower sampling without within chain parallelization, and ~6x slower with parallelization.

Some potentially relevant info:

  • CmdStan 2.28.2
  • I’m using WSL2 on W11 (Ubuntu 20.04.3 LTS - GNU/Linux 5.10.60.1-microsoft-standard-WSL2 x86_64)
  • CPU is Ryzen 5950x
  • Other cpp_options I use are: list(STAN_THREADS = TRUE, PRECOMPILED_HEADERS = TRUE, STAN_CPP_OPTIMS = TRUE)

Disclaimer: Total noob at BLAS stuff, I have no idea what I’m doing.

Edit:

  • Model is a simple Bernoulli GLM with 10 rnorm() predictors, and I use brms to generate the Stan code and data.
  • OpenBLAS was installed with sudo apt-get install libopenblas-dev
  • Both BLAS and LAPACK are set to OpenBLAS (/usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 and /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3)
  • I have not changed the default OPENBLAS_NUM_THREADS (I have no idea what it’s doing)

You also need to install liblapacke (note the e at the end); it's available via apt-get as well. LAPACKE provides the interface between Eigen and LAPACK libraries.

For brms it seems that it's best to set OPENBLAS_NUM_THREADS=1. I've also installed openblas-serial, which always uses just one thread. By default OPENBLAS_NUM_THREADS is 4 or 8 (I can't remember which), and if in addition you run 4 chains in parallel, you get 16-32 threads, which leads to high inefficiency.
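For example, one way to pin OpenBLAS to a single thread before launching chains (the model binary name here is illustrative):

```shell
# Limit OpenBLAS to one thread per process; running chains in parallel
# is then the only source of parallelism.
export OPENBLAS_NUM_THREADS=1
# ./my_model sample data file=my_data.json   # illustrative; run your model as usual
```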


Ah, my bad, I thought this library was included in OpenBLAS.
I installed it, but I still get the same error at compilation. Maybe I need to provide the liblapacke location in some way?

LAPACK support is included, but not LAPACKE, which is needed as a bridge between Eigen and LAPACK. I will clarify this in the original post later today.

In my Ubuntu, the package went to the usual place, and I didn't need to provide anything else. Without seeing what you included in make/local, and the specific error, it's difficult to help more. Check that you have the correct number of l's.

If it helps, here are a few screenshots of the cpp_options I use, plus the results of the “checks” you mentioned in your post.

Ah, I found the problem: I also needed to install liblapacke-dev (and potentially libopenblas-base, but I suspect the first one was the culprit). Thanks again for your help (I’m pretty new to Linux overall).

Gonna try to see if model runtime is better now.


I’ve used Linux for a long time, but that helps only a little when there are so many different packages that may need to be installed. I’ll add more details about what I installed.


Please tell more. Which models gave 40% speedups?

Where do we need 40% speedups the most?


I should have been doing something else, but here goes

1. Introduction

  • different BLAS / LAPACK libraries introduce another non-reproducibility issue, as the computations are not bit-by-bit the same
  • MKL didn’t work with Cholesky: MKL is missing the LAPACK routine cgesvdq, which computes an “SVD with a QR-Preconditioned QR SVD Method for GE matrices” and is used by cholesky_decompose or multi_normal_cholesky
  • to get a significant speed improvement, there need to be big matrix operations, e.g. matrix-matrix products (handled by BLAS) or Cholesky (or QR, SVD, etc., handled by LAPACK)
  • the profiler can likely be used to find the cases where big matrix operations dominate the target computation
  • these are easy to test, as no changes to the code are required (unlike reduce_sum parallelization)
  • I tested with several models, none slow enough to take hours or days, but these already give a good idea of when this is likely to be useful and when not

2. Normal linear regression with N=1e4, p=100

2.1. Stan code with y ~ normal(alpha + x*beta, sigma);

  • Eigen: 24s
  • OpenBLAS: 22s
  • OpenBLAS w. 2 threads: 22s
  • MKL: 22s
  • MKL w. 2 threads: 18s

A small 8% time reduction using OpenBLAS. Using more OpenBLAS threads doesn’t help. Two MKL threads shave off another 8%, which is a lousy improvement for doubling the resources.

2.2. Stan code with y ~ normal_id_glm(x, alpha, beta, sigma);

  • Eigen: 20s
  • OpenBLAS: 19s
  • OpenBLAS w. 2 threads: 10s
  • MKL: 20s
  • MKL w. 2 threads: 11s

A small 5% time reduction using OpenBLAS with 1 thread, but an excellent 50% time reduction with 2 threads (near optimal). And this doesn’t need reduce_sum or any change in the code. The matrix is quite big: 10000 x 100.

The great 2-thread performance was not observed with y ~ normal(alpha + x*beta, sigma);, so it seems that the biggest benefits are obtained when autodiff is not involved (what will happen with varmat?).

2.3. Stan code generated with brm(y ~ ., ...)

  • Eigen: 36s
  • Eigen + reduce_sum w. 2 threads: 24s
  • OpenBLAS: 25s
  • OpenBLAS w. 2 threads: 31s
  • OpenBLAS + reduce_sum w. 2 threads: 15s

The brms-generated code is much slower than simpler hand-coded Stan code. brms uses reduce_sum, which has its own overhead, too. OpenBLAS provides a 30-40% time reduction, which here seems to be independent of the reduce_sum 2-thread time reduction (also 30-40%). Using more threads for OpenBLAS increases the time by 25%.

3. GP with covariance matrix N=266

3.1. Stan code with L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

  • Eigen: 95s
  • OpenBLAS: 65s
  • OpenBLAS w. 2 threads: 59s
  • MKL (BLAS) + OpenBLAS (LAPACK): 56s
  • MKL (BLAS) w. 2 threads + OpenBLAS (LAPACK) w. 2 threads: 61s

OpenBLAS provides a 32% time reduction; MKL + OpenBLAS provides 41%. The matrix is still quite small, and using more threads gives only a small additional time reduction.

3.2. Stan code with yn ~ multi_normal(zeros, K_f);

  • Eigen: 110s
  • OpenBLAS: 56s
  • OpenBLAS w. 2 threads: 54s
  • MKL (BLAS) + OpenBLAS (LAPACK): 47s
  • MKL (BLAS) w. 2 threads + OpenBLAS (LAPACK) w. 2 threads: 44s

First, it is notable that when using Eigen, yn ~ multi_normal(zeros, K_f); is 16% slower than L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

OpenBLAS provides a 49% time reduction; MKL + OpenBLAS provides 57% (even though MKL is not used for the Cholesky). The matrix is still quite small, and using more threads gives only a small additional time reduction.

Again, a bigger time reduction was obtained when using a compound function (multi_normal vs cholesky_decompose + multi_normal_cholesky).

4. brms hierarchical heteroscedastic linear regression with N=11245

This brms model was suggested by Donald Williams

bf(rt ~ congruency + (congruency | x | ID),
   sigma ~ congruency + (congruency | x | ID))

  • Eigen: 260s
  • Eigen + reduce_sum w. 2 threads: 123s
  • OpenBLAS: 250s
  • OpenBLAS w. 2 threads: 250s
  • OpenBLAS + reduce_sum w. 2 threads: 120s
  • OpenBLAS w. 2 threads + reduce_sum w. 2 threads: 125s
  • MKL (BLAS) + OpenBLAS (LAPACK): 240s
  • MKL (BLAS) + OpenBLAS (LAPACK) + reduce_sum w. 2 threads: 124s

OpenBLAS provides a 4% time reduction and MKL an 8% time reduction. Additional threads for OpenBLAS or MKL did not help. reduce_sum gives a 53% improvement. This hierarchical model has only a vector (length 11245) times scalar product and a couple of for loops from 1 to N, so it is natural that external BLAS/LAPACK don’t help while reduce_sum works well.

5. Conclusion

Models with big matrix-vector or matrix-matrix products, or matrix decompositions, can get significant time reductions. Even with just one thread, up to a 57% time reduction was observed in these experiments. Bigger time reductions are observed with compound functions, which hints that varmat could also get bigger time reductions than matvar.

EDIT1: typo fixes
EDIT2: added info what MKL is missing
