Speedup by using external BLAS/LAPACK with CmdStan and CmdStanR/Py

Stan uses Eigen for many matrix computations. Eigen has internal BLAS and LAPACK matrix computation routines. There is opportunity to get some speedup by using external libraries that provide BLAS and LAPACK routines. This speedup can be obtained without any changes in Stan code, but requires recompilation.

EDIT2: I had not realized CXXFLGAS += -march=native -mtune=native are not enabled by default when compiling CmdStan, and most of the speed difference in single thread case is explained by that. I’ve updated the whole post and later reply with new results.

I tested with a few examples and got 0%-60% speedups using OpenBLAS and Intel MKL.

How to enable external BLAS and LAPACK:

  • check that you have some BLAS and LAPACK installed (linux systems are likely to have BLAS and LAPACK by Netlib installed by default, but in my experiments Netlib BLAS was slower than Eigen internal BLAS)
  • From several free options, OpenBLAS was the fastest in my experiments
  • Intel MKL was sometimes slightly faster, but it’s non-free
  • Add following lines to CmdStan make/local
CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
LDLIBS += -lblas -llapack -llapacke
  • compile
  • at least in some linux systems you can easily change which libraries are used for BLAS and LAPACK (e.g. I compared libraries by Netlib, OpenBLAS, BLIS, ATLAS, and Intel MKL)
sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
sudo update-alternatives --config liblapack.so.3-x86_64-linux-gnu
sudo update-alternatives --config liblapacke.so.3-x86_64-linux-gnu
  • By default none APT packages for OpenBLAS and MKL didn’t configure alternatives for liblapacle, which meant that MKL didn’t work at all for LAPACK functions, and OpenBLAS probably underperformed. Eventually I installed OpenBLAS from source and configured alternatives for liblapacke, too.
  • e.g. OpenBLAS and Intel MKL can use more than one thread also for BLAS/LAPACK. You can set the number of threads with environment variables OPENBLAS_NUM_THREADS and MKL_NUM_THREADS. Or if 1 thread is good, you can choose OpenBLAS-serial library.

If you are using cmdstanr, you can modify the file make/local and rebuild CmdStan from R:

cpp_options = list("CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE", "LDLIBS += -lblas -llapack -llapacke")
cmdstanr::cmdstan_make_local(cpp_options = cpp_options, append = TRUE)
cmdstanr::rebuild_cmdstan(cores = 4)

You can use Intel MKL with the above approach, but might get a bit more speedup using direct calls and vectorized functions, although I didn’t see much difference to OpenBLAS. To get the additional MKL features, add for example the following (where I intentionally chose sequential ie no parallel threads). With this approach you lose the easy way to switch (but you could have different CmdStan versions in different directories).

CXXFLAGS += -DEIGEN_USE_MKL_ALL -I"/usr/include/mkl"
LDLIBS += -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

Note: Intel MKL is a proprietary software and it is the responsibility of users to buy or register for community (free) Intel MKL licenses for their products. Moreover, the license of the user product has to allow linking to proprietary software that excludes any unmodified versions of the GPL.

More Linux (Ubuntu) specific information: I installed using Synaptic package manager (a graphical APT interface)

  • liblapacke-dev, liblapacke (lapacke required by Eigen to support external LAPACK)
  • libopenblas-dev, libopenblas-pthread-dev, libopenblas-serial-dev, libopenblas0, libopenblas0-pthread, libopenblas0-serial (I installed both pthread parallel and serial versions to be able to compare and as in some copmarison serial was faster than phtread with one thread)
  • intel-mkl, libmkl-dev, libmkl-threading-dev, libmk-sequential and everything synaptic recommended (20+ packages)
  • I’ll add later details on what I had to do, to get liblapacake part to work better.

*-dev packages are a bit misleadingly named as it sounds like you would need if you develop those packages or some other packages, but they are need always when compiling a program that calls the corresponding library, and as Stan models are compiled we need them, too.

I also installed BLIS and ATLAS, but they did perform worse than OpenBLAS.

More information

If you try this, please report results here.

EDIT1: added information about which APT packages I installed to get this working

28 Likes

Thanks a lot for this tutorial !

I’ve tried it with OpenBLAS (through CmdStanR), but I’m having some issues:

  1. Compilation fails due to not finding -llapacke (error: /bin/ld: cannot find -llapacke)
  2. If I remove this argument, it compiles successfully and my test model runs, but very slowly: I get ~4x slower sampling without within chain parallelization, and ~6x slower with parallelization.

Some potentially relevant info:

  • CmdStan 2.28.2
  • I’m using WSL2 on W11 (Ubuntu 20.04.3 LTS - GNU/Linux 5.10.60.1-microsoft-standard-WSL2 x86_64)
  • CPU is Ryzen 5950x
  • Other cpp_options I use are: list(STAN_THREADS = TRUE, PRECOMPILED_HEADERS = TRUE, STAN_CPP_OPTIMS = TRUE)

Disclaimer: Total noob at BLAS stuff, I have no idea what I’m doing.

Edit:

  • Model is a simple Bernouilli GLM with 10 rnorm() predictors, and I use brms to generate the stan code and data.
  • OpenBLAS was installed with sudo apt-get install libopenblas-dev
  • Both BLAS and LAPACK are set to OpenBLAS (/usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 and /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3)
  • I have not changed the default OPENBLAS_NUM_THREADS (I have no idea what it’s doing)
1 Like

You need to install also liblapacke (note the e in the end). It’s also available via apt-get. lapacke provides interface between Eigen and lapack libraries.

For brms it seems that it’s best to set OPENBLAS_NUM_THREADS=1. I’ve also installed openblas-serial, which always use just one thread. By default OPENBLAS_NUM_THREADS is 4 or 8 (I can’t remember) and if in addition you run 4 chains parallel, you get 16-32 threads which leads to high inefficiency.

1 Like

Ah, my bad, I thought this library was included in OpenBLAS.
I installed it, but I still get the same error at compilation. Maybe I need to provide liblapacke location in a way ?

LAPACK support is included, but not LAPACKE which is needed between Eigen and LAPACK. I will clarify this in the original post later today.

In my Ubuntu, the package when to the usual place, and I didn’t need to provide anything else. Without seeing what you included in make/local, and the specific error, it’s difficult to help more. Check that you have the correct number of l’s.

If it helps, here are a few screenshots of the cpp_options I use, plus the results of the “checks” you mentioned in your post.

Ah, I found the problem: I also needed to install liblapacke-dev (and potentially libopenblas-base, but I suspect the first one was the culprit). Thanks again for your help (I’m pretty new to Linux overall).

Gonna try to see if model runtime is better now.

1 Like

I’ve used Linux for a long time, but helps only a little when there are so many different packages that may need to be installed. I’ll add more details what I installed.

1 Like

please tell more. which models gave 40% speedups?

where do we need 40% speedups the most?

2 Likes

I should have been doing something else, but here goes

1. Introduction

  • different BLAS / LAPACK libraries introduce another non-reproducibility issue, as the computations are not bit by bit the same
  • To get MKL to work with, e.g. Cholesky, liblapacke needs to be replaced with MKL. Instead of at linking time selecting which library to use, Linux systems allow alternatives that can easily change which library is used to present common libraries. To get MKL and OpenBLAS to work better you need to set them as alternatives for liblapacke, which needs a bit of work. Or you can replace -llapacke with -openblas or -lmkl_rt at linking time.
  • to get a significant speed improvent, there needs to be big matrix operations, e.g. matrix-matrix-products (handled by BLAS) or Cholesky (or QR, SVD, etc, handled by LAPACK)
  • it is likely that the profiler can be used to find the cases where big matrix operations dominate the target computation
  • these are easy to test as this doesn’t require changes to the code (like sum_reduce parallelization requires)
  • I tested with several models, but nothing really slow that would take hours or days, but these already give a good idea when this is likely to be useful and when not.

2. normal linear regression with N=1e4, p=100

2.1. Stan code with y ~ normal(alpha + x*beta, sigma);

  • Eigen without march/mtune: 24s
  • Eigen: 20s
  • OpenBLAS: 17s
  • OpenBLAS w. 2 threads: 12s
  • MKL: 20s
  • MKL w. 2 threads: 16s

Small 15% time reduction using OpenBLAS, and with two threads 40% improvement. MKL doesn’t provide any time reduction and only 20% time reduction with two threads.

2.2. Stan code with y ~ normal_id_glm(x, alpha, beta, sigma);

  • Eigen without march/mtune: 20s
  • Eigen: 13s
  • OpenBLAS: 15s
  • OpenBLAS w. 2 threads: 7s
  • MKL: 12s
  • MKL w. 2 threads: 7s

OpenBLAS is slower than Eigen, but with two threads provides 46% time
reduction. MKL provides small 8% time reduction, but with two threads
provides 46% time reduction. Using parallel threads doesn’t need any
sum_reduce or any change in the code. The matrix is quite big 10000 x
100.

The great 2 thread performance was not observed with y ~ normal(alpha + x*beta, sigma);, so it seems that the best benefits will be got when the autodiff is not involved (what will happen with varmat?)

2.3. Stan code generated with brm(y ~ ., ...)

  • Eigen without march/mtune: 26s
  • Eigen: 14s
  • Eigen + sum_reduce w. 2 threads: 16s
  • OpenBLAS: 15s
  • OpenBLAS w. 2 threads: 6s
  • OpenBLAS + sum_reduce w. 2 threads: 12s
  • MKL: 16s
  • MKL w. 2 threads: 7s

brms speed is on par with hand coded code. Running with more sum_reduce threads deosn’t help, while using more threads in BLAS provides time reduction 50-60% time reduction.

3. GP with covariance matrix N=266

3.1. Stan code with L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

  • Eigen without march/mtune: 95s
  • Eigen: 53s
  • OpenBLAS: 58s
  • OpenBLAS w. 2 threads: 59s
  • MKL: 51s
  • MKL w. 2 threads: 45s
  • MKL w. 3 threads: 43s

OpenBLAS provides 9% time increase. MKL provides negligible time reduction, unless more threads are used to get up to 19% time reduction.

3.2. Stan code with yn ~ multi_normal(zeros, K_f);

  • Eigen without march/mtune: 110s
  • Eigen: 59s
  • OpenBLAS: 56s
  • OpenBLAS w. 2 threads: 50s
  • MKL : 48s
  • MKL w. 2 threads: 44s
  • MKL w. 3 threads: 40s

First it is notable that when using Eigen, yn ~ multi_normal(zeros, K_f); is 11% slower than L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

OpenBLAS provides 5% time reduction, MKL provides 19% time reduction. With more threads we can get up to 32% time reduction. The matrix is still quite small, and using more threads gives only a small additional time reduction.

A bigger time reduction was obtained when using a compound function (multi_normal vs cholesky_decompose + multi_normal_cholesky).

4. GP with covariance matrix N=2128

This is using BFGS which makes only 23 target evaluations to converge

4.1. Stan code with L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

  • Eigen 19s
  • OpenBLAS 20s
  • OpenBLAS w. 2 threads 15s
  • OpenBLAS w. 3 threads 13s
  • MKL 19s
  • MKL w. 2 threads 14s
  • MKL w. 3 threads 12s

No difference between Eigen, OpenBLAS and MKL, but OpenBLAS and MKL
can provide 21-37% time reduction when using more threads.

4.2. Stan code with yn ~ multi_normal(zeros, K_f);

  • Eigen 32s
  • OpenBLAS 30s
  • OpenBLAS w. 2 threads 21s
  • OpenBLAS w. 3 threads 18s
  • MKL 30s
  • MKL w. 2 threads 21s
  • MKL w. 3 threads 18s

5. brms hierarchical heteroscedastic linear regression with N=11245, p=1, grouping x | ID

This brms model was suggested by Donald Williams

bf(rt ~ congruency + (congruency | x | ID), sigma ~ congruency + (congruency | x | ID))
  • Eigen without march/mtune: 260s
  • Eigen: 250s
  • Eigen + sum_reduce w. 2 threads: 123s
  • OpenBLAS: 236s
  • OpenBLAS w. 2 threads: 234s
  • OpenBLAS + sum_reduce w. 2 threads: 120s
  • OpenBLAS w. 2 threads + sum_reduce w. 2 threads 125s
  • MKL: 237s
  • MKL w. 2 threads : 234s
  • MKL + sum_reduce w. 2 threads: 124s

OpenBLAS and MKL provides 5% time reduction. Additional threads for OpenBLAS or MKL did not help. sum_reduce gives 50% improvement. This hierarchical model has only a vector (length 11245) times scalar product and couple for loops from 1 to N. It is natural that external BLAS/LAPACK don’t help, but sum_reduce works well.

6. Conclusion

Models with big matrix-vector or matrix-matrix products, or matrix decompositions can get significant time reductions. With one thread, -march=native -mtune=native should be always used when compiling CmdStan and models. For single thread, small time-reductions can be obtained with external BLAS/LAPACK, but for most people it’s not worth the trouble of installing and configuring. For those who have big matrix operations, the external packages can provide speedupds at least up to 60% by using more threads (probably more with better CPUs than what I have in my laptop). Eigen documentation also mentions possibility of multithreading within Eigen, but it’s with openmp, and I couldn’t get it working (while OpenBLAS and MKL support pthreads that worked without any additional tricks). Bigger time reductions are observed with compound functions, which hints that varmat could also get bigger time reductions than matvar.

EDIT1: typo fixes
EDIT2: added info what MKL is missing
EDIT3: updated the results with -march=native -mtune=native

12 Likes

Well, this explains why I’ve been seeing speed gains of around 30% under Linux using OpenBLAS compared to run time under Windows on the same system. I had assumed the discrepancy had something to do with more the fact that support for parallel computation in R under Windows is generally awful but I guess there’s a bit more to it than that.

How can we verify that Eigen C++ uses at least AVX256 instruction set?

FAQ Eigen vectorization

On the x86-64 architecture, SSE2 is generally enabled by default, but you can enable AVX and FMA for better performance

My make/local

CXXFLAGS+= -O3 -march=native -mtune=native

@avehtari What was the content of your make/local file when you did the tests with Eigen?

Maybe the question to ask is “how do I sanity check that my Stan model compiled on my machine with my compiler and my version of Eigen, Stan, BLAS, etc, has access to AVX instructions”?

One way to check could be to disassemble your compiled model program, and then search the assembly code to see if you can find any AVX instructions. Suppose your compiled model program is named mymodel. On linux you could run objdump -D mymodel > mymodel.s to write a file model.s containing the assembly code, then you could search the assembly code in model.s for AVX instructions.

Andriy Makukha on stack overflow kindly offers this huge awk one-liner that can be used to search:

Find most common AVX instructions (including scalar, including AVX2, AVX-512 family and some FMA like vfmadd132pd ):

awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' mymodel.s

– source: c++ - How to check if compiled code uses SSE and AVX instructions? - Stack Overflow

However, if you are really unlucky, there is also the possibility that your model program contains AVX instructions but the code doesn’t spend much or any time executing those instructions at runtime.

You could also measure to see what instructions in your program actually execute at runtime, especially inside hot loops. One way to confirm that the AVX instructions are actually executing on linux is using the perf profiling tool. You could do something like:

perf record ./mymodel put your usual args here
# wait for mymodel to finish running
perf report

Using perf to analyse what the CPU is doing is invasive and requires elevated permissions for security reasons. On ubuntu I have to run the following commands to temporarily grant my user elevated permissions before I am allowed to run perf:

sudo sysctl -w kernel.perf_event_paranoid=-1
sudo sysctl -w kernel.kptr_restrict=0

Once profiling information has been collected, perf report will show a summary of how much time is spent inside each function. You can zoom in to view each function call and perf will show a listing of assembly code (also annotated with the names of functions from the C / C++ code if your program was compiled with -g debugging information) and the time spent executing each assembly instruction along the margin. By default when you zoom in to view assembly code of a function, perf will focus the viewport on the section of assembly code where the most time is spent. Then you could see which operations are executing and if any of them match up with the huge list of AVX instructions above.

More generally, see Brendan Gregg’s reference: Linux perf Examples

4 Likes

Empty. I don’t remember it reported that any CXXFLAGS would give 30-50% time reduction, so I didn’t try to optimize that part.

I made the tests with Intel i7-10510U, which seems to have AVX2 (based on Intel website).

Thanks for reminding about these. I again forgot that march=native and mtune=native are not on by default, and they have a huge effect. The difference is now much smaller between Eigen internal and external BLAS/LAPACK. I’ll add the new results later. I think cmdstanr installer should ask if someone doesn’t want to use them, and by default use them as the speed difference is huge.

2 Likes

I’d suggest STAN_CPP_OPTIMS=true if you want to turn on most magic which @stevebronder once figured out to work ok. I think that includes the m* stuff.

Quick question: I see people adding those options in CmdStanR either as R code (i.e. list(STAN_CPP_OPTIMS = TRUE) ) or as strings with the cmake syntax (i.e. list("STAN_CPP_OPTIMS=true") ). Do both work ?

  • Where is this documented and how much speedup? I couldn’t find it in CmdStan User Guide’s chapter on Installation 1 CmdStan Installation | CmdStan User’s Guide
  • I looked in Makefile, and march ad mtune are not mentioned there. Most of the options mentioned are included in -O2, or gcc manual mentions that they may also make the program slower.
  • why it is CPP_? and not CXX_?
  • Is STAN_CPP_OPTIMS in different architectures?
  • It seems STAN_CPP_OPTIMS fails on Apple Silicon -fwhole-program-vtables in STAN_CPP_OPTIMS breaks linkage on Apple Silicon · Issue #1041 · stan-dev/cmdstan · GitHub
  • Anyway, we often tell people to install Stan using CmdStanR, and then those people don’t read any other installation guide, so suggestion to use any additonal options would be good. I think the march/mtune, which are enabling use of modern features of CPUs, are more important than the compilation optimization tweaks (and most useful optimizations are included in -O2). We’re talking here 30-50% reductions in compile time for enabling march/mtune.
1 Like

Should not matter. Either way works…just check the make/local file.

It‘s key to get people going…and to start with people solve smaller problems where this stuff does not matter.

Optimizations have many pitfalls…e.g. the mtune/march stuff is not a good idea on windows. We barely get the bulk of users running smoothly, optimizations are - to me - secondary.

Secondary in the sense that once you are hooked into stan, then you will find the optimizations…after all I drove reduce_sum…

1 Like