Speedup by using external BLAS/LAPACK with CmdStan and CmdStanR/Py

LAPACK support is included, but not LAPACKE, which is the C interface needed between Eigen and LAPACK. I will clarify this in the original post later today.

On my Ubuntu system, the package went to the usual place, and I didn’t need to provide anything else. Without seeing what you included in make/local, and the specific error, it’s difficult to help more. Check that you have the correct number of l’s in the linker flags (each flag is -l followed by the library name, e.g. -llapacke).
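For reference, a minimal make/local sketch for linking Eigen against the system BLAS/LAPACK on Ubuntu (not necessarily identical to the instructions in the original post; EIGEN_USE_BLAS and EIGEN_USE_LAPACKE are Eigen’s documented switches, and the -l names are the standard Debian library names — adjust to your system):

CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
LDFLAGS += -lblas -llapack -llapacke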

If it helps, here are a few screenshots of the cpp_options I use, plus the results of the “checks” you mentioned in your post.

Ah, I found the problem: I also needed to install liblapacke-dev (and potentially libopenblas-base, but I suspect the first one was the culprit). Thanks again for your help (I’m pretty new to Linux overall).
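For reference, on Ubuntu those packages can be installed with apt (package names may vary between releases):

sudo apt install liblapacke-dev libopenblas-dev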

Gonna try to see if model runtime is better now.

1 Like

I’ve used Linux for a long time, but that helps only a little when there are so many different packages that may need to be installed. I’ll add more details about what I installed.

1 Like

Please tell us more. Which models gave 40% speedups?

Where do we need 40% speedups the most?

2 Likes

I should have been doing something else, but here goes

1. Introduction

  • Different BLAS/LAPACK libraries introduce another non-reproducibility issue, as the computations are not bit-for-bit identical.
  • To get MKL to work with, e.g., Cholesky, liblapacke needs to be replaced with MKL. Instead of selecting the library at link time, Linux systems provide an “alternatives” mechanism that makes it easy to switch which implementation backs a common library. To get MKL and OpenBLAS to work better you need to set them as alternatives for liblapacke, which takes a bit of work (see the sketch after this list). Alternatively, you can replace -llapacke with -lopenblas or -lmkl_rt at link time.
  • To get a significant speed improvement, there need to be big matrix operations, e.g. matrix-matrix products (handled by BLAS) or Cholesky, QR, SVD, etc. (handled by LAPACK).
  • It is likely that the profiler can be used to find the cases where big matrix operations dominate the target computation.
  • These are easy to test, as they don’t require changes to the Stan code (unlike reduce_sum parallelization).
  • I tested with several models, none of them really slow (taking hours or days), but these already give a good idea of when this is likely to be useful and when not.
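As a rough sketch of the alternatives mechanism mentioned above (the alternative names are the Debian/Ubuntu ones and may differ on other distributions; LAPACKE itself may not have its own alternatives group):

sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
sudo update-alternatives --config liblapack.so.3-x86_64-linux-gnu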

2. normal linear regression with N=1e4, p=100

2.1. Stan code with y ~ normal(alpha + x*beta, sigma);

  • Eigen without march/mtune: 24s
  • Eigen: 20s
  • OpenBLAS: 17s
  • OpenBLAS w. 2 threads: 12s
  • MKL: 20s
  • MKL w. 2 threads: 16s

OpenBLAS gives a small 15% time reduction, and with two threads a 40% reduction. MKL provides no time reduction with one thread and only a 20% time reduction with two threads.
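(For anyone reproducing the multi-threaded BLAS runs: OpenBLAS and MKL take their thread counts from environment variables. These are the libraries’ standard variables, not something spelled out above; set them before running the model.)

export OPENBLAS_NUM_THREADS=2
export MKL_NUM_THREADS=2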

2.2. Stan code with y ~ normal_id_glm(x, alpha, beta, sigma);

  • Eigen without march/mtune: 20s
  • Eigen: 13s
  • OpenBLAS: 15s
  • OpenBLAS w. 2 threads: 7s
  • MKL: 12s
  • MKL w. 2 threads: 7s

OpenBLAS is slower than Eigen, but with two threads it provides a 46% time reduction. MKL provides a small 8% time reduction, and with two threads a 46% time reduction. Using parallel threads doesn’t require reduce_sum or any change in the code. The matrix is quite big (10000 x 100).

The great two-thread performance was not observed with y ~ normal(alpha + x*beta, sigma);, so it seems the biggest benefits come when autodiff is not involved (what will happen with varmat?)

2.3. Stan code generated with brm(y ~ ., ...)

  • Eigen without march/mtune: 26s
  • Eigen: 14s
  • Eigen + reduce_sum w. 2 threads: 16s
  • OpenBLAS: 15s
  • OpenBLAS w. 2 threads: 6s
  • OpenBLAS + reduce_sum w. 2 threads: 12s
  • MKL: 16s
  • MKL w. 2 threads: 7s

brms speed is on par with the hand-coded model. Running with more reduce_sum threads doesn’t help, while using more threads in BLAS provides a 50-60% time reduction.

3. GP with covariance matrix N=266

3.1. Stan code with L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

  • Eigen without march/mtune: 95s
  • Eigen: 53s
  • OpenBLAS: 58s
  • OpenBLAS w. 2 threads: 59s
  • MKL: 51s
  • MKL w. 2 threads: 45s
  • MKL w. 3 threads: 43s

OpenBLAS gives a 9% time increase. MKL provides a negligible time reduction unless more threads are used, which gives up to a 19% time reduction.

3.2. Stan code with yn ~ multi_normal(zeros, K_f);

  • Eigen without march/mtune: 110s
  • Eigen: 59s
  • OpenBLAS: 56s
  • OpenBLAS w. 2 threads: 50s
  • MKL: 48s
  • MKL w. 2 threads: 44s
  • MKL w. 3 threads: 40s

First it is notable that when using Eigen, yn ~ multi_normal(zeros, K_f); is 11% slower than L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

OpenBLAS provides a 5% time reduction and MKL a 19% time reduction. With more threads we can get up to a 32% time reduction. The matrix is still quite small, and using more threads gives only a small additional time reduction.

A bigger time reduction was obtained when using a compound function (multi_normal vs cholesky_decompose + multi_normal_cholesky).

4. GP with covariance matrix N=2128

This is using BFGS, which needs only 23 target evaluations to converge.

4.1. Stan code with L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

  • Eigen 19s
  • OpenBLAS 20s
  • OpenBLAS w. 2 threads 15s
  • OpenBLAS w. 3 threads 13s
  • MKL 19s
  • MKL w. 2 threads 14s
  • MKL w. 3 threads 12s

There is no difference between Eigen, OpenBLAS, and MKL with one thread, but OpenBLAS and MKL can provide a 21-37% time reduction when using more threads.

4.2. Stan code with yn ~ multi_normal(zeros, K_f);

  • Eigen 32s
  • OpenBLAS 30s
  • OpenBLAS w. 2 threads 21s
  • OpenBLAS w. 3 threads 18s
  • MKL 30s
  • MKL w. 2 threads 21s
  • MKL w. 3 threads 18s

5. brms hierarchical heteroscedastic linear regression with N=11245, p=1, grouping x | ID

This brms model was suggested by Donald Williams

bf(rt ~ congruency + (congruency | x | ID), sigma ~ congruency + (congruency | x | ID))

  • Eigen without march/mtune: 260s
  • Eigen: 250s
  • Eigen + reduce_sum w. 2 threads: 123s
  • OpenBLAS: 236s
  • OpenBLAS w. 2 threads: 234s
  • OpenBLAS + reduce_sum w. 2 threads: 120s
  • OpenBLAS w. 2 threads + reduce_sum w. 2 threads: 125s
  • MKL: 237s
  • MKL w. 2 threads: 234s
  • MKL + reduce_sum w. 2 threads: 124s

OpenBLAS and MKL provide a 5% time reduction. Additional threads for OpenBLAS or MKL did not help. reduce_sum gives a 50% improvement. This hierarchical model has only a vector (length 11245) times scalar product and a couple of for loops from 1 to N. It is natural that external BLAS/LAPACK don’t help here, but reduce_sum works well.

6. Conclusion

Models with big matrix-vector or matrix-matrix products, or matrix decompositions, can get significant time reductions. With one thread, -march=native -mtune=native should always be used when compiling CmdStan and models. For a single thread, small time reductions can be obtained with an external BLAS/LAPACK, but for most people it’s not worth the trouble of installing and configuring. For those who have big matrix operations, the external packages can provide speedups of up to 60% by using more threads (probably more with better CPUs than the one in my laptop). The Eigen documentation also mentions the possibility of multithreading within Eigen, but that uses OpenMP and I couldn’t get it working (while OpenBLAS and MKL support pthreads, which worked without any additional tricks). Bigger time reductions are observed with compound functions, which hints that varmat could also get bigger time reductions than matvar.
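(As a side check, not part of the benchmarks above: to confirm that a compiled model really picked up the external library rather than Eigen’s built-in routines, listing the binary’s shared-library dependencies is a quick test; mymodel is a placeholder for your compiled model binary.)

ldd ./mymodel | grep -iE 'blas|lapack|mkl'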

EDIT1: typo fixes
EDIT2: added info on what MKL is missing
EDIT3: updated the results with -march=native -mtune=native

12 Likes

Well, this explains why I’ve been seeing speed gains of around 30% under Linux using OpenBLAS compared to run time under Windows on the same system. I had assumed the discrepancy had more to do with the fact that support for parallel computation in R under Windows is generally awful, but I guess there’s a bit more to it than that.

How can we verify that Eigen uses at least the 256-bit AVX instruction set?

FAQ Eigen vectorization

On the x86-64 architecture, SSE2 is generally enabled by default, but you can enable AVX and FMA for better performance

My make/local

CXXFLAGS+= -O3 -march=native -mtune=native
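(One generic way to see which vector instruction sets -march=native enables for your compiler is to dump its predefined macros and look for the AVX/FMA ones; this is a plain GCC/Clang trick, not anything Stan-specific:)

echo | gcc -march=native -dM -E - | grep -E '__(AVX|AVX2|AVX512F|FMA)__'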

@avehtari What was the content of your make/local file when you did the tests with Eigen?

Maybe the question to ask is “how do I sanity check that my Stan model compiled on my machine with my compiler and my version of Eigen, Stan, BLAS, etc, has access to AVX instructions”?

One way to check could be to disassemble your compiled model program, then search the assembly code to see if you can find any AVX instructions. Suppose your compiled model program is named mymodel. On Linux you could run objdump -D mymodel > mymodel.s to write a file mymodel.s containing the assembly code, then search mymodel.s for AVX instructions.

Andriy Makukha on Stack Overflow kindly offers this huge awk one-liner that can be used to search:

Find most common AVX instructions (including scalar, including AVX2, AVX-512 family and some FMA like vfmadd132pd ):

awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' mymodel.s

– source: c++ - How to check if compiled code uses SSE and AVX instructions? - Stack Overflow

However, if you are really unlucky, there is also the possibility that your model program contains AVX instructions but the code doesn’t spend much or any time executing those instructions at runtime.

You could also measure which instructions in your program actually execute at runtime, especially inside hot loops. One way to confirm that AVX instructions are actually executing on Linux is to use the perf profiling tool. You could do something like:

perf record ./mymodel put your usual args here
# wait for mymodel to finish running
perf report

Using perf to analyse what the CPU is doing is invasive and requires elevated permissions for security reasons. On Ubuntu I have to run the following commands to temporarily grant my user elevated permissions before I am allowed to run perf:

sudo sysctl -w kernel.perf_event_paranoid=-1
sudo sysctl -w kernel.kptr_restrict=0

Once profiling information has been collected, perf report will show a summary of how much time is spent inside each function. You can zoom in to view each function call and perf will show a listing of assembly code (also annotated with the names of functions from the C / C++ code if your program was compiled with -g debugging information) and the time spent executing each assembly instruction along the margin. By default when you zoom in to view assembly code of a function, perf will focus the viewport on the section of assembly code where the most time is spent. Then you could see which operations are executing and if any of them match up with the huge list of AVX instructions above.

More generally, see Brendan Gregg’s reference: Linux perf Examples

4 Likes

Empty. I don’t remember it being reported that any CXXFLAGS would give a 30-50% time reduction, so I didn’t try to optimize that part.

I ran the tests with an Intel i7-10510U, which seems to have AVX2 (based on Intel’s website).

Thanks for reminding me about these. I had again forgotten that -march=native and -mtune=native are not on by default, and they have a huge effect. The difference between Eigen’s internal routines and external BLAS/LAPACK is now much smaller. I’ll add the new results later. I think the cmdstanr installer should ask if someone doesn’t want to use them, and use them by default, as the speed difference is huge.

2 Likes

I’d suggest STAN_CPP_OPTIMS=true if you want to turn on most of the magic that @stevebronder once figured out to work OK. I think that includes the m* stuff.
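For reference, in CmdStan’s make/local that is a single line (a sketch; keep any other flags you already have there):

STAN_CPP_OPTIMS=true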

Quick question: I see people adding those options in CmdStanR either as R code (e.g. list(STAN_CPP_OPTIMS = TRUE)) or as strings in make syntax (e.g. list("STAN_CPP_OPTIMS=true")). Do both work?

  • Where is this documented, and how much speedup does it give? I couldn’t find it in the CmdStan User Guide’s chapter on installation: CmdStan Installation | CmdStan User’s Guide
  • I looked in the Makefile, and march and mtune are not mentioned there. Most of the options mentioned are included in -O2, or the gcc manual mentions that they may also make the program slower.
  • Why is it CPP_ and not CXX_?
  • Does STAN_CPP_OPTIMS work on different architectures?
  • It seems STAN_CPP_OPTIMS fails on Apple Silicon: -fwhole-program-vtables in STAN_CPP_OPTIMS breaks linkage on Apple Silicon · Issue #1041 · stan-dev/cmdstan · GitHub
  • Anyway, we often tell people to install Stan using CmdStanR, and those people don’t read any other installation guide, so a suggestion to use any additional options would be good. I think march/mtune, which enable the use of modern CPU features, are more important than the compilation optimization tweaks (most useful optimizations are already included in -O2). We’re talking about 30-50% reductions in sampling time just from enabling march/mtune.

EDIT: fixed compile time → sampling time

1 Like

Should not matter. Either way works…just check the make/local file.

It’s key to get people going… and to start with, people solve smaller problems where this stuff does not matter.

Optimizations have many pitfalls… e.g. the mtune/march stuff is not a good idea on Windows. We barely get the bulk of users running smoothly; optimizations are, to me, secondary.

Secondary in the sense that once you are hooked on Stan, you will find the optimizations… after all, I drove reduce_sum…

1 Like

With the GitHub development version, cmdstanr will warn you that you had flags in your make/local and that you should probably copy them from your previous CmdStan install. But that obviously only works if you had added them at some point.

I am not sure about this one. I specifically remember a ton of rstan issues that were caused by the march and mtune flags, because the rstan install instructions included them and everyone just copy-pasted that. So I am not sure about making them the default option; they should be more prominently suggested somewhere (I don’t currently have a great idea where).

1 Like

Still, the information shouldn’t be hidden if the speed differences are that big. I suggested that install_cmdstan() could ask whether to use march/mtune, with a note that using them can double the speed (drop the time by 50%); if users run into any issues, they can reinstall and choose a “safe mode” when asked about that option.

Can you tell us more about why Windows has problems with march?

reduce_sum is not a good comparison. It was advertised a lot in a certain release, and brms uses it by default. No one had to do any implementation work for march/mtune; it has been available for a long time, it keeps getting more important as CPUs gain new instruction sets, and it’s not mentioned anywhere in the CmdStan installation instructions.

1 Like

I have updated the first post and the post with the timing results to reflect the use of -march=native -mtune=native.

3 Likes

Specifically for CmdStanR, install_cmdstan() can easily be interactive and ask. And the CmdStan User’s Guide installation section should definitely discuss possible options.

1 Like