Stan uses Eigen for many matrix computations. Eigen has internal BLAS and LAPACK matrix computation routines. There is an opportunity to get some speedup by using external libraries that provide BLAS and LAPACK routines. This speedup can be obtained without any changes to Stan code, but it requires recompilation.
I tested with a few examples and got 0%-40% speedups using OpenBLAS.
How to enable external BLAS and LAPACK:
- Check that you have some BLAS and LAPACK installed (Linux systems are likely to have Netlib BLAS and LAPACK installed by default, but in my experiments Netlib BLAS was slower than Eigen's internal BLAS)
- Of the several free options I tried, OpenBLAS was the fastest in my experiments
- Add the following lines to the CmdStan make/local file:
    CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
    LDLIBS += -lblas -llapack -llapacke
- At least on some Linux systems you can easily change which libraries are used for BLAS and LAPACK. For example, I compared the libraries by Netlib, OpenBLAS, BLIS, ATLAS, and Intel MKL (although MKL failed as a replacement for liblapack):
    sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
    sudo update-alternatives --config liblapack.so.3-x86_64-linux-gnu
- OpenBLAS and Intel MKL, for example, can also use more than one thread for BLAS/LAPACK, but I got the best results running with 1 OpenBLAS or MKL thread. You can set the number of threads with the environment variables OPENBLAS_NUM_THREADS and MKL_NUM_THREADS. Or, if 1 thread is good enough, you can choose the OpenBLAS-serial library.
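The thread settings above can be sketched as a short shell session. This is only an illustration: the model executable `./bernoulli` and the data file `bernoulli.data.json` are hypothetical placeholders for whatever model you have compiled, and the `ldd` line is just one way to check which BLAS/LAPACK shared libraries the executable picked up.

```shell
# Limit OpenBLAS and MKL to one thread each for this shell session.
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1

# Hypothetical compiled Stan model and data file; adjust to your setup.
MODEL=./bernoulli
DATA=bernoulli.data.json

if [ -x "$MODEL" ]; then
  # List the BLAS/LAPACK shared libraries the executable is linked against.
  ldd "$MODEL" | grep -i -E 'blas|lapack' || echo "no BLAS/LAPACK libraries linked"
  # Run the sampler with the thread limits set above.
  "$MODEL" sample data file="$DATA"
else
  echo "model executable $MODEL not found; nothing was run"
fi
```

To compare thread counts, change the two exported values and rerun the same model.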
If you are using cmdstanr, you can modify the file make/local and rebuild CmdStan from R:
    cpp_options = list(
      "CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE",
      "LDLIBS += -lblas -llapack -llapacke"
    )
    cmdstanr::cmdstan_make_local(cpp_options = cpp_options, append = TRUE)
    cmdstanr::rebuild_cmdstan(cores = 4)
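If you use CmdStan directly rather than through R, roughly the same steps can be done from a shell. The following is a minimal sketch, not a tested recipe: `CMDSTAN_DIR` and the helper function `append_blas_flags` are names I made up for illustration, and the default path `$HOME/cmdstan` is an assumption about where CmdStan is installed.

```shell
# Append the BLAS/LAPACK flags to make/local inside the given CmdStan directory.
# (append_blas_flags is a hypothetical helper, not part of CmdStan.)
append_blas_flags() {
  cmdstan_dir="$1"
  mkdir -p "$cmdstan_dir/make"
  cat >> "$cmdstan_dir/make/local" <<'EOF'
CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
LDLIBS += -lblas -llapack -llapacke
EOF
}

# Assumed install location; change it to where your CmdStan lives.
CMDSTAN_DIR="${CMDSTAN_DIR:-$HOME/cmdstan}"

if [ -f "$CMDSTAN_DIR/makefile" ]; then
  append_blas_flags "$CMDSTAN_DIR"
  # Rebuild so the new flags take effect (4 parallel jobs, as in the R call).
  (cd "$CMDSTAN_DIR" && make clean-all && make build -j4)
else
  echo "CmdStan not found in $CMDSTAN_DIR; nothing changed"
fi
```

Note that the helper appends on every call, so running it twice leaves the flags in make/local twice. Models compiled before the rebuild need to be recompiled to pick up the new libraries.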
You can use Intel MKL with the above approach, but you might get a bit more speedup using direct calls and vectorized functions, although I didn’t see much difference from OpenBLAS. To get the additional MKL features, add for example the following (where I intentionally chose sequential, i.e. no parallel threads). With this approach you lose the easy way to switch libraries (but you could have different CmdStan versions in different directories).
    CXXFLAGS += -DEIGEN_USE_MKL_ALL -I"/usr/include/mkl"
    LDLIBS += -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
Note: Intel MKL is proprietary software, and it is the responsibility of users to buy or register for community (free) Intel MKL licenses for their products. Moreover, the license of the user product has to allow linking to proprietary software, which excludes any unmodified versions of the GPL.
More Linux (Ubuntu) specific information: I installed the following packages using the Synaptic package manager (a graphical APT interface)
- liblapacke-dev, liblapacke (lapacke required by Eigen to support external LAPACK)
- libopenblas-dev, libopenblas-pthread-dev, libopenblas-serial-dev, libopenblas0, libopenblas0-pthread, libopenblas0-serial (I installed both the pthread parallel and the serial versions to be able to compare them, and because in some comparisons serial was faster than pthread with one thread)
- intel-mkl, libmkl-dev, libmkl-threading-dev, libmkl-sequential and everything Synaptic recommended (20+ packages)
The *-dev packages are a bit misleadingly named: it sounds like you would need them only if you develop those packages or some other packages, but they are always needed when compiling a program that calls the corresponding library, and as Stan models are compiled, we need them too.
I also installed BLIS and ATLAS, but they performed worse than OpenBLAS.
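For those who prefer the command line to Synaptic, the LAPACKE and OpenBLAS packages listed above could be installed with apt along these lines. This is a sketch that assumes the package names are available under the same names on your Ubuntu release. It only prints the command, so nothing is installed by accident, and the MKL packages are left out on purpose because Synaptic pulled in many recommended extras for them.

```shell
# LAPACKE and OpenBLAS package names as listed above; availability may
# differ between Ubuntu releases.
PACKAGES="liblapacke-dev liblapacke \
libopenblas-dev libopenblas-pthread-dev libopenblas-serial-dev \
libopenblas0 libopenblas0-pthread libopenblas0-serial"

# Print the install command for review instead of running it.
echo "sudo apt install $PACKAGES"
```

Copy the printed line into a terminal to perform the installation.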
- Basic Linear Algebra Subprograms - Wikipedia
- LAPACK - Wikipedia
- Eigen: Using BLAS/LAPACK from Eigen
- Eigen: Using Intel® MKL from Eigen
- Link Line Advisor for Intel® oneAPI Math Kernel Library
If you try this, please report results here.
EDIT: added information about which APT packages I installed to get this working