Stan uses Eigen for many matrix computations. Eigen has its own internal BLAS and LAPACK routines, but there is an opportunity to get some speedup by using external libraries that provide BLAS and LAPACK routines. This speedup can be obtained without any changes to Stan code, but it requires recompilation.
EDIT2: I had not realized that CXXFLAGS += -march=native -mtune=native is not enabled by default when compiling CmdStan, and most of the speed difference in the single-thread case is explained by that. I've updated the whole post and a later reply with new results.
I tested with a few examples and got 0%-60% speedups using OpenBLAS and Intel MKL.
How to enable external BLAS and LAPACK:
- Check that you have some BLAS and LAPACK installed (Linux systems are likely to have Netlib BLAS and LAPACK installed by default, but in my experiments Netlib BLAS was slower than Eigen's internal BLAS)
- From several free options, OpenBLAS was the fastest in my experiments
- Intel MKL was sometimes slightly faster, but it's non-free
- Add the following lines to CmdStan's make/local
CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
LDLIBS += -lblas -llapack -llapacke
- compile
- On at least some Linux systems you can easily change which libraries are used for BLAS and LAPACK (e.g. I compared the libraries from Netlib, OpenBLAS, BLIS, ATLAS, and Intel MKL)
sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
sudo update-alternatives --config liblapack.so.3-x86_64-linux-gnu
sudo update-alternatives --config liblapacke.so.3-x86_64-linux-gnu
- By default, none of the APT packages for OpenBLAS and MKL configured alternatives for liblapacke, which meant that MKL didn't work at all for LAPACK functions, and OpenBLAS probably underperformed. Eventually I installed OpenBLAS from source and configured alternatives for liblapacke, too.
- Some libraries, e.g. OpenBLAS and Intel MKL, can also use more than one thread for BLAS/LAPACK. You can set the number of threads with the environment variables OPENBLAS_NUM_THREADS and MKL_NUM_THREADS. Or, if one thread is enough, you can choose the OpenBLAS-serial library.
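To compare thread counts without exporting the variables for the whole session, the two environment variables can be set for a single command. This is only a sketch: the helper name run_with_blas_threads is made up for illustration, and the model path in the usage comment is an example, not something from my setup.

```shell
# Run a command with the BLAS/LAPACK thread count fixed for that command only.
# run_with_blas_threads is a hypothetical helper name.
run_with_blas_threads() {
    n="$1"
    shift
    # OPENBLAS_NUM_THREADS is read by OpenBLAS, MKL_NUM_THREADS by Intel MKL;
    # setting both means the helper works with either library.
    env OPENBLAS_NUM_THREADS="$n" MKL_NUM_THREADS="$n" "$@"
}

# Example usage (the model binary and data file are illustrative):
#   time run_with_blas_threads 1 ./bernoulli sample data file=bernoulli.data.json
#   time run_with_blas_threads 4 ./bernoulli sample data file=bernoulli.data.json
```

Whether more threads help depends on the model; small matrices may well run faster with one thread.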
If you are using cmdstanr, you can modify the file make/local and rebuild CmdStan from R:
cpp_options = list("CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE", "LDLIBS += -lblas -llapack -llapacke")
cmdstanr::cmdstan_make_local(cpp_options = cpp_options, append = TRUE)
cmdstanr::rebuild_cmdstan(cores = 4)
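Without cmdstanr, the same two lines can be appended from a shell. A minimal sketch, assuming a POSIX shell: the function name add_blas_flags and its directory argument are made up for illustration, and each line is only added if it is not already in make/local, so the function can be rerun safely.

```shell
# Append the BLAS/LAPACK flags to <cmdstan dir>/make/local, skipping lines
# that are already there. add_blas_flags is a hypothetical helper name.
add_blas_flags() {
    cmdstan_dir="$1"
    local_file="$cmdstan_dir/make/local"
    mkdir -p "$cmdstan_dir/make"
    touch "$local_file"
    for line in \
        'CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE' \
        'LDLIBS += -lblas -llapack -llapacke'
    do
        # -q quiet, -x match the whole line, -F treat the pattern as a fixed string
        grep -qxF -- "$line" "$local_file" || printf '%s\n' "$line" >> "$local_file"
    done
}

# Example usage (the path is illustrative):
#   add_blas_flags "$HOME/cmdstan"
```

After changing make/local, CmdStan and the models still have to be recompiled for the flags to take effect.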
You can use Intel MKL with the above approach, but you might get a bit more speedup using direct calls and vectorized functions, although I didn't see much difference compared to OpenBLAS. To get the additional MKL features, add for example the following (where I intentionally chose sequential, i.e. no parallel threads). With this approach you lose the easy way to switch libraries (but you could have different CmdStan versions in different directories).
CXXFLAGS += -DEIGEN_USE_MKL_ALL -I"/usr/include/mkl"
LDLIBS += -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
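Whichever configuration is used, it can be worth checking that a compiled model really is linked against the intended library. One way, sketched here under the assumption that the model is dynamically linked and that ldd is available, is to filter the ldd output; filter_blas_libs is a made-up helper name and the model path is illustrative.

```shell
# Keep only the lines of `ldd` output that mention a BLAS, LAPACK, or MKL library.
# filter_blas_libs is a hypothetical helper name.
filter_blas_libs() {
    # -E extended regex, -i case-insensitive; `|| true` avoids a failing exit
    # status when no matching library is listed.
    grep -Ei 'blas|lapack|mkl' || true
}

# Example usage (the model binary path is illustrative):
#   ldd ./bernoulli | filter_blas_libs
```

With the update-alternatives setup, the libblas.so.3 path that shows up is a symlink, so running readlink -f on it should reveal which implementation is currently selected.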
Note: Intel MKL is a proprietary software and it is the responsibility of users to buy or register for community (free) Intel MKL licenses for their products. Moreover, the license of the user product has to allow linking to proprietary software that excludes any unmodified versions of the GPL.
More Linux (Ubuntu) specific information: I installed using Synaptic package manager (a graphical APT interface)
- liblapacke-dev, liblapacke (lapacke required by Eigen to support external LAPACK)
- libopenblas-dev, libopenblas-pthread-dev, libopenblas-serial-dev, libopenblas0, libopenblas0-pthread, libopenblas0-serial (I installed both the pthread parallel and the serial versions to be able to compare them, as in some comparisons serial was faster than pthread with one thread)
- intel-mkl, libmkl-dev, libmkl-threading-dev, libmkl-sequential and everything Synaptic recommended (20+ packages)
- I'll add details later on what I had to do to get the liblapacke part to work better.
*-dev packages are a bit misleadingly named, as it sounds like you would need them only if you develop those packages or some other packages, but they are always needed when compiling a program that calls the corresponding library, and as Stan models are compiled, we need them, too.
I also installed BLIS and ATLAS, but they performed worse than OpenBLAS.
More information
- Basic Linear Algebra Subprograms - Wikipedia
- LAPACK - Wikipedia
- https://eigen.tuxfamily.org/dox/TopicUsingBlasLapack.html
- Eigen: Using Intel® MKL from Eigen
- https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html
If you try this, please report results here.
EDIT1: added information about which APT packages I installed to get this working