Speedup by using external BLAS/LAPACK with CmdStan and CmdStanR/Py

Related to -march, -mtune, -mcpu

What I found

  • recent gcc on x86 and Arm:
    • -march=X: Tells the compiler that X is the minimal architecture the binary must run on. The compiler is free to use architecture-specific instructions. This option behaves differently on Arm and x86. On Arm, -march does not override -mtune, but on x86 -march does override both -mtune and -mcpu.
    • -mtune=X: Tells the compiler to optimize for microarchitecture X, but does not allow the compiler to change the ABI or make assumptions about available instructions. This option has more or less the same meaning on Arm and x86.
    • -mcpu=X: On Arm, this option is a combination of -march and -mtune. It simultaneously specifies the target architecture and optimizes for a given microarchitecture. On x86, this option is a deprecated synonym for -mtune.
  • clang: starting from version 12.0, -mtune works the same way as in gcc.
  • So based on the documentation of the recent compilers, -mtune=native should not use features that are not available, but…
  • … older compilers may behave differently or may not recognize the specific CPU, in which case -mtune=native may fail while -mtune=generic is likely to work. If only -march=native is defined, it usually implies -mtune=native, except when the compiler doesn’t recognize all the CPU details and falls back to -mtune=generic, which seems to be what happened in those Windows cases where just dropping -mtune=native did help. (A make/local sketch of these flags follows below.)
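To make this concrete, here is a minimal sketch of how these flags would be placed in CmdStan’s make/local file. The file path and the choice of flag are illustrative only; pick at most one variant for your platform and rebuild CmdStan afterwards (e.g. make clean-all followed by make build):

```make
# <cmdstan>/make/local -- illustrative sketch, not a recommended default.
# Uncomment at most one of the variants below.

# Tune for the build machine's CPU without enabling extra instructions
# (very old compilers may not recognize the CPU and need -mtune=generic):
# CXXFLAGS += -mtune=native

# Allow instructions available on the build machine's architecture;
# on x86 this usually implies -mtune=native as well:
# CXXFLAGS += -march=native

# Arm: -mcpu combines architecture selection and tuning in one flag
# (on x86 it is only a deprecated synonym for -mtune):
# CXXFLAGS += -mcpu=native
```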

So it’s likely that there would be fewer problems with newer compilers, but an interactive installation could, for example, ask about the following options:

Optimization:

  1. Safe: Should work with all compilers and CPUs. [Default. Recommended for MINGW on Windows or if 2. doesn’t work]
  2. Fast: 0-100% faster computation using CPU-specific instruction sets, especially for bigger matrix operations, but the compilation may fail for some compiler-OS-CPU combinations. (CXXFLAGS += -march=native) [Recommended if 3. doesn’t work]
  3. Faster: 0-100% faster computation using CPU-specific instruction sets and CPU-specific tuning, especially for bigger matrix operations, but the compilation may fail for some compiler-OS-CPU combinations. (CXXFLAGS += -march=native -mtune=native) [Recommended for GCC on Linux; see the make/local sketch after this list]
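As a sketch, options 2 and 3 would correspond to make/local contents along these lines (the path and rebuild step are assumptions about a standard CmdStan installation):

```make
# <cmdstan>/make/local
# Option 2 ("Fast"):
CXXFLAGS += -march=native

# Option 3 ("Faster") would instead use:
# CXXFLAGS += -march=native -mtune=native
```

After editing make/local, CmdStan needs to be rebuilt, e.g. with make clean-all && make build from the CmdStan directory (CmdStanR also provides rebuild_cmdstan()).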

Threads:

  1. Single thread: If you are not using reduce_sum or …? [Default]
  2. Multithread: If you are using reduce_sum or …? (not needed for external BLAS/LAPACK multithreading; see the make/local line below)
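In make/local terms, the multithreading option corresponds to the standard CmdStan flag below; the number of threads is then chosen at run time, e.g. via the STAN_NUM_THREADS environment variable or the corresponding interface argument in CmdStanR/Py:

```make
# <cmdstan>/make/local
# Enable threading support in the compiled model (needed for reduce_sum etc.)
STAN_THREADS = true
```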

BLAS/LAPACK:

  1. Eigen internal: No need to install other packages. [Default. Recommended for most users.]
  2. External BLAS/LAPACK: Possibly slightly faster single-threaded computation than with Eigen, and the possibility to use multithreaded matrix operations via an external BLAS/LAPACK such as OpenBLAS or Intel MKL (CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE) [Recommended only for advanced users in case of slow computation dominated by big matrix operations; see the make/local sketch below]
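A hedged sketch of what option 2 could look like in make/local; the linker variable and library names are assumptions and depend on the CmdStan version and on which BLAS/LAPACK implementation is installed (OpenBLAS shown here):

```make
# <cmdstan>/make/local -- sketch only; adjust the linker variable and
# library names/paths to your system and CmdStan version.
# Tell Eigen to delegate matrix operations to external BLAS/LAPACK:
CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
# Link against the chosen implementation (assumed: OpenBLAS + LAPACKE):
LDLIBS += -lopenblas -llapacke
```

With OpenBLAS, the number of BLAS threads is typically controlled with the OPENBLAS_NUM_THREADS environment variable.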

EDIT: Minor edit + added this also to the issue Interactive installation · Issue #605 · stan-dev/cmdstanr · GitHub
EDIT2: fixed reduce_sum
