Speedup by using external BLAS/LAPACK with CmdStan and CmdStanR/Py

Related to -march, -mtune, -mcpu

What I found

  • recent gcc on x86 and Arm:
    • -march=X: Tells the compiler that X is the minimal architecture the binary must run on. The compiler is free to use architecture-specific instructions. This option behaves differently on Arm and x86. On Arm, -march does not override -mtune, but on x86 -march does override both -mtune and -mcpu.
    • -mtune=X: Tells the compiler to optimize for microarchitecture X, but does not allow the compiler to change the ABI or make assumptions about available instructions. This option has more or less the same meaning on Arm and x86.
    • -mcpu=X: On Arm, this option is a combination of -march and -mtune. It simultaneously specifies the target architecture and optimizes for a given microarchitecture. On x86, this option is a deprecated synonym for -mtune.
  • clang: starting from version 12.0, -mtune works the same way as in gcc.
  • So based on the documentation of the recent compilers, -mtune=native should not use features that are not available, but…
  • … older compilers may behave differently or may not recognize the specific CPU, in which case -mtune=native may fail while -mtune=generic is likely to work. If only -march=native is defined, it usually implies -mtune=native, except when the compiler doesn’t recognize all the CPU details and falls back to -mtune=generic, which seems to be what happened in those Windows cases where just dropping -mtune=native did help. (A make/local sketch of these flags follows below.)
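To make this concrete, here is a minimal sketch of how these flags would be placed in CmdStan’s make/local file. The file path and the choice of flag are illustrative only; pick at most one variant for your platform and rebuild CmdStan afterwards (e.g. make clean-all followed by make build):

```make
# <cmdstan>/make/local -- illustrative sketch, not a recommended default.
# Uncomment at most one of the variants below.

# Tune for the build machine's CPU without enabling extra instructions
# (very old compilers may not recognize the CPU and need -mtune=generic):
# CXXFLAGS += -mtune=native

# Allow instructions available on the build machine's architecture;
# on x86 this usually implies -mtune=native as well:
# CXXFLAGS += -march=native

# Arm: -mcpu combines architecture selection and tuning in one flag
# (on x86 it is only a deprecated synonym for -mtune):
# CXXFLAGS += -mcpu=native
```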

So it’s likely that there would be fewer problems with newer compilers, but an interactive installation could, for example, ask about the following options:

Optimization:

  1. Safe: Should work with all compilers and CPUs. [Default. Recommended for MINGW on Windows or if 2. doesn’t work]
  2. Fast: 0-100% faster computation using CPU-specific instruction sets, especially for bigger matrix operations, but the compilation may fail for some compiler-OS-CPU combinations. (CXXFLAGS += -march=native) [Recommended if 3. doesn’t work]
  3. Faster: 0-100% faster computation using CPU-specific instruction sets and CPU-specific tuning, especially for bigger matrix operations, but the compilation may fail for some compiler-OS-CPU combinations. (CXXFLAGS += -march=native -mtune=native) [Recommended for GCC on Linux; see the make/local sketch after this list]
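As a sketch, options 2 and 3 would correspond to make/local contents along these lines (the path and rebuild step are assumptions about a standard CmdStan installation):

```make
# <cmdstan>/make/local
# Option 2 ("Fast"):
CXXFLAGS += -march=native

# Option 3 ("Faster") would instead use:
# CXXFLAGS += -march=native -mtune=native
```

After editing make/local, CmdStan needs to be rebuilt, e.g. with make clean-all && make build from the CmdStan directory (CmdStanR also provides rebuild_cmdstan()).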

Threads:

  1. Single thread: If you are not using reduce_sum or …? [Default]
  2. Multithread: If you are using reduce_sum or …? (not needed for external BLAS/LAPACK multithreading; see the make/local line below)
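In make/local terms, the multithreading option corresponds to the standard CmdStan flag below; the number of threads is then chosen at run time, e.g. via the STAN_NUM_THREADS environment variable or the corresponding interface argument in CmdStanR/Py:

```make
# <cmdstan>/make/local
# Enable threading support in the compiled model (needed for reduce_sum etc.)
STAN_THREADS = true
```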

BLAS/LAPACK:

  1. Eigen internal: No need to install other packages. [Default. Recommended for most users.]
  2. External BLAS/LAPACK: Possibly slightly faster single-threaded computation than with Eigen, and the possibility to use multithreaded matrix operations via an external BLAS/LAPACK such as OpenBLAS or Intel MKL (CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE) [Recommended only for advanced users in case of slow computation dominated by big matrix operations; see the make/local sketch below]
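A hedged sketch of what option 2 could look like in make/local; the linker variable and library names are assumptions and depend on the CmdStan version and on which BLAS/LAPACK implementation is installed (OpenBLAS shown here):

```make
# <cmdstan>/make/local -- sketch only; adjust the linker variable and
# library names/paths to your system and CmdStan version.
# Tell Eigen to delegate matrix operations to external BLAS/LAPACK:
CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
# Link against the chosen implementation (assumed: OpenBLAS + LAPACKE):
LDLIBS += -lopenblas -llapacke
```

With OpenBLAS, the number of BLAS threads is typically controlled with the OPENBLAS_NUM_THREADS environment variable.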

EDIT: Minor edit + added this also to the issue Interactive installation · Issue #605 · stan-dev/cmdstanr · GitHub
EDIT2: fixed reduce_sum
