Speedup by using external BLAS/LAPACK with CmdStan and CmdStanR/Py

Thanks a lot for this tutorial!

I’ve tried it with OpenBLAS (through CmdStanR), but I’m having some issues:

  1. Compilation fails because the linker cannot find -llapacke (error: /bin/ld: cannot find -llapacke).
  2. If I remove this argument, it compiles successfully and my test model runs, but very slowly: sampling is ~4x slower without within-chain parallelization, and ~6x slower with it.
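
For the first issue, here is a rough shell sketch for checking whether the linker can see the libraries at all. It assumes an Ubuntu-style system where `ldconfig -p` lists the linker cache; the helper name `linkable` is made up for illustration, and `liblapacke-dev` is my assumption for the package that usually provides liblapacke.so on Ubuntu.

```shell
# linkable NAME: succeed if the linker cache lists an unversioned libNAME.so,
# which is what "-lNAME" looks for at link time (hypothetical helper).
linkable() {
    ldconfig -p 2>/dev/null | grep -q "lib$1\.so "
}

for lib in openblas lapacke; do
    if linkable "$lib"; then
        echo "-l$lib: found"
    else
        echo "-l$lib: not found"
    fi
done

# Assumption: on Ubuntu the missing library usually comes from a separate package:
#   sudo apt-get install liblapacke-dev
```

If `-llapacke` shows up as not found, that would match the linker error above.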

Some potentially relevant info:

  • CmdStan 2.28.2
  • I’m using WSL2 on Windows 11 (Ubuntu 20.04.3 LTS - GNU/Linux 5.10.60.1-microsoft-standard-WSL2 x86_64)
  • CPU is a Ryzen 9 5950X
  • Other cpp_options I use are: list(STAN_THREADS = TRUE, PRECOMPILED_HEADERS = TRUE, STAN_CPP_OPTIMS = TRUE)

Disclaimer: Total noob at BLAS stuff, I have no idea what I’m doing.

Edit:

  • Model is a simple Bernoulli GLM with 10 rnorm() predictors, and I use brms to generate the Stan code and data.
  • OpenBLAS was installed with sudo apt-get install libopenblas-dev
  • Both BLAS and LAPACK are set to OpenBLAS (/usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 and /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3)
  • I have not changed the default OPENBLAS_NUM_THREADS (I have no idea what it does)
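
One thing that could be tried for the slowdown is pinning the OpenBLAS thread count in the shell before starting R. This is only a sketch: `OPENBLAS_NUM_THREADS` is the environment variable OpenBLAS reads, but the value 1 is an assumption to test whether OpenBLAS threads are competing with Stan's own threads (STAN_THREADS), not a confirmed fix.

```shell
# Show the current setting; "unset" means OpenBLAS picks its own default.
echo "OPENBLAS_NUM_THREADS=${OPENBLAS_NUM_THREADS:-unset}"

# Limit OpenBLAS to one thread for everything launched from this shell
# (assumption: this keeps BLAS threads from oversubscribing the CPU).
export OPENBLAS_NUM_THREADS=1
echo "OPENBLAS_NUM_THREADS=${OPENBLAS_NUM_THREADS}"
```

The setting only applies to processes started from the same shell afterwards, so R would need to be launched from there for CmdStanR to pick it up.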