Speedup by using external BLAS/LAPACK with CmdStan and CmdStanR/Py

Hi @Bob_Carpenter !

OK, you are hooked on Stan for real. I just looked into this and we can certainly improve things. The current STAN_CPP_OPTIMS flag is somewhat documented when you run make help in the cmdstan directory. For me it turns on flags that break linking of the basic Bernoulli program, so I would not recommend it on macOS for the time being. Here is what I’d recommend instead (I am on macOS here):

library(cmdstanr)


## brute force version which isn't entirely clean:
cmdstan_make_local(cpp_options=list(STAN_THREADS=TRUE,
                                    ##STAN_CPP_OPTIMS=FALSE, ## do not define this option at all: at the moment it takes effect whenever it is defined, whatever its value. That's a non-feature needing a fix.
                                    STAN_NO_RANGE_CHECKS=TRUE,
                                    ## these optim variables are set by STAN_CPP_OPTIMS,
                                    ## but these are not working with my current Xcode from
                                    ## Jan 5th 2022
                                    ##CXXFLAGS_OPTIM=""
                                    ##CXXFLAGS_OPTIM_TBB="",
                                    ##CXXFLAGS_OPTIM_SUNDIALS=""
                                    ## brute-force make the tune stuff part of the compiler being called
                                    CXX="clang++ -mtune=native -march=native",
                                    CC="clang -mtune=native -march=native"
                                    ),
                   append=FALSE)

## cleaner version (use this instead of the brute force call above; append=FALSE overwrites make/local):
cmdstan_make_local(cpp_options=list(STAN_THREADS=TRUE,
                                    ##STAN_CPP_OPTIMS=FALSE, ## coupling this with the mtune/march fails for me
                                    STAN_NO_RANGE_CHECKS=TRUE,
                                    CXXFLAGS_OPTIM="-mtune=native -march=native",
                                    CXXFLAGS_OPTIM_TBB="-mtune=native -march=native",
                                    CXXFLAGS_OPTIM_SUNDIALS="-mtune=native -march=native"
                                    ),
                   append=FALSE)

rebuild_cmdstan(cores=4)

file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")

## let's check we get what we wanted (watch for the tune and arch settings)
mod <- cmdstan_model(file, force_recompile = TRUE, quiet=FALSE)

# names correspond to the data block in the Stan program
data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))

fit <- mod$sample(
  data = data_list,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 500
)

Ping @stevebronder: the macOS settings for the optims need an update, as they break things right now. We also need to ensure that the optimisations are only turned on when STAN_CPP_OPTIMS is set to true. Right now they kick in whenever that variable is defined.


Unfortunately, both the CmdStan and CmdStanR documentation lack information about the default Makefile flags and about potentially useful additional ones. There is also an issue for CmdStanR proposing an interactive way to set them when installation is run in R's interactive mode.

Linking to Intel MKL can be done using -qmkl with the latest versions of the Intel compilers. I’d also add -xHost to the Intel compiler flags. If the model is well defined and not sensitive to the precision of division/sqrt, I’d add -fp-model fast=2 -no-prec-div -no-prec-sqrt. In general, I’d expect better performance with Intel compilers + MKL on some Intel CPUs. That’s my current setup; additionally, I use the latest version of oneTBB, as instructed in the Math README.


Hello, I have installed the OpenBLAS library and set R to use it.

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20.2

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-serial/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-serial/liblapack.so.3

This is how I (re)build cmdstan:

cpp_options = list("STAN_CPP_OPTIMS=true")
cmdstanr::cmdstan_make_local(cpp_options = cpp_options, append = TRUE)
cmdstanr::rebuild_cmdstan(cores = 4)

Should I do anything differently or is this build quite optimized? Thanks.


Sorry for missing this. I would use

cpp_options = list("CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE", "LDLIBS += -lblas -llapack -llapacke")

-march=native -mtune=native give a big speedup even without OpenBLAS. Using OpenBLAS requires the other parts (otherwise you just keep using Eigen’s internal BLAS routines). OpenBLAS has a clear benefit over Eigen’s internal BLAS only with big matrix operations and additional threads. One way to check that OpenBLAS is really used is to change the environment variable controlling how many threads OpenBLAS uses and see whether that really changes the number of threads running (using a process monitor like top).
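The thread-count check described above could be scripted roughly as follows. This is only a minimal sketch: `with_openblas_threads()` is a hypothetical helper (not part of cmdstanr), and it assumes the compiled model is launched from the same R session so that it inherits the environment variable.

```r
# Hypothetical helper: evaluate `expr` with OPENBLAS_NUM_THREADS set to `n`,
# then restore whatever value (or absence) was there before.
with_openblas_threads <- function(n, expr) {
  old <- Sys.getenv("OPENBLAS_NUM_THREADS", unset = NA)
  Sys.setenv(OPENBLAS_NUM_THREADS = as.character(n))
  on.exit(
    if (is.na(old)) {
      Sys.unsetenv("OPENBLAS_NUM_THREADS")
    } else {
      Sys.setenv(OPENBLAS_NUM_THREADS = old)
    },
    add = TRUE
  )
  force(expr)  # lazily evaluated, so it runs after the variable is set
}

# Quick self-check that the variable is visible inside the call.
seen <- with_openblas_threads(2, Sys.getenv("OPENBLAS_NUM_THREADS"))
print(seen)
```

With a compiled model `mod`, one might run `with_openblas_threads(1, mod$sample(data = data_list))` and then the same call with 4 threads while watching top. If the number of threads per chain does not change, OpenBLAS is probably not being used (or a serial OpenBLAS build is installed).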


Post withdrawn, decided to install Linux.


@avehtari

Could you tell me how to set OPENBLAS_NUM_THREADS=1? Is this variable set via cpp_options = list("OPENBLAS_NUM_THREADS=1", ...) from the R side using cmdstanr’s functions, or did you set it from the OS command line? Or is setting OPENBLAS_NUM_THREADS=1 unnecessary when I install libopenblas-serial-dev and libopenblas0-serial, which process data sequentially?

I used

Sys.setenv(OPENBLAS_NUM_THREADS = "1")

Yes


@avehtari

Thank you for your information!

One way to check that OpenBLAS is really used is to change the environment variable controlling how many threads OpenBLAS uses and see whether that really changes the number of threads running (using a process monitor like top)

Is this the only/main way to determine which BLAS cmdstan was compiled with? Is there any equivalent of R’s sessionInfo() that will dump the flags that Stan was compiled with (and ideally state explicitly which BLAS it’s using)? I found that from the cmdstan directory I can run make help-dev or make compile_info. I’m assuming these are the flags that would be used if I compiled now, not the ones the current binary was built with (although in my case these should be the same).
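One additional way to inspect an already compiled model, on Linux only, is to look at the shared libraries it links against. The sketch below is an assumption-laden illustration, not an official CmdStan feature: `linked_blas()` and `check_blas_linkage()` are hypothetical helpers, and the linker may omit a BLAS library entirely if the model uses none of its routines or if the library was linked statically.

```r
# Hypothetical helper: keep only the ldd output lines that mention
# BLAS, LAPACK or MKL.
linked_blas <- function(ldd_lines) {
  grep("blas|lapack|mkl", ldd_lines, ignore.case = TRUE, value = TRUE)
}

# Hypothetical helper: run ldd on an executable (Linux only) and
# filter its output with linked_blas().
check_blas_linkage <- function(exe_path) {
  stopifnot(file.exists(exe_path))
  linked_blas(system2("ldd", shQuote(exe_path), stdout = TRUE))
}
```

For a cmdstanr model object `mod`, this would be called as `check_blas_linkage(mod$exe_file())`. Separately, calling `cmdstanr::cmdstan_make_local()` with no `cpp_options` returns the current contents of make/local without changing it, which shows the flags that will be used for the next build.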

When I add the suggested C++ flags, I get this message when I run models with brms with the cmdstanr backend on Windows (latest R, latest RTools compilers, cmdstanr 0.8.1 and cmdstan 2.36.0):

Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file

Is this normal? The model looks like this:

mlogit <- brms::brm(
  y ~ 1 + x + (1 + x | ID), family = "bernoulli",
  data = d, seed = 1234,
  silent = 2, refresh = 0,
  chains = 4L, cores = 4L, backend = "cmdstanr")

I followed the code for cmdstanr to add the flags

cpp_options = list("CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE", "LDLIBS += -lblas -llapack -llapacke")
cmdstanr::cmdstan_make_local(cpp_options = cpp_options, append = TRUE)
cmdstanr::rebuild_cmdstan(cores = 4)

Incidentally I get no such message when running the same code on Ubuntu.

We generally recommend against using -march=native on Windows since it can cause models to crash. It’s a C++ optimisation flag unrelated to the external BLAS linking, so you can safely omit it.
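Following that advice, the flags from earlier in the thread could be trimmed as sketched below. This assumes BLAS, LAPACK and LAPACKE libraries are installed and visible to the RTools linker, which the thread does not confirm. Dropping -mtune=native together with -march=native is a conservative choice made here, since the reply only names -march=native.

```r
# External BLAS/LAPACK flags from the thread, without the
# -march=native / -mtune=native optimisation flags.
cpp_options <- list(
  "CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE",
  "LDLIBS += -lblas -llapack -llapacke"
)
```

This list would be passed to `cmdstanr::cmdstan_make_local(cpp_options = cpp_options, append = FALSE)` and followed by `cmdstanr::rebuild_cmdstan(cores = 4)`. Using `append = FALSE` matters if an earlier call already wrote -march=native into make/local, because `append = TRUE` would keep the old line.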
