Speedup by using external BLAS/LAPACK with CmdStan and CmdStanR/Py

avehtari · November 25, 2021, 7:21pm

Stan uses Eigen for many matrix computations. Eigen has internal BLAS and LAPACK matrix computation routines. There is opportunity to get some speedup by using external libraries that provide BLAS and LAPACK routines. This speedup can be obtained without any changes in Stan code, but requires recompilation.

EDIT2: I had not realized CXXFLAGS += -march=native -mtune=native are not enabled by default when compiling CmdStan, and most of the speed difference in single thread case is explained by that. I’ve updated the whole post and later reply with new results.

I tested with a few examples and got 0%-60% speedups using OpenBLAS and Intel MKL.

How to enable external BLAS and LAPACK:

check that you have some BLAS and LAPACK installed (linux systems are likely to have BLAS and LAPACK by Netlib installed by default, but in my experiments Netlib BLAS was slower than Eigen internal BLAS)
From several free options, OpenBLAS was the fastest in my experiments
Intel MKL was sometimes slightly faster, but it’s non-free
Add following lines to CmdStan make/local

CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE
LDLIBS += -lblas -llapack -llapacke

compile
at least in some linux systems you can easily change which libraries are used for BLAS and LAPACK (e.g. I compared libraries by Netlib, OpenBLAS, BLIS, ATLAS, and Intel MKL)

sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
sudo update-alternatives --config liblapack.so.3-x86_64-linux-gnu
sudo update-alternatives --config liblapacke.so.3-x86_64-linux-gnu

By default none APT packages for OpenBLAS and MKL didn’t configure alternatives for liblapacle, which meant that MKL didn’t work at all for LAPACK functions, and OpenBLAS probably underperformed. Eventually I installed OpenBLAS from source and configured alternatives for liblapacke, too.
e.g. OpenBLAS and Intel MKL can use more than one thread also for BLAS/LAPACK. You can set the number of threads with environment variables OPENBLAS_NUM_THREADS and MKL_NUM_THREADS. Or if 1 thread is good, you can choose OpenBLAS-serial library.

If you are using cmdstanr, you can modify the file make/local and rebuild CmdStan from R:

cpp_options = list("CXXFLAGS += -march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE", "LDLIBS += -lblas -llapack -llapacke")
cmdstanr::cmdstan_make_local(cpp_options = cpp_options, append = TRUE)
cmdstanr::rebuild_cmdstan(cores = 4)

You can use Intel MKL with the above approach, but might get a bit more speedup using direct calls and vectorized functions, although I didn’t see much difference to OpenBLAS. To get the additional MKL features, add for example the following (where I intentionally chose sequential ie no parallel threads). With this approach you lose the easy way to switch (but you could have different CmdStan versions in different directories).

CXXFLAGS += -DEIGEN_USE_MKL_ALL -I"/usr/include/mkl"
LDLIBS += -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

Note: Intel MKL is a proprietary software and it is the responsibility of users to buy or register for community (free) Intel MKL licenses for their products. Moreover, the license of the user product has to allow linking to proprietary software that excludes any unmodified versions of the GPL.

More Linux (Ubuntu) specific information: I installed using Synaptic package manager (a graphical APT interface)

liblapacke-dev, liblapacke (lapacke required by Eigen to support external LAPACK)
libopenblas-dev, libopenblas-pthread-dev, libopenblas-serial-dev, libopenblas0, libopenblas0-pthread, libopenblas0-serial (I installed both pthread parallel and serial versions to be able to compare and as in some copmarison serial was faster than phtread with one thread)
intel-mkl, libmkl-dev, libmkl-threading-dev, libmk-sequential and everything synaptic recommended (20+ packages)
I’ll add later details on what I had to do, to get liblapacake part to work better.

*-dev packages are a bit misleadingly named as it sounds like you would need if you develop those packages or some other packages, but they are need always when compiling a program that calls the corresponding library, and as Stan models are compiled we need them, too.

I also installed BLIS and ATLAS, but they did perform worse than OpenBLAS.

More information

If you try this, please report results here.

EDIT1: added information about which APT packages I installed to get this working

ma-riviere · November 26, 2021, 11:20am

Thanks a lot for this tutorial !

I’ve tried it with OpenBLAS (through CmdStanR), but I’m having some issues:

Compilation fails due to not finding -llapacke (error: /bin/ld: cannot find -llapacke)
If I remove this argument, it compiles successfully and my test model runs, but very slowly: I get ~4x slower sampling without within chain parallelization, and ~6x slower with parallelization.

Some potentially relevant info:

CmdStan 2.28.2
I’m using WSL2 on W11 (Ubuntu 20.04.3 LTS - GNU/Linux 5.10.60.1-microsoft-standard-WSL2 x86_64)
CPU is Ryzen 5950x
Other cpp_options I use are: list(STAN_THREADS = TRUE, PRECOMPILED_HEADERS = TRUE, STAN_CPP_OPTIMS = TRUE)

Disclaimer: Total noob at BLAS stuff, I have no idea what I’m doing.

Edit:

Model is a simple Bernouilli GLM with 10 rnorm() predictors, and I use brms to generate the stan code and data.
OpenBLAS was installed with sudo apt-get install libopenblas-dev
Both BLAS and LAPACK are set to OpenBLAS (/usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 and /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3)
I have not changed the default OPENBLAS_NUM_THREADS (I have no idea what it’s doing)

avehtari · November 26, 2021, 11:59am

You need to install also liblapacke (note the e in the end). It’s also available via apt-get. lapacke provides interface between Eigen and lapack libraries.

For brms it seems that it’s best to set OPENBLAS_NUM_THREADS=1. I’ve also installed openblas-serial, which always use just one thread. By default OPENBLAS_NUM_THREADS is 4 or 8 (I can’t remember) and if in addition you run 4 chains parallel, you get 16-32 threads which leads to high inefficiency.

ma-riviere · November 26, 2021, 12:42pm

Ah, my bad, I thought this library was included in OpenBLAS.
I installed it, but I still get the same error at compilation. Maybe I need to provide liblapacke location in a way ?

avehtari · November 26, 2021, 12:56pm

LAPACK support is included, but not LAPACKE which is needed between Eigen and LAPACK. I will clarify this in the original post later today.

In my Ubuntu, the package when to the usual place, and I didn’t need to provide anything else. Without seeing what you included in make/local, and the specific error, it’s difficult to help more. Check that you have the correct number of l’s.

ma-riviere · November 26, 2021, 1:13pm

If it helps, here are a few screenshots of the cpp_options I use, plus the results of the “checks” you mentioned in your post.

ma-riviere · November 26, 2021, 1:28pm

Ah, I found the problem: I also needed to install liblapacke-dev (and potentially libopenblas-base, but I suspect the first one was the culprit). Thanks again for your help (I’m pretty new to Linux overall).

Gonna try to see if model runtime is better now.

avehtari · November 26, 2021, 1:35pm

I’ve used Linux for a long time, but helps only a little when there are so many different packages that may need to be installed. I’ll add more details what I installed.

mitzimorris · November 26, 2021, 5:51pm

please tell more. which models gave 40% speedups?

where do we need 40% speedups the most?

avehtari · November 26, 2021, 6:59pm

I should have been doing something else, but here goes

1. Introduction

different BLAS / LAPACK libraries introduce another non-reproducibility issue, as the computations are not bit by bit the same
To get MKL to work with, e.g. Cholesky, liblapacke needs to be replaced with MKL. Instead of at linking time selecting which library to use, Linux systems allow alternatives that can easily change which library is used to present common libraries. To get MKL and OpenBLAS to work better you need to set them as alternatives for liblapacke, which needs a bit of work. Or you can replace -llapacke with -openblas or -lmkl_rt at linking time.
to get a significant speed improvent, there needs to be big matrix operations, e.g. matrix-matrix-products (handled by BLAS) or Cholesky (or QR, SVD, etc, handled by LAPACK)
it is likely that the profiler can be used to find the cases where big matrix operations dominate the target computation
these are easy to test as this doesn’t require changes to the code (like sum_reduce parallelization requires)
I tested with several models, but nothing really slow that would take hours or days, but these already give a good idea when this is likely to be useful and when not.

2. normal linear regression with N=1e4, p=100

2.1. Stan code with `y ~ normal(alpha + x*beta, sigma);`

Eigen without march/mtune: 24s
Eigen: 20s
OpenBLAS: 17s
OpenBLAS w. 2 threads: 12s
MKL: 20s
MKL w. 2 threads: 16s

Small 15% time reduction using OpenBLAS, and with two threads 40% improvement. MKL doesn’t provide any time reduction and only 20% time reduction with two threads.

2.2. Stan code with `y ~ normal_id_glm(x, alpha, beta, sigma);`

Eigen without march/mtune: 20s
Eigen: 13s
OpenBLAS: 15s
OpenBLAS w. 2 threads: 7s
MKL: 12s
MKL w. 2 threads: 7s

OpenBLAS is slower than Eigen, but with two threads provides 46% time
reduction. MKL provides small 8% time reduction, but with two threads
provides 46% time reduction. Using parallel threads doesn’t need any
sum_reduce or any change in the code. The matrix is quite big 10000 x
100.

The great 2 thread performance was not observed with y ~ normal(alpha + x*beta, sigma);, so it seems that the best benefits will be got when the autodiff is not involved (what will happen with varmat?)

2.3. Stan code generated with `brm(y ~ ., ...)`

Eigen without march/mtune: 26s
Eigen: 14s
Eigen + sum_reduce w. 2 threads: 16s
OpenBLAS: 15s
OpenBLAS w. 2 threads: 6s
OpenBLAS + sum_reduce w. 2 threads: 12s
MKL: 16s
MKL w. 2 threads: 7s

brms speed is on par with hand coded code. Running with more sum_reduce threads deosn’t help, while using more threads in BLAS provides time reduction 50-60% time reduction.

3. GP with covariance matrix N=266

3.1. Stan code with `L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);`

Eigen without march/mtune: 95s
Eigen: 53s
OpenBLAS: 58s
OpenBLAS w. 2 threads: 59s
MKL: 51s
MKL w. 2 threads: 45s
MKL w. 3 threads: 43s

OpenBLAS provides 9% time increase. MKL provides negligible time reduction, unless more threads are used to get up to 19% time reduction.

3.2. Stan code with `yn ~ multi_normal(zeros, K_f);`

Eigen without march/mtune: 110s
Eigen: 59s
OpenBLAS: 56s
OpenBLAS w. 2 threads: 50s
MKL : 48s
MKL w. 2 threads: 44s
MKL w. 3 threads: 40s

First it is notable that when using Eigen, yn ~ multi_normal(zeros, K_f); is 11% slower than L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

OpenBLAS provides 5% time reduction, MKL provides 19% time reduction. With more threads we can get up to 32% time reduction. The matrix is still quite small, and using more threads gives only a small additional time reduction.

A bigger time reduction was obtained when using a compound function (multi_normal vs cholesky_decompose + multi_normal_cholesky).

4. GP with covariance matrix N=2128

This is using BFGS which makes only 23 target evaluations to converge

4.1. Stan code with `L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);`

Eigen 19s
OpenBLAS 20s
OpenBLAS w. 2 threads 15s
OpenBLAS w. 3 threads 13s
MKL 19s
MKL w. 2 threads 14s
MKL w. 3 threads 12s

No difference between Eigen, OpenBLAS and MKL, but OpenBLAS and MKL
can provide 21-37% time reduction when using more threads.

4.2. Stan code with `yn ~ multi_normal(zeros, K_f);`

Eigen 32s
OpenBLAS 30s
OpenBLAS w. 2 threads 21s
OpenBLAS w. 3 threads 18s
MKL 30s
MKL w. 2 threads 21s
MKL w. 3 threads 18s

5. brms hierarchical heteroscedastic linear regression with N=11245, p=1, grouping x | ID

This brms model was suggested by Donald Williams

bf(rt ~ congruency + (congruency | x | ID), sigma ~ congruency + (congruency | x | ID))

Eigen without march/mtune: 260s
Eigen: 250s
Eigen + sum_reduce w. 2 threads: 123s
OpenBLAS: 236s
OpenBLAS w. 2 threads: 234s
OpenBLAS + sum_reduce w. 2 threads: 120s
OpenBLAS w. 2 threads + sum_reduce w. 2 threads 125s
MKL: 237s
MKL w. 2 threads : 234s
MKL + sum_reduce w. 2 threads: 124s

OpenBLAS and MKL provides 5% time reduction. Additional threads for OpenBLAS or MKL did not help. sum_reduce gives 50% improvement. This hierarchical model has only a vector (length 11245) times scalar product and couple for loops from 1 to N. It is natural that external BLAS/LAPACK don’t help, but sum_reduce works well.

6. Conclusion

Models with big matrix-vector or matrix-matrix products, or matrix decompositions can get significant time reductions. With one thread, -march=native -mtune=native should be always used when compiling CmdStan and models. For single thread, small time-reductions can be obtained with external BLAS/LAPACK, but for most people it’s not worth the trouble of installing and configuring. For those who have big matrix operations, the external packages can provide speedupds at least up to 60% by using more threads (probably more with better CPUs than what I have in my laptop). Eigen documentation also mentions possibility of multithreading within Eigen, but it’s with openmp, and I couldn’t get it working (while OpenBLAS and MKL support pthreads that worked without any additional tricks). Bigger time reductions are observed with compound functions, which hints that varmat could also get bigger time reductions than matvar.

EDIT1: typo fixes
EDIT2: added info what MKL is missing
EDIT3: updated the results with -march=native -mtune=native

ajnafa · November 28, 2021, 5:32am

Well, this explains why I’ve been seeing speed gains of around 30% under Linux using OpenBLAS compared to run time under Windows on the same system. I had assumed the discrepancy had something to do with more the fact that support for parallel computation in R under Windows is generally awful but I guess there’s a bit more to it than that.

andre.pfeuffer · November 28, 2021, 12:31pm

How can we verify that Eigen C++ uses at least AVX256 instruction set?

FAQ Eigen vectorization

On the x86-64 architecture, SSE2 is generally enabled by default, but you can enable AVX and FMA for better performance

My make/local

CXXFLAGS+= -O3 -march=native -mtune=native

@avehtari What was the content of your make/local file when you did the tests with Eigen?

rfc · November 28, 2021, 6:48pm

Maybe the question to ask is “how do I sanity check that my Stan model compiled on my machine with my compiler and my version of Eigen, Stan, BLAS, etc, has access to AVX instructions”?

One way to check could be to disassemble your compiled model program, and then search the assembly code to see if you can find any AVX instructions. Suppose your compiled model program is named mymodel. On linux you could run objdump -D mymodel > mymodel.s to write a file model.s containing the assembly code, then you could search the assembly code in model.s for AVX instructions.

Andriy Makukha on stack overflow kindly offers this huge awk one-liner that can be used to search:

Find most common AVX instructions (including scalar, including AVX2, AVX-512 family and some FMA like vfmadd132pd ):

awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' mymodel.s

– source: c++ - How to check if compiled code uses SSE and AVX instructions? - Stack Overflow

However, if you are really unlucky, there is also the possibility that your model program contains AVX instructions but the code doesn’t spend much or any time executing those instructions at runtime.

You could also measure to see what instructions in your program actually execute at runtime, especially inside hot loops. One way to confirm that the AVX instructions are actually executing on linux is using the perf profiling tool. You could do something like:

perf record ./mymodel put your usual args here
# wait for mymodel to finish running
perf report

Using perf to analyse what the CPU is doing is invasive and requires elevated permissions for security reasons. On ubuntu I have to run the following commands to temporarily grant my user elevated permissions before I am allowed to run perf:

sudo sysctl -w kernel.perf_event_paranoid=-1
sudo sysctl -w kernel.kptr_restrict=0

Once profiling information has been collected, perf report will show a summary of how much time is spent inside each function. You can zoom in to view each function call and perf will show a listing of assembly code (also annotated with the names of functions from the C / C++ code if your program was compiled with -g debugging information) and the time spent executing each assembly instruction along the margin. By default when you zoom in to view assembly code of a function, perf will focus the viewport on the section of assembly code where the most time is spent. Then you could see which operations are executing and if any of them match up with the huge list of AVX instructions above.

More generally, see Brendan Gregg’s reference: Linux perf Examples

avehtari · November 29, 2021, 10:30am

Empty. I don’t remember it reported that any CXXFLAGS would give 30-50% time reduction, so I didn’t try to optimize that part.

I made the tests with Intel i7-10510U, which seems to have AVX2 (based on Intel website).

avehtari · November 29, 2021, 2:23pm

Thanks for reminding about these. I again forgot that march=native and mtune=native are not on by default, and they have a huge effect. The difference is now much smaller between Eigen internal and external BLAS/LAPACK. I’ll add the new results later. I think cmdstanr installer should ask if someone doesn’t want to use them, and by default use them as the speed difference is huge.

wds15 · November 29, 2021, 3:32pm

I’d suggest STAN_CPP_OPTIMS=true if you want to turn on most magic which @stevebronder once figured out to work ok. I think that includes the m* stuff.

ma-riviere · November 29, 2021, 3:54pm

Quick question: I see people adding those options in CmdStanR either as R code (i.e. list(STAN_CPP_OPTIMS = TRUE) ) or as strings with the cmake syntax (i.e. list("STAN_CPP_OPTIMS=true") ). Do both work ?

avehtari · November 29, 2021, 7:09pm

Where is this documented and how much speedup? I couldn’t find it in CmdStan User Guide’s chapter on Installation 1 CmdStan Installation | CmdStan User’s Guide
I looked in Makefile, and march ad mtune are not mentioned there. Most of the options mentioned are included in -O2, or gcc manual mentions that they may also make the program slower.
why it is CPP_? and not CXX_?
Is STAN_CPP_OPTIMS in different architectures?
It seems STAN_CPP_OPTIMS fails on Apple Silicon -fwhole-program-vtables in STAN_CPP_OPTIMS breaks linkage on Apple Silicon · Issue #1041 · stan-dev/cmdstan · GitHub
Anyway, we often tell people to install Stan using CmdStanR, and then those people don’t read any other installation guide, so suggestion to use any additonal options would be good. I think the march/mtune, which are enabling use of modern features of CPUs, are more important than the compilation optimization tweaks (and most useful optimizations are included in -O2). We’re talking here 30-50% reductions in sampling time for enabling march/mtune.

EDIT: fixed compile time → sampling time

wds15 · November 29, 2021, 7:26pm

Should not matter. Either way works…just check the make/local file.

wds15 · November 29, 2021, 7:30pm

It‘s key to get people going…and to start with people solve smaller problems where this stuff does not matter.

Optimizations have many pitfalls…e.g. the mtune/march stuff is not a good idea on windows. We barely get the bulk of users running smoothly, optimizations are - to me - secondary.

Secondary in the sense that once you are hooked into stan, then you will find the optimizations…after all I drove reduce_sum…

Topic		Replies	Views
Stan compile with Apple Accelerate BLAS / LAPACK? General compiler	4	1244	January 19, 2024
Compiling Stan against Intel MKL General	9	2048	October 16, 2017
Compiling Stan with openBLAS (or similar) through Rstan RStan	3	1049	October 19, 2017
CmdStan: CPU faster than GPU? General cmdstan	6	2272	February 18, 2021
Rebuilding CmdStan with OpenCL on MacOS Interfaces cmdstan , cmdstanr	6	700	November 1, 2020

Speedup by using external BLAS/LAPACK with CmdStan and CmdStanR/Py

1. Introduction

2. normal linear regression with N=1e4, p=100

2.1. Stan code with y ~ normal(alpha + x*beta, sigma);

2.2. Stan code with y ~ normal_id_glm(x, alpha, beta, sigma);

2.3. Stan code generated with brm(y ~ ., ...)

3. GP with covariance matrix N=266

3.1. Stan code with L_f = cholesky_decompose(K_f); yn ~ multi_normal_cholesky(zeros, L_f);

3.2. Stan code with yn ~ multi_normal(zeros, K_f);