Speeding up RStan

Hi @bgoodri !

I followed your suggestions to look into RcppParallel for the purpose of linking against the Intel TBB. That seems to work great, and we should probably start using the TBB malloc replacement right away to speed up Stan programs. On macOS I have observed really nice speedups from linking in the libtbbmalloc_proxy library, which is distributed with RcppParallel. Here are the timings under R, using 4 cores, for the warfarin example I used for StanCon:

rstan 2.18.2:
   user  system elapsed
473.629   6.424 138.410


rstan 2.18.2 with tbbmalloc_proxy from RcppParallel:
   user  system elapsed
324.721   5.540  96.020
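
To put the two timings side by side, here is a minimal R sketch that computes the speedup ratio and the relative time saved. The numbers are copied from the two runs above; the variable names are only illustrative.

```r
# Timings copied from the two runs above (seconds).
baseline <- c(user = 473.629, system = 6.424, elapsed = 138.410)
with_tbb <- c(user = 324.721, system = 5.540, elapsed = 96.020)

# Ratio > 1 means the run with tbbmalloc_proxy was faster.
speedup <- baseline / with_tbb

# Fraction of the baseline time that was saved.
reduction <- 1 - with_tbb / baseline

print(round(speedup, 2))
print(round(100 * reduction, 1))
```

Note that this is a single run per configuration, so the ratios say nothing about run-to-run variation.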

So this is a decent speedup, and the only thing I changed was loading the respective library. Right now the setup is not quite straightforward. What I had to do was along these lines:

In ~/.R/Makevars:
LDFLAGS += /Users/weberse2/R/2019-03-20-transient/RcppParallel/lib/libtbbmalloc.dylib /Users/weberse2/R/2019-03-20-transient/RcppParallel/lib/libtbbmalloc_proxy.dylib

In the R sources:

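# Locate the TBB allocator libraries shipped with RcppParallel (macOS file names)
tbbmalloc_proxy  <- system.file("lib/libtbbmalloc_proxy.dylib", package="RcppParallel")
tbbmalloc  <- system.file("lib/libtbbmalloc.dylib", package="RcppParallel")
tbblib  <- system.file("lib/", package="RcppParallel")

# Show DYLD_LIBRARY_PATH before and after pointing it at RcppParallel's lib directory
Sys.getenv("DYLD_LIBRARY_PATH")
Sys.setenv(DYLD_LIBRARY_PATH=tbblib)
Sys.getenv("DYLD_LIBRARY_PATH")

# Load both libraries into the running R session
dyn.load(tbbmalloc_proxy)
dyn.load(tbbmalloc)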

pd_model_par_tbb  <- stan_model("warfarin_pd_tlagMax_2par_generated_218_tbb.stan", verbose=TRUE)

There is probably a better way to do this that is in line with R conventions… but I do not know these conventions and just wanted to make it work; and voilà, elapsed time drops by about 30% (138.4 s to 96.0 s, roughly 1.44x faster) on my single run here. I haven’t seen these speedups on Linux and I don’t know about Windows… and possibly this model is benefiting a lot more than others.
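
As a rough idea of how the R side could avoid hardcoded `.dylib` names, here is a minimal sketch. The helper names `find_lib_in_dir`, `find_tbb_lib` and `load_tbb_malloc` are hypothetical and not part of RStan or RcppParallel, and the sketch assumes the libraries sit somewhere under the package's `lib` directory, which may differ between platforms and RcppParallel versions. It only covers loading at run time, not the Makevars or DYLD_LIBRARY_PATH steps.

```r
# Hypothetical helper: search a directory for a shared library by base name,
# without assuming a platform-specific extension (.dylib, .so, .dll).
find_lib_in_dir <- function(lib_dir, name) {
  if (!nzchar(lib_dir) || !dir.exists(lib_dir)) {
    return(character(0))  # nothing to search
  }
  # The trailing "\\." keeps "tbbmalloc" from also matching "tbbmalloc_proxy".
  pattern <- paste0("^(lib)?", name, "\\.")
  list.files(lib_dir, pattern = pattern, recursive = TRUE, full.names = TRUE)
}

# Hypothetical helper: look inside an installed package's lib directory.
# system.file() returns "" if the package or directory does not exist.
find_tbb_lib <- function(name, package = "RcppParallel") {
  find_lib_in_dir(system.file("lib", package = package), name)
}

# Hypothetical helper: load both libraries if they can be found,
# in the same order as in the post above; otherwise do nothing.
load_tbb_malloc <- function() {
  proxy <- find_tbb_lib("tbbmalloc_proxy")
  malloc <- find_tbb_lib("tbbmalloc")
  if (length(proxy) == 0 || length(malloc) == 0) {
    message("TBB malloc libraries not found; keeping the default allocator.")
    return(invisible(FALSE))
  }
  dyn.load(proxy[1])
  dyn.load(malloc[1])
  invisible(TRUE)
}
```

One would call `load_tbb_malloc()` once before `stan_model(...)`. Whether this yields any speedup on a given platform is untested here.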

So how about we make this available as an easy-to-use option in RStan?

Best,
Sebastian

I’ll look into it. It is probably similar to how we link to StanHeaders to access the SUNDIALS shared object:

Hi @bgoodri !

Have a look at the very bottom of the PR link below. There you can see that we get some neat speedups from linking in the tbbmalloc_proxy library: 6% on average and up to 18% for some models. Again, this is for free, since the only thing that changed is linking against the scalable memory allocator from the TBB.

I do love free.

Oh… it’s better than that - free and good!