Speeding up RStan

Hi @bgoodri !

I followed your suggestion to look into RcppParallel for linking against the Intel TBB. That seems to work great, and we should probably start using the TBB malloc replacement right away to speed up Stan programs. On macOS I have observed really nice speedups from linking in the libtbbmalloc_proxy library, which is distributed with RcppParallel. Here are the timings from R on 4 cores for the warfarin example I used for StanCon:

rstan 2.18.2:
   user  system elapsed
473.629   6.424 138.410

rstan 2.18.2 with tbbmalloc_proxy from RcppParallel:
   user  system elapsed
324.721   5.540  96.020
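
To put a number on the difference, here is a small sketch that computes the speedup from the elapsed times above. The variable names are just for illustration; the two values are copied from the timings.

```r
# Elapsed (wall-clock) seconds copied from the two timings above.
elapsed_base <- 138.410  # rstan 2.18.2
elapsed_tbb  <- 96.020   # rstan 2.18.2 with tbbmalloc_proxy

# Speedup factor: how many times faster the tbbmalloc_proxy run finished.
speedup <- elapsed_base / elapsed_tbb

# Fraction of wall time saved relative to the baseline run.
reduction <- 1 - elapsed_tbb / elapsed_base

cat(sprintf("speedup: %.2fx, wall time saved: %.1f%%\n",
            speedup, 100 * reduction))
```

Note that the speedup factor and the percentage of time saved are different quantities, so it helps to say which one is being quoted.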

So this is a decent speedup, and the only thing I changed was loading the respective library. Right now the setup is not entirely straightforward. What I had to do is along these lines:

In ~/.R/Makevars:
LDFLAGS += /Users/weberse2/R/2019-03-20-transient/RcppParallel/lib/libtbbmalloc.dylib /Users/weberse2/R/2019-03-20-transient/RcppParallel/lib/libtbbmalloc_proxy.dylib

In the R sources:

# Locate the TBB allocator libraries shipped with RcppParallel
tbbmalloc_proxy <- system.file("lib/libtbbmalloc_proxy.dylib", package = "RcppParallel")
tbbmalloc <- system.file("lib/libtbbmalloc.dylib", package = "RcppParallel")
tbblib <- system.file("lib/", package = "RcppParallel")



pd_model_par_tbb  <- stan_model("warfarin_pd_tlagMax_2par_generated_218_tbb.stan", verbose=TRUE)
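
The hard-coded paths in the Makevars line could probably be derived from the installed RcppParallel instead. A minimal sketch, assuming the macOS layout shown above (both libraries under lib/ with a .dylib extension); `tbb_link_flags` is a hypothetical helper name, not something RStan or RcppParallel provides:

```r
# Hypothetical helper: build the LDFLAGS line for ~/.R/Makevars from the
# RcppParallel installation rather than typing the absolute paths by hand.
# Assumption: the libraries live in <RcppParallel>/lib and end in ".dylib",
# as on the macOS setup described in this post.
tbb_link_flags <- function(lib_dir = system.file("lib", package = "RcppParallel"),
                           ext = ".dylib") {
  # Same two libraries, in the same order, as in the Makevars line above.
  libs <- file.path(lib_dir, paste0(c("libtbbmalloc", "libtbbmalloc_proxy"), ext))
  paste("LDFLAGS +=", paste(libs, collapse = " "))
}

# Usage (prints the line to paste into ~/.R/Makevars):
# cat(tbb_link_flags(), "\n")
```

The extension is passed explicitly because I have only checked this on macOS; other platforms may name or place the files differently.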

There is probably a better way to do this that is in line with R conventions… but I do not know these conventions and just wanted to make it work. And voilà: elapsed time drops from about 138 s to 96 s in my single run here, roughly a 1.4x speedup. I haven’t seen these speedups on Linux and I don’t know about Windows… and possibly this model benefits a lot more than others.

So how about we make this available as an easy-to-use option in RStan?



I’ll look into it. It is probably similar to how we link to StanHeaders to access the SUNDIALS shared object:


Hi @bgoodri !

Have a look at the very bottom of the PR linked below. There you can see that we get some neat speedups from linking in the tbbmalloc_proxy library: 6% on average and up to 18% for some models. Again, this comes for free, since the only thing that changed is linking against the scalable memory allocator from the TBB.


I do love free.

Oh… it’s better than that - free and good!