Hi @bgoodri !
I followed your suggestion to look into RcppParallel for the purpose of linking against the Intel TBB. That seems to work great, and we should probably start using the TBB malloc replacement now in order to speed up Stan programs. On macOS I have observed really nice speedups from linking in the libtbbmalloc_proxy library, which is distributed with RcppParallel. Here are the timings under R on 4 cores for the warfarin example I used for StanCon:
rstan 2.18.2:
user system elapsed
473.629 6.424 138.410
rstan 2.18.2 with tbbmalloc_proxy from RcppParallel:
user system elapsed
324.721 5.540 96.020
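To put a number on the difference, the two timings above can be compared directly. This is just a small sketch of the arithmetic; the variable names are made up for illustration and the values are copied from the two runs shown.

```r
# Timings copied from the two runs above (seconds).
baseline <- c(user = 473.629, system = 6.424, elapsed = 138.410)
tbb      <- c(user = 324.721, system = 5.540, elapsed = 96.020)

# A ratio above 1 means the tbbmalloc_proxy run was faster.
speedup <- baseline / tbb

# Fraction of the baseline time that was saved.
saved <- 1 - tbb / baseline

print(round(speedup, 2))
print(round(100 * saved, 1))
```

Keep in mind this is a single run per configuration, so the ratios are only indicative.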
So this is a decent speedup, and the only thing I changed was loading the respective library. Right now the setup is not very straightforward. What I had to do is along these lines:
In ~/.R/Makevars:
LDFLAGS += /Users/weberse2/R/2019-03-20-transient/RcppParallel/lib/libtbbmalloc.dylib /Users/weberse2/R/2019-03-20-transient/RcppParallel/lib/libtbbmalloc_proxy.dylib
In the R sources:
tbbmalloc_proxy <- system.file("lib/libtbbmalloc_proxy.dylib", package="RcppParallel")
tbbmalloc <- system.file("lib/libtbbmalloc.dylib", package="RcppParallel")
tbblib <- system.file("lib/", package="RcppParallel")
Sys.getenv("DYLD_LIBRARY_PATH")
Sys.setenv(DYLD_LIBRARY_PATH=tbblib)
Sys.getenv("DYLD_LIBRARY_PATH")
dyn.load(tbbmalloc_proxy)
dyn.load(tbbmalloc)
pd_model_par_tbb <- stan_model("warfarin_pd_tlagMax_2par_generated_218_tbb.stan", verbose=TRUE)
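The paths above hard-code the macOS `.dylib` extension. One way to avoid that might be to search the RcppParallel `lib` directory for the allocator libraries by name. The sketch below assumes the files contain `tbbmalloc` and `tbbmalloc_proxy` in their names on every platform, which I have only checked on macOS. `find_tbbmalloc` is a hypothetical helper, not part of RStan or RcppParallel.

```r
# Hypothetical helper (not part of RStan or RcppParallel): find the TBB malloc
# libraries in a directory without hard-coding the platform-specific extension.
find_tbbmalloc <- function(lib_dir) {
  files <- list.files(lib_dir, full.names = TRUE, recursive = TRUE)
  base <- basename(files)
  is_proxy <- grepl("tbbmalloc_proxy", base, fixed = TRUE)
  is_malloc <- grepl("tbbmalloc", base, fixed = TRUE) & !is_proxy
  list(proxy = files[is_proxy], malloc = files[is_malloc])
}

# Usage sketch; only runs if RcppParallel is installed.
if (requireNamespace("RcppParallel", quietly = TRUE)) {
  libs <- find_tbbmalloc(system.file("lib", package = "RcppParallel"))
  str(libs)
  # Loading would then follow the same order as above (assumption: the
  # first match is the right file on your platform):
  # if (length(libs$proxy))  dyn.load(libs$proxy[1])
  # if (length(libs$malloc)) dyn.load(libs$malloc[1])
}
```

The `dyn.load` calls are left commented out because a directory may contain several matches (for example versioned or debug builds), and picking the right one probably needs a platform-specific check.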
There is probably a better way to do this that is in line with R conventions, but I do not know these conventions and just wanted to make it work. And voilà: the run is about 1.4x faster (elapsed time down by roughly 30%) on my single run here. I haven’t seen these speedups on Linux, and I don’t know about Windows; possibly this model benefits a lot more than others.
However, how about we make this available as an easy-to-use option in RStan?
Best,
Sebastian