Rstan with TBB

Dear Stan community:
It’s been a while since I played with this. I created a stan model using map_rect, and I’d like to run it using several cores via TBB. I haven’t tried this with cmdstan yet, but I would prefer to use rstan for seamless model tuning. Details below. Am I doing something wrong, or is this just not supported yet in rstan?

Operating System: Clear Linux
Interface Version: 2.19.3
Compiler/Toolkit: g++ & clang++

devpkg-tbb is installed. I have also tried putting the tbb/include files into rstan/include, same outcome.

Makevars: (tried both g++ and clang++)

CXX=g++
CXX14FLAGS += -DSTAN_THREADS
CXX14FLAGS += -pthread
CXX14FLAGS += -fPIC

Error when compiling with stan_model in R:

Error in dyn.load(libLFile) : 
  unable to load shared object '/tmp/Rtmp1K845W/file6dc360d72dc.so':
  /tmp/Rtmp1K845W/file6dc360d72dc.so: undefined symbol: _ZTIN3tbb4taskE
Error in sink(type = "output") : invalid connection

Does the psapply example in https://cran.r-project.org/web/packages/StanHeaders/vignettes/stanmath.html work?

No. Lots of warnings, and the same error message (undefined symbol: _ZTIN3tbb4taskE).

It only compiles if I remove the Makevars file or omit the Sys.setenv(PKG_CXXFLAGS = "-DSTAN_THREADS") setting.

Is cmdstanr an option for you? Head over to the Stan GitHub organization, where you can find that R package; it makes running Stan models from R with cmdstan easy.

I’ll take a look.
Still, it would be nice to get it to work in rstan, too.

It does work in rstan, just not for you apparently. Are you using the TBB from the RcppParallel package or from a package manager?

I have tried it two ways:
(1) I downloaded TBB from Intel and copied its include directory into the rstan include directory.
(2) After undoing (1), I used TBB from Clear Linux's package manager.

I forgot that for rstan 2.19.x, map_rect is implemented with just C++11 threads, rather than TBB. So, you don’t need any TBB stuff to get that to work.

For rstan 2.21.x — which is on GitHub but was not accepted by CRAN — you do need the RcppParallel package but you don’t need to be including TBB sources in rstan sources or anything like that.

It ran once. Then I created the make/local file, and it compiled fine.
However, now I have a cmdstanr issue; it throws this:

Error in gsub(addnlpat, "\\1\n", str) : 
  invalid regular expression '(.{1,256})(\s|$)', reason 'Invalid contents of {}'

Which I believe is an unrelated issue with cmdstanr.

So on rstan 2.19.2/3 it should multi-thread out of the box without changing the makevars file?

Also, I should really switch to the development version, I guess. The v8 dependency is nasty, though.

Any chance you could open an issue on the stan-dev/cmdstanr GitHub issue tracker and post a model/data/R script that led to this?

I’d love to, but now it works again. I have no idea what it was. Maybe something in the .stan file, that still let it compile but didn’t let it run? I’ll keep an eye on it.

Is there a way I can look at the cpp file with cmdstanr? I’d like to know if the -DSTAN_THREADS flag worked. I still don’t see the sampler use multiple cores.

You should only need to add -DSTAN_THREADS -pthread to the compilation flags and then, at runtime, set the environment variable STAN_NUM_THREADS.

Well, the C++ version of stanc will go away some day (hopefully soon), but installing the V8 R package should suffice if you have the underlying library installed from the package manager. Having a dependency on the OCaml libraries would be more onerous than JavaScript from R’s perspective.

Try compiling with threads = TRUE in the cmdstanr compile function (https://mc-stan.org/cmdstanr/reference/model-method-compile.html).

You also need to set the number of threads you’re using with set_num_threads (https://mc-stan.org/cmdstanr/reference/stan_threads.html).

Thanks. num_threads was correct, but I recompiled with threads=TRUE. Still, it’s not using the cores well. It could be my model. I want to look at different cuts to create the shards. However, is there another way to verify that it’s using all the cores? When running, it says ‘Running MCMC with 1 chain(s) on 12 core(s)…’, but is that definitely correct?

That sounds right.

@wds15 anything special for map_rect?

Thanks so much. I found one issue in my model file, and now it’s working much faster. It’s still not producing 100% load on 12 cores, but pretty high load on 8 of them. So that’s pretty nice already. Now I can do more tweaking and tuning on my model.

(all using cmdstanr, which is nice btw)

The TBB-based version of map_rect, available since version 2.21, is a lot faster than the previous ones. It is only available with cmdstan at the moment.

It works quite well, and cmdstanr works like a charm. I think I want to try MPI next, since I have a number of cores to keep busy.
I am not sure what to mark as a solution. For me the real solution is to switch to cmdstan, since it actually uses TBB. Strictly speaking, the question was about rstan, so either using the development version or just using threads would be the answer. Either way, thanks everyone.