Threading in rstan 2.18

Is multithreading working in rstan 2.18? I know it is supposed to work with cmdstan, but I couldn’t find any info regarding rstan. I tried to use it with map_rect and didn’t get any error messages, but how do I know it worked?


It can be made to work, but there is nothing in the output to distinguish it from serial execution. You just have to look at the application that monitors CPU usage to see that all the cores are being used even if you only have one chain at a time.

Just to follow up on this, I guess you would still need to add CXXFLAGS += -DSTAN_THREADS, but then how would you invoke rstan::sampling? Does cores there correspond to the number of cores used for between-chain parallelisation or within-chain parallelisation?

I figured it out; you need to do something like this:

fit <- sampling(sm, data=stan_data, seed=42, chains=4, cores=1, iter=10000)

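This would run 4 chains sequentially (cores=1), each with 3 threads. The number of threads per chain is taken from the STAN_NUM_THREADS environment variable, not from the sampling call.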

In the process monitor, note that the threads column does not show the actual number of threads used by map_rect; there is some offset. The %CPU column, however, does indicate it.

Thanks. I also changed the CXXFLAGS, but I didn’t set STAN_NUM_THREADS. I’ll check the usage; I guess I should see something similar on Linux as well.

I think this should be CXX14FLAGS += -DSTAN_THREADS in ~/.R/Makevars and Sys.setenv(STAN_NUM_THREADS = ?).
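A minimal sketch of that setup, assuming the flag is already in ~/.R/Makevars and the model is recompiled afterwards. The thread count of 4 is only an illustrative value, and the commented-out call reuses the sm and stan_data objects from the earlier post:

```r
# Assumes ~/.R/Makevars already contains:  CXX14FLAGS += -DSTAN_THREADS
# Set the thread count before sampling starts; 4 is an illustrative value.
Sys.setenv(STAN_NUM_THREADS = 4)

# Environment variables are stored as strings, so this returns "4".
threads <- Sys.getenv("STAN_NUM_THREADS")
print(threads)

# With rstan installed and a compiled model `sm` (names from the post above):
# fit <- rstan::sampling(sm, data = stan_data, chains = 4, cores = 1)
```

Setting the variable inside the R session only affects processes started from that session, so it has to come before the sampling call.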

Thank you for the discussion on this topic. I’m having issues with using rstan threading on a Linux server (CentOS Linux 7; gcc 8.2).

I have included CXX14FLAGS += -DSTAN_THREADS in my Makevars and Sys.setenv(STAN_NUM_THREADS = 4) in my R code before stan gets called.

However, using “top”, I see that my stan programme containing map_rect is running, but it is only using one core.

My Makevars contains:
CXX14 = g++
CXX14FLAGS += -O3 -march=native -mtune=native

Thank you in advance for your help.

How do you determine that it is only using one core? Threading should reveal itself by a cpu load > 100%, however not necessarily #Threads * 100%…

Thank you for your reply, ermeel.

I was naively expecting that threading meant parallel computing over multiple cores. I saw that the number of active cores equaled the number of chains, and thought to myself that threading had not been successfully implemented. Would you say that I was wrong about this interpretation? If so, are there ways to run parallel on multiple cores as well?

Thank you very much again,

What is the CPU load ( “% CPU” above) for each of the chains running in parallel [I guess you set this via the `cores=` argument directly or globally via `options(mc.cores = )`]?

Note that there are two levels of parallelisation. For example, you could have four chains running in parallel (mc.cores=4), with each chain using 4 threads (STAN_NUM_THREADS=4). Provided you have sufficient resources and the model parallelises well enough, I would expect you to see 4 entries in top, each frequently exceeding 100 in the “%CPU” column.
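To make the arithmetic concrete, here is a rough sketch of how the two settings multiply. The values 4 and 4 are the illustrative numbers from this post, and the variable names are made up for the example:

```r
# Two levels of parallelism: chains run as separate processes (cores),
# and each chain may additionally use several threads (STAN_NUM_THREADS).
cores_for_chains  <- 4   # illustrative: value passed as cores= or mc.cores
threads_per_chain <- 4   # illustrative: value of STAN_NUM_THREADS

# Total number of threads that may be busy at the same time.
total_threads <- cores_for_chains * threads_per_chain

# Upper bound on the %CPU a single chain's process can show in top.
max_cpu_per_chain <- threads_per_chain * 100

# detectCores() counts logical cores and may return NA on some platforms.
available <- parallel::detectCores()
if (!is.na(available) && total_threads > available) {
  message("Requested ", total_threads, " threads but only ", available,
          " cores were detected; expect less than ideal speed-up.")
}
```

In practice each process will usually sit well below that upper bound, since only the map_rect part of the model runs in parallel.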

Something like this (here I set STAN_NUM_THREADS=3 and chains=2):

Maybe @wds15 or @bgoodri can also comment on this.


What you describe sounds all right. With multiple threaded chains you get multiple processes and each will consume more than 100% CPU usage.

Thank you for your detailed reply, emreel.

I can see now that the %CPU is consistently around 200%, and my test suggests that threading ( STAN_NUM_THREADS=10) allows my programme to complete in approx. half the time.

OK, now that I have figured out how to implement map_rect in a real example, I realized it’s not really using more threads. (I’ve tried it on two computers, mine and a server.)
This is the Makevars on my computer:

CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined
CXXFLAGS+=-flto -Wno-unused-local-typedefs
CXXFLAGS += -pthread

This is on the server:

CXX14FLAGS=-O3 -march=native -mtune=native -fPIC
CXX14STD = -std=c++14
CXX14 = g++
CXXFLAGS +=-flto -Wno-unused-local-typedefs
CXXFLAGS += -pthread

I’m attaching some tests. The R file generates fake data and fits it with the versions with and without multithreading from the manual:
map_rect.R (1.0 KB)

This is the version with multi threading
map_rect_exp.stan (883 Bytes)

This is without:
no_map_rect.stan (405 Bytes)

They take more or less the same time on the server (on my computer the map_rect version is actually much slower). I’m using only one chain to see whether that chain uses more than 100% CPU, but that’s not the case according to top and htop. Both systems run Ubuntu and the latest rstan 2.18.

Any suggestions?

There was a small error in the R file, I uploaded it again
map_rect.R (1.0 KB)

But I still can’t multi thread, any pointers will be greatly appreciated!!

For completeness, I just wanted to update that everything worked in cmdstan. So it’s an rstan issue.

R might be grabbing the wrong Makevars file. When you look at the compilation output, you should see a line go by like g++ <blah blah blah> -DSTAN_THREADS <blah blah>. It’s easiest to find if you capture the output to a file and then search for it…
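One rough way to do that search from R is sketched below. has_stan_threads is a hypothetical helper written for this example, and ~/.R/Makevars is assumed to be the file R picks up, which may not hold on every setup:

```r
# Hypothetical helper: TRUE if any non-comment line mentions -DSTAN_THREADS.
has_stan_threads <- function(lines) {
  lines <- lines[!grepl("^[[:space:]]*#", lines)]  # ignore commented-out lines
  any(grepl("-DSTAN_THREADS", lines, fixed = TRUE))
}

# Check the user-level Makevars (assumed location).
makevars <- path.expand("~/.R/Makevars")
if (file.exists(makevars)) {
  print(has_stan_threads(readLines(makevars, warn = FALSE)))
}

# The same check works on a saved compilation log (file name is illustrative):
# has_stan_threads(readLines("compile.log"))
```

The check on the compilation log is the more reliable of the two, since it shows the flags the compiler actually received rather than what one particular Makevars file says.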

Somewhat late, but I think the only thing you were doing wrong is not including:

CXX14FLAGS += -DSTAN_THREADS

It appears that CXXFLAGS does not have the same effect, as I just discovered on my computer.


Hi @bgoodri,

I need some important clarifications on the below question if you have free time.

My Mac has 7 cores. If I want to run 7 chains in parallel, can I do so by having each of those 7 cores run one chain simultaneously?

Also, what is multithreading? Does it mean that the number of threads equals the number of cores? Please help me clear up this basic doubt.

Just specify chains = 7 and cores = 7. There is no need to get into multithreading unless you want to use multiple threads per chain, and that would require that you rewrite your model using the map_rect function.
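As a sketch of where those arguments go: they are passed to rstan::stan (or rstan::sampling) directly. run_seven_chains, model_file and stan_data below are placeholder names for this example, not anything from your code:

```r
# Between-chain parallelism only: one chain per core. No Makevars changes
# and no map_rect are needed for this.
run_seven_chains <- function(model_file, stan_data) {
  rstan::stan(file = model_file, data = stan_data, chains = 7, cores = 7)
}

# Alternatively, set the default once per session and omit cores=:
options(mc.cores = 7)
```

With one chain per core, the wall time should be roughly that of the slowest single chain plus some overhead, assuming the cores are not busy with other work.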

I also made the following change in the Makevars file:

I think this should be CXX14FLAGS += -DSTAN_THREADS in ~/.R/Makevars and Sys.setenv(STAN_NUM_THREADS = ?) .

Is the above not needed for using multiple cores?

Also, where and how should I specify chains = 7 and cores = 7? (If I do this, does it mean I can complete the simulation for all 7 chains in roughly the runtime of a single chain?)