Threading in rstan 2.18


#1

Is multithreading working in rstan 2.18? I know it’s supposed to work with cmdstan, but I couldn’t find any info regarding rstan. I tried to use it with map_rect and didn’t get any error messages, but how do I know it worked?


#2

It can be made to work, but there is nothing in the output to distinguish it from serial execution. You have to watch a CPU monitor to see that all the cores are being used even if you only run one chain at a time.


#3

Just to follow up on this: I guess you would still need to add CXXFLAGS += -DSTAN_THREADS, but then how would you invoke rstan::sampling? Does the cores argument there correspond to the number of cores used for between-chain parallelisation or for within-chain parallelisation?


#4

I figured it out. You need to do something like this:

Sys.setenv(STAN_NUM_THREADS=3)
fit <- sampling(sm, data=stan_data, seed=42, chains=4, cores=1, iter=10000)

This would run 4 chains sequentially, each with 3 threads.

In the process-monitor screenshot, note that the threads column does not show the actual number of threads used by map_rect (there is some offset), but the %CPU column does reflect it.


#5

Thanks, I also changed the CXXFLAGS, but I didn’t change the STAN_NUM_THREADS. I’ll check the usage, I guess I should see something like that in Linux as well.


#6

I think this should be CXX14FLAGS += -DSTAN_THREADS in ~/.R/Makevars and Sys.setenv(STAN_NUM_THREADS = ?).
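For anyone who wants to double-check their setup, here is a minimal sketch in base R that scans a Makevars file for -DSTAN_THREADS on a CXX14FLAGS line. The helper name has_stan_threads_flag is made up for illustration (it is not part of rstan), and it only does a plain text match, so it does not evaluate make variables or included files.

```r
# Hypothetical helper (not part of rstan): TRUE if some uncommented
# CXX14FLAGS line in the given Makevars file mentions -DSTAN_THREADS.
has_stan_threads_flag <- function(path = file.path(Sys.getenv("HOME"), ".R", "Makevars")) {
  if (!file.exists(path)) return(FALSE)
  lines <- readLines(path, warn = FALSE)
  lines <- sub("#.*$", "", lines)  # strip comments
  # Keep only lines that assign to CXX14FLAGS with "=" or "+="
  flag_lines <- grep("^[[:space:]]*CXX14FLAGS[[:space:]]*\\+?=", lines, value = TRUE)
  any(grepl("-DSTAN_THREADS", flag_lines, fixed = TRUE))
}
```

Calling has_stan_threads_flag() with no argument checks ~/.R/Makevars. A file that only sets the flag on CXXFLAGS would return FALSE.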


#7

Thank you for the discussion on this topic. I’m having issues using rstan threading on a Linux server (CentOS Linux 7; gcc 8.2).

I have included CXX14FLAGS += -DSTAN_THREADS in my Makevars and Sys.setenv(STAN_NUM_THREADS = 4) in my R code before stan gets called.

However, using “top”, I see that my Stan programme containing map_rect is running but only using one core.

My Makevars contains:
CXX14 = g++
CXX14FLAGS = -DSTAN_THREADS
CXX14FLAGS += -O3 -march=native -mtune=native
CXX14FLAGS += -fPIC

Thank you in advance for your help.


#8

How do you determine that it is only using one core? Threading should reveal itself by a CPU load > 100%, though not necessarily #threads * 100%…


#9

Thank you for your reply, ermeel.

I was naively expecting that threading meant parallel computing over multiple cores. I saw that the number of active cores equaled the number of chains, and concluded that threading had not been successfully enabled. Would you say that I was wrong about this interpretation? If so, are there ways to run in parallel on multiple cores as well?

Thank you very much again,


#10

What is the CPU load (“%CPU” above) for each of the chains running in parallel? (I guess you set this via the `cores=` argument directly or globally via `options(mc.cores = )`.)

Note that there are two levels of parallelisation. For example, you could have four chains running in parallel (mc.cores=4), each using 4 threads (STAN_NUM_THREADS=4). Provided you have sufficient resources and the model parallelises well enough, I would expect 4 entries in top, each frequently exceeding 100 in the “%CPU” column.
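To make the two levels concrete, here is a rough sketch in base R of how the thread count adds up. The function name thread_budget is made up for illustration, and the result is only an upper bound, assuming every running chain keeps all of its threads busy.

```r
# Hypothetical helper: upper bound on simultaneously busy threads when up to
# `cores` chains run in parallel, each with `threads_per_chain` threads.
thread_budget <- function(chains, cores, threads_per_chain) {
  parallel_chains <- min(chains, cores)  # chains actually running at once
  parallel_chains * threads_per_chain
}

thread_budget(chains = 4, cores = 1, threads_per_chain = 3)  # 3: chains run one after another
thread_budget(chains = 4, cores = 4, threads_per_chain = 4)  # 16: four processes, four threads each
```

Comparing the result with parallel::detectCores() gives a quick sense of whether the machine would be oversubscribed.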

Something like this (here I set STAN_NUM_THREADS=3 and chains=2):

Maybe @wds15 or @bgoodri can also comment on this.


#11

What you describe sounds all right. With multiple threaded chains you get multiple processes, and each will consume more than 100% CPU.


#12

Thank you for your detailed reply, emreel.

I can see now that the %CPU is consistently around 200%, and my test suggests that threading (STAN_NUM_THREADS=10) allows my programme to complete in approx. half the time.
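For timing comparisons like this, a small sketch in base R that sets STAN_NUM_THREADS only for the duration of one call and then restores the previous value. The wrapper name with_stan_threads is hypothetical (not an rstan function), and it assumes the variable is read when sampling starts rather than once at session start-up.

```r
# Hypothetical wrapper: evaluate `expr` with STAN_NUM_THREADS set to `n`,
# then restore whatever value (or absence) was there before.
with_stan_threads <- function(n, expr) {
  old <- Sys.getenv("STAN_NUM_THREADS", unset = NA)
  Sys.setenv(STAN_NUM_THREADS = n)
  on.exit(
    if (is.na(old)) Sys.unsetenv("STAN_NUM_THREADS")
    else Sys.setenv(STAN_NUM_THREADS = old)
  )
  force(expr)  # expr is evaluated lazily, i.e. after the variable is set
}
```

Something like system.time(with_stan_threads(1, fit_model())) versus system.time(with_stan_threads(10, fit_model())) would then give the comparison, where fit_model() is a placeholder for your own sampling call.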


#13

OK, now that I have figured out how to implement map_rect in a real example, I realized it’s not really using more threads. (I’ve tried it on two computers, mine and a server.)
This is the Makevars on my computer:

CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined
CXXFLAGS+=-flto -Wno-unused-local-typedefs
CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread

This is on the server:

CXX14FLAGS=-O3 -march=native -mtune=native -fPIC
CXX14STD = -std=c++14
CXX14 = g++
CXXFLAGS +=-flto -Wno-unused-local-typedefs
CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread

I’m attaching some tests. The R file generates fake data and fits it with the versions with and without multithreading from the manual:
map_rect.R (1.0 KB)

This is the version with multithreading:
map_rect_exp.stan (883 Bytes)

This is without:
no_map_rect.stan (405 Bytes)

They take more or less the same time on the server (but on my computer the map_rect version is actually much slower). I’m using only one chain to see whether that chain uses more than 100% CPU, but according to top and htop it does not. Both systems run Ubuntu and the latest rstan 2.18.

Any suggestions?