The node has 28 CPUs. If a Stan model contains 6 shards, should I start the sampler with

mpiexec -n 6 ./sampler sample …

when using MPI, or with

export STAN_NUM_THREADS=6
./sampler sample …

when using threading? I don’t want to set the number of threads to -1 if it is not necessary, i.e. when not all 28 CPUs would actually be used (since I am charged for all CPUs allocated to the job, even if not all of them are used).
Thank you
MPI takes priority over threading, so the mpirun -n 6 will do. I don’t recommend combining MPI with threading, though. Just use MPI if that works on your system.
Thanks. Sometimes compilation with MPI breaks (usually after system maintenance, not Stan’s fault) and I have to switch to threading. Should I use STAN_NUM_THREADS=6 then, or 7, with an additional CPU for the process that starts the threads?
Thanks. 6 is just an example. Perhaps I need to formulate the question as:
What should STAN_NUM_THREADS be if there are N shards? I am trying to understand how the threading works. My guess is that the sampler starts N threads when it sees map_rect. Since a thread may be started on the same CPU as the sampler or on a different one, I wonder whether all N threads go to different CPUs, in which case I should allocate one CPU for the sampler plus N CPUs for the shards, or whether the first thread runs on the same CPU as the sampler and only N-1 threads run on different CPUs.
Just don’t worry about this so much. The TBB distributes the work automagically. If you want to know more about the details, read up on parallel_for in the TBB API documentation from Intel.
However, I would be surprised if you needed to put a lot of effort into selecting STAN_NUM_THREADS, really. That’s the point of the TBB: its automatic scheduling takes care of this piece.
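To give a feel for what the TBB does, here is a minimal, self-contained sketch of tbb::parallel_for. This is only an illustration, not Stan’s actual map_rect implementation; the shard count of 6 and the per-shard work are placeholder assumptions. The scheduler splits the index range into chunks and hands them to whichever worker threads are free, so there is no fixed shard-to-CPU mapping to plan for:

#include <tbb/global_control.h>
#include <tbb/parallel_for.h>
#include <cmath>
#include <vector>

int main() {
  // Cap the scheduler at 6 worker threads, analogous to STAN_NUM_THREADS=6
  // (6 is just the example value from this thread).
  tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 6);

  std::vector<double> shard_results(6);  // one slot per "shard"

  // TBB chunks the range [0, 6) and runs the chunks on whichever worker
  // threads are idle; the calling thread also participates, so there is
  // no dedicated "sampler CPU" plus N extra CPUs.
  tbb::parallel_for(std::size_t{0}, shard_results.size(),
                    [&](std::size_t i) {
                      // stand-in for the real per-shard computation
                      shard_results[i] = std::sqrt(static_cast<double>(i));
                    });
  return 0;
}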
The problem is that I don’t want to request 28 CPUs if the sampler uses at most 10. I get charged for what I request, so I would have to pay for 18 CPUs that were never used (and that no other user had access to).
There are two ways to do thread-core affinity mapping: through the TBB and through the OS. I don’t think map_rect exposes TBB’s task_scheduler_observer (but I could be wrong, @wds15), which can pin threads to cores, so maybe your best bet is to do it through the OS (something like sched_setaffinity on Linux). Either way it requires working at the C/C++ level.
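For concreteness, here is a minimal Linux-only sketch of the OS route with sched_setaffinity. It is a hand-rolled illustration under the assumption that you manage the worker threads yourself; it is not something map_rect or the TBB does for you, and the core indices are arbitrary:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for sched_setaffinity on glibc
#endif
#include <sched.h>   // sched_setaffinity, cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdio>
#include <thread>

// Pin the calling thread to one core (0-based index); pid 0 means
// "the calling thread". Returns true on success.
bool pin_to_core(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  return sched_setaffinity(0, sizeof(set), &set) == 0;
}

int main() {
  // Hypothetical example: pin two worker threads to cores 0 and 1.
  std::thread t0([] { pin_to_core(0); /* per-shard work would go here */ });
  std::thread t1([] { pin_to_core(1); /* per-shard work would go here */ });
  t0.join();
  t1.join();
  std::printf("done\n");
  return 0;
}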