The node has 28 CPUs. If a Stan model contains 6 shards, should I start the sampler with

mpiexec -n 6 ./sampler sample …

when using MPI, or with

export STAN_NUM_THREADS=6
./sampler sample …

when using threading? I don’t want to set the number of threads to -1 if it is not necessary, i.e. when not all 28 CPUs would actually be used (since I am charged for all CPUs allocated to the job, even if not all of them are used).
Thank you
MPI takes priority over threading, so the mpirun -n 6 will do. I don’t recommend combining MPI with threading, though. Just use MPI if that works on your system.
Thanks. Sometimes compilation with MPI breaks (usually after system maintenance, not Stan’s fault) and I have to switch to threading. Should I use STAN_NUM_THREADS=6 then, or 7, with an additional CPU for the process that starts the threads?
Thanks. 6 is just an example. Perhaps I need to formulate the question as:
What should STAN_NUM_THREADS be if there are N shards? I am trying to understand how the threading works. My guess is that the sampler starts N threads when it sees map_rect. Since a thread may be started on the same CPU as the sampler or on a different one, I wonder whether all N threads go to different CPUs, in which case I should allocate one CPU for the sampler plus N CPUs for the shards, or whether the first thread runs on the same CPU as the sampler and only N-1 threads run on different CPUs.
Just don’t worry about this so much. The TBB distributes the work automagically. If you want to know more about the details, read up on parallel_for in the TBB API documentation from Intel.
However, I would be surprised if you needed to put a lot of effort into selecting STAN_NUM_THREADS, really. That’s the point of the TBB: its automatic scheduling takes care of this piece.
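To give a feel for what the TBB does, here is a minimal, self-contained sketch of tbb::parallel_for. This is only an illustration, not Stan’s actual map_rect implementation; the shard count of 6 and the per-shard work are placeholder assumptions. The scheduler splits the index range into chunks and hands them to whichever worker threads are free, so there is no fixed shard-to-CPU mapping to plan for:

#include <tbb/global_control.h>
#include <tbb/parallel_for.h>
#include <cmath>
#include <vector>

int main() {
  // Cap the scheduler at 6 worker threads, analogous to STAN_NUM_THREADS=6
  // (6 is just the example value from this thread).
  tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 6);

  std::vector<double> shard_results(6);  // one slot per "shard"

  // TBB chunks the range [0, 6) and runs the chunks on whichever worker
  // threads are idle; the calling thread also participates, so there is
  // no dedicated "sampler CPU" plus N extra CPUs.
  tbb::parallel_for(std::size_t{0}, shard_results.size(),
                    [&](std::size_t i) {
                      // stand-in for the real per-shard computation
                      shard_results[i] = std::sqrt(static_cast<double>(i));
                    });
  return 0;
}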
The problem is that I don’t want to request 28 CPUs if the sampler uses at most 10. I get charged for what I request, so I would have to pay for 18 CPUs that were never used (and that no other user had access to).
There are two ways to do thread-core affinity mapping: through the TBB and through the OS. I don’t think map_rect exposes TBB’s task_scheduler_observer (but I could be wrong, @wds15), which can pin threads to cores, so maybe your best bet is to do it through the OS (something like sched_setaffinity on Linux). Either way it requires working at the C/C++ level.
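For concreteness, here is a minimal Linux-only sketch of the OS route with sched_setaffinity. It is a hand-rolled illustration under the assumption that you manage the worker threads yourself; it is not something map_rect or the TBB does for you, and the core indices are arbitrary:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for sched_setaffinity on glibc
#endif
#include <sched.h>   // sched_setaffinity, cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdio>
#include <thread>

// Pin the calling thread to one core (0-based index); pid 0 means
// "the calling thread". Returns true on success.
bool pin_to_core(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  return sched_setaffinity(0, sizeof(set), &set) == 0;
}

int main() {
  // Hypothetical example: pin two worker threads to cores 0 and 1.
  std::thread t0([] { pin_to_core(0); /* per-shard work would go here */ });
  std::thread t1([] { pin_to_core(1); /* per-shard work would go here */ });
  t0.join();
  t1.join();
  std::printf("done\n");
  return 0;
}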