Help with naming threading argument

rok_cesnovar · May 16, 2020, 5:50pm

Hi,

In preparation for the beta release of the cmdstanR interface, we are synchronizing arguments with cmdstanpy, and a naming issue came up that we wanted your input on. It concerns the name of the argument for within-chain threads.

First, some context. The classic interface of number of chains and number of CPU cores is well known from rstan:

chains: the number of chains you wish to run
cores: maximum number of cores to use for the chains

Meaning that

chains = 2, cores = 2 will run both chains simultaneously if the two cores are available
chains = 2, cores = 1 will run one chain at a time, one after the other
chains = 4, cores = 2 will run at most two chains simultaneously if the two cores are available
chains = 4, cores = 4 will run all four chains simultaneously if the four cores are available

Then comes within-chain parallelization using reduce_sum or map_rect, where you also need to specify the number of threads to use in a single chain. Say we name that argument chain_threads.

We then get the following examples:

chains = 2, cores = 2, chain_threads = 2 will run both chains simultaneously and each chain would use two threads, using four CPU cores if available
chains = 2, cores = 1, chain_threads = 2 will run one chain at a time, one after the other and the running chain would use two threads, using two CPU cores if available
chains = 4, cores = 2, chain_threads = 2 will run at most two chains simultaneously and the running chains would each use two threads, using four CPU cores at a time if available
chains = 4, cores = 4, chain_threads = 2 will run all four chains simultaneously and the running chains would each use two threads, using eight CPU cores if available
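
The four examples above reduce to a small piece of arithmetic. Here is a minimal sketch in Python, assuming one core per thread; the function name is hypothetical and not part of any interface:

```python
def peak_cores_in_use(chains: int, cores: int, chain_threads: int) -> int:
    """Peak CPU cores occupied under the semantics described above.

    `cores` caps how many chains run at once; each running chain
    then uses `chain_threads` threads (one core per thread assumed).
    """
    chains_at_once = min(chains, cores)
    return chains_at_once * chain_threads


# The four (chains, cores) combinations from the examples, with chain_threads = 2.
for chains, cores in [(2, 2), (2, 1), (4, 2), (4, 4)]:
    print(chains, cores, peak_cores_in_use(chains, cores, chain_threads=2))
```

Note that under this reading the peak can exceed `cores`, which is part of what the naming question is about.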

And now to the questions:

Does chain_threads sound confusing?
Is it obvious from the name what is going on?

Would chains = 4, cores = 4, threads = 2 be better?
Do you have any other naming suggestions?

Thanks!

mitzimorris · May 16, 2020, 7:55pm

yes it sounds confusing. not obvious from the name at all.

Would chains = 4, cores = 4, threads = 2 be better?

yes, much better. threads_per_chain? wordy, but explicit.

yizhang · May 17, 2020, 4:24am

HPC schedulers like SGE & PBS use global numbering consistently, while here we are using both global numbering and affinity numbering: chains=4 and cores=4 both indicate the total number of chains/cores, but threads=2 indicates core-thread binding. This is potentially confusing. threads_per_chain is better but we’re still mixing two types of numbering. IMO chains=4, cores=4, threads=8 is more intuitive.

wds15 · May 17, 2020, 9:04am

I also find chain_threads confusing.

We should spend a moment on where this is all going. In the hopefully not so distant future we will have shared warmup features. Even if not, it will be an attractive feature to use 8 cores for 4 chains and let slower chains take advantage of resources freed by faster chains.

So maybe we adapt our current notion a bit. Until now, chains=4 and cores=8 meant that only 4 cores were used and the excess cores sat idle. That could change to mean that we use 2 threads per chain. Whenever it’s not an even split we could issue a warning message for now.

torkar · May 17, 2020, 9:05am

I read somewhere that threads per core is likely not that important for Stan, since we need a rather large cache for the computations? If that is true (i.e., the cores matter the most), it would imply that the only really important factor for a user would be the number of cores. So, if we run four chains (not uncommon…), shouldn’t this be done automagically, i.e., setting chains = 4, cores = 16 would mean that each core computes 25% of a chain and thus we won’t need to use threads in most cases?

wds15 · May 17, 2020, 9:07am

No…if your model uses reduce_sum and your CPU cores have sufficient cache, then you will get more speed in many cases (still depends on the details of the model).

rok_cesnovar · May 17, 2020, 9:32am

Ok, yeah, threads_per_chain is much better. I retract the chain_threads suggestion.

Agreed, it’s why I opened the thread. I would advocate for rstan going in the direction we settle on, but that is not my call to make.

So the current three ideas are

A: chains=4, cores=4, threads_per_chain=2
B: chains=4, cores=4, threads=8 => number of threads per chain is calculated as threads/cores
C: chains=4, cores=8 => number of threads per chain is calculated as cores/chains

What happens if the numbers are not divisible in B and C is a different story. Options are floor, ceil, warning, stop.
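
To make the divisibility question concrete, here is a rough sketch of the four policies named above (floor, ceil, warning, stop). The function name, the policy strings, and the lower bound of one thread are all assumptions for illustration, not a proposal for the actual interface:

```python
import warnings


def split_threads(numerator: int, denominator: int, on_uneven: str = "warn") -> int:
    """Threads per chain for option B (threads/cores) or C (cores/chains)."""
    if on_uneven not in ("floor", "ceil", "warn", "stop"):
        raise ValueError(f"unknown policy: {on_uneven}")
    if numerator < 1 or denominator < 1:
        raise ValueError("both arguments must be positive integers")

    quotient, remainder = divmod(numerator, denominator)
    if remainder == 0:
        return quotient
    if on_uneven == "stop":
        raise ValueError(f"{numerator} is not divisible by {denominator}")
    if on_uneven == "ceil":
        return quotient + 1
    if on_uneven == "warn":
        warnings.warn(f"{numerator} is not divisible by {denominator}; rounding down")
    # "floor" and "warn" both round down; never go below one thread (assumption).
    return max(1, quotient)


print(split_threads(8, 4))           # option B: threads=8, cores=4
print(split_threads(8, 4))           # option C: cores=8, chains=4
print(split_threads(9, 4, "floor"))  # uneven split, rounded down
```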

torkar · May 17, 2020, 9:33am

Sebastian, how about the case when we run everything in the cloud, i.e., when we have a bunch of VCPUs?

processor	: 15
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Skylake, IBRS)
stepping	: 4
microcode	: 0x1
cpu MHz		: 2095.074
cache size	: 16384 KB
physical id	: 15
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 15
initial apicid	: 15
fpu		: yes
fpu_exception	: yes
cpuid level	: 13

mitzimorris · May 17, 2020, 4:54pm

option A: yes
option B: no!
option C: not sure.

we need to define default behavior for when user only specifies chains. as currently implemented in CmdStanPy, which only has chains and cores, this is the spec for cores:

cores – Number of processes to run in parallel. Must be an integer between 1 and the number of CPUs in the system. If none then set automatically to chains but no more than total_cpu_count - 2
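
As a rough illustration of that default (not CmdStanPy's actual code), the rule could be sketched as below. The function name is hypothetical, and the lower bound of 1 for machines with two or fewer CPUs is an assumption the spec does not state:

```python
import os
from typing import Optional


def default_cores(chains: int, cpu_count: Optional[int] = None) -> int:
    """Default for `cores` when the user only specifies `chains`."""
    if cpu_count is None:
        # os.cpu_count() can return None on some platforms.
        cpu_count = os.cpu_count() or 1
    # "set automatically to chains but no more than total_cpu_count - 2"
    return max(1, min(chains, cpu_count - 2))


print(default_cores(4, cpu_count=8))  # enough CPUs: one process per chain
print(default_cores(4, cpu_count=4))  # capped at cpu_count - 2
```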

so we need:

good user-facing doc for either option A or C
consistent implementation across the CmdStanX interfaces, decisions on defaults, warnings vs. errors
design for CmdStan3

agreed that we need guidelines for cloud and cluster computing (SLURM anyone?). on the Columbia HPC cluster, one can request a node (single machine, 24 cores) or a number of cores, not necessarily on the same node.

avehtari · May 18, 2020, 8:10am

I find it very confusing that if I define cores = 4, more than 4 cores would be used. Here it seems like there should be an option simultaneous_chains = 4 to make it sensible to run on eight cores. However, I would keep the cores = 4 option but really let it mean that a maximum of 4 cores are used, so chains = 4, cores = 4, threads_per_chain = 2 would run two chains simultaneously and the running chains would each use two threads, using four cores if available.

rok_cesnovar · May 18, 2020, 9:42am

Good point and a great idea.

To reiterate (I am repeating this again and again so we don’t lose track):

A: chains=4 , cores=4 , threads_per_chain=2

If available, threads_per_chain*cores CPU cores are used for sampling.

B: chains=4 , cores=4 , threads=8

The number of threads for a chain is calculated as threads/cores. If available, threads CPU cores are used for sampling. If the number is not divisible warn the user (with floor or ceil for threads/cores) or stop execution.

C: chains=4 , cores=8 => number of threads is calculated as cores/chains

The number of threads for a chain is calculated as cores/chains. If available, cores CPU cores are used for sampling. If the number is not divisible warn the user (with floor or ceil for cores/chains) or stop execution.

D: chains=4 , cores=4 , threads_per_chain=2

Contrary to A, this means that the maximum number of CPU cores in use is specified by cores. In this example we would first run 2 chains each with 2 threads, followed by the remaining two chains.
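
Option D can be sketched the same way. This is only an illustration with a hypothetical function name; rejecting threads_per_chain > cores is an assumption, since that case has not been discussed:

```python
from typing import Tuple


def plan_option_d(chains: int, cores: int, threads_per_chain: int) -> Tuple[int, int]:
    """Return (chains running at once, CPU cores in use) under option D."""
    if threads_per_chain > cores:
        # Assumption: a single chain may not exceed the overall core budget.
        raise ValueError("threads_per_chain cannot exceed cores")
    # `cores` is a hard cap, so only this many chains fit at the same time.
    chains_at_once = min(chains, cores // threads_per_chain)
    return chains_at_once, chains_at_once * threads_per_chain


print(plan_option_d(chains=4, cores=4, threads_per_chain=2))  # two chains at a time
print(plan_option_d(chains=4, cores=8, threads_per_chain=2))  # all four chains at once
```

Unlike A, the number of cores in use never exceeds `cores`, and no rounding policy is needed.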

Of the current ones I prefer D. I like it because we do not have to deal with warnings/stops, and the maximum number of cores in use is specified by cores, which seems more obvious than A. It works for our current approach (all chains get the same number of threads), and at first glance I feel like it’s also future-proof for if/when we get shared warmup or just support for multiple chains within a single executable.

That would allow “resource stealing” and chains with uneven numbers of threads. chains=4 , cores=5 , threads_per_chain=2 would in that context mean: use 5 cores, but not more than 2 threads per chain. Omitting the threads_per_chain would mean: do what you want with the 5 cores. This might be what @wds15 was looking for.

wds15 · May 18, 2020, 12:29pm

Yeah, I view this from the perspective of having our Intel TBB threadpool. From this angle it makes sense to just say what the total resources are. Thus “cores” should specify the total number of cores being used, I think.

I like @avehtari’s suggestion, and I wonder whether we would also need a simultaneous_chains option as suggested.

The thing is, we need to say what the totals are (chains, CPUs) and what we want to run simultaneously. Maybe it’s better to have two additional options (threads_per_chain & simultaneous_chains) with sensible defaults, which gives users full control if they really want it.

(admittedly I would prefer num_chains & num_cores for the totals, and threads_per_chain/ cores_per_chain & simultaneous_chains for what we use as it executes… but that’s neglecting where we come from)

rok_cesnovar · May 27, 2020, 8:05am

Let’s go with @avehtari’s suggestion then: chains=4, cores=4, threads_per_chain=2, meaning 4 cores will be used at maximum, with 2 threads per chain.

simultaneous_chains can be added later if we find it to be useful. It doesn’t really change the meaning of chains , cores or threads_per_chain.

mitzimorris · May 28, 2020, 7:01pm

@rok_cesnovar - how exactly can you limit threads per chain via CmdStan?

asking myself this after answering Shira’s question about running CmdStan here: Cmdstanr reduce sum case study, but: unused argument (threads = TRUE)

CmdStanPy is using Python’s subprocess module to spawn new processes - this is limited by cores, not threads. if we compile with STAN_THREADS=TRUE, then TBB threading handles threads, but I don’t see anywhere where you get to actually specify how many threads.
how does threads_per_chain work in CmdStanR?

wds15 · May 28, 2020, 7:27pm

You have to define the environment variable STAN_NUM_THREADS to the desired number of threads for the chain you start.

EDIT adding a proper way of defining it via the services would be great, but so far the services are thread agnostic.
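
For an interface that launches chains with Python's subprocess module, passing the variable could look roughly like the sketch below. The helper name is hypothetical, and the child command is only a stand-in that echoes the variable; in practice it would be the compiled model executable with its usual arguments:

```python
import os
import subprocess
import sys
from typing import List


def run_with_stan_threads(cmd: List[str], num_threads: int) -> subprocess.CompletedProcess:
    """Run `cmd` with STAN_NUM_THREADS set for that child process only."""
    # Copy the parent environment and add the variable for this one chain.
    env = dict(os.environ, STAN_NUM_THREADS=str(num_threads))
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in for a compiled model: prints the value the child process sees.
stand_in = [sys.executable, "-c", "import os; print(os.environ['STAN_NUM_THREADS'])"]
result = run_with_stan_threads(stand_in, num_threads=2)
print(result.stdout.strip())
```

Setting the variable per child process, rather than globally in the parent, keeps the thread count specific to the chain being started.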

ssp3nc3r · June 12, 2020, 8:08pm

threads_per_chain seems to be ignored in the latest version of the sampling function (0.0.0.9004). Before cmdstanr was updated, set_num_threads() worked fine but now using that function throws an error. @wds15, do I need to set STAN_NUM_THREADS separately from threads_per_chain?

rok_cesnovar · June 12, 2020, 8:27pm

Can you post your call of $sample()? Set num threads was deprecated in favor of the new interface.

ssp3nc3r · June 12, 2020, 8:31pm

The following now only engages 4 cores:

fit <- m$sample(data = dat,
                adapt_delta = 0.8,
                max_treedepth = 10,
                chains = 4,
                parallel_chains = 4,
                iter_warmup = 500,
                iter_sampling = 1000,
                threads_per_chain = 15,
                seed = 123,
                refresh = 15
)

whereas in the previous version setting set_num_threads(15) engaged 60 cores (4 * 15).

rok_cesnovar · June 12, 2020, 8:35pm

You no longer need cores, you did not specify the number of chains though.

It should be
chains = 4,
parallel_chains = 4,
threads_per_chain = 15
and no cores.

Can you try that? Will check in the morning if we introduced any bugs. It’s totally possible.

ssp3nc3r · June 12, 2020, 8:36pm

I edited my sample; I had accidentally typed it into the forum incorrectly.
