In preparation for the beta release of the CmdStanR interface, we are synchronizing arguments with CmdStanPy, and a naming issue came up that we would like everyone’s input on: what to call the argument for within-chain threads.
First, some more context. The classic interface with a number of chains and a number of CPU cores is well known from rstan:
chains: the number of chains you wish to run
cores: maximum number of cores to use for the chains
Meaning that
chains = 2, cores = 2 will run both chains simultaneously if the two cores are available
chains = 2, cores = 1 will run one chain at a time, one after the other
chains = 4, cores = 2 will run at most two chains simultaneously if the two cores are available
chains = 4, cores = 4 will run all four chains simultaneously if the four cores are available
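As a rough illustration of the chains/cores semantics listed above, here is a minimal Python sketch. The helper `chain_batches` is hypothetical (it is not part of rstan, CmdStanR, or CmdStanPy), and it simplifies scheduling to fixed waves; a real runner would presumably start the next chain as soon as a core frees up.

```python
def chain_batches(chains, cores):
    """Group chain ids into waves of at most `cores` chains that run together.

    Hypothetical helper for illustration only; assumes every chain in a
    wave finishes before the next wave starts.
    """
    if chains < 1 or cores < 1:
        raise ValueError("chains and cores must be positive integers")
    ids = list(range(1, chains + 1))
    return [ids[i:i + cores] for i in range(0, chains, cores)]


# The four cases from the list above.
for chains, cores in [(2, 2), (2, 1), (4, 2), (4, 4)]:
    print(f"chains={chains}, cores={cores}: {chain_batches(chains, cores)}")
```

Each inner list is a set of chains that run at the same time, so chains = 4, cores = 2 gives two waves of two chains.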
Then there is within-chain parallelization using reduce_sum or map_rect, where you also need to specify the number of threads to use within a single chain. Say we name that argument chain_threads.
We then get the following examples:
chains = 2, cores = 2, chain_threads = 2 will run both chains simultaneously and each chain would use two threads, using four CPU cores if available
chains = 2, cores = 1, chain_threads = 2 will run one chain at a time, one after the other and the running chain would use two threads, using two CPU cores if available
chains = 4, cores = 2, chain_threads = 2 will run at most two chains simultaneously and the running chains would each use two threads, using four CPU cores at a time if available
chains = 4, cores = 4, chain_threads = 2 will run all four chains simultaneously and the running chains would each use two threads, using eight CPU cores if available
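The arithmetic behind these four examples can be written down in a few lines. This is only a sketch of the semantics described in this post; `peak_cpu_cores` is a made-up name, not an existing function in any Stan interface.

```python
def peak_cpu_cores(chains, cores, chain_threads):
    """CPU cores busy at once when `cores` limits the number of chains
    running simultaneously and every running chain uses `chain_threads`
    threads (the semantics of the examples above)."""
    running_chains = min(chains, cores)
    return running_chains * chain_threads


for chains, cores in [(2, 2), (2, 1), (4, 2), (4, 4)]:
    used = peak_cpu_cores(chains, cores, chain_threads=2)
    print(f"chains={chains}, cores={cores}, chain_threads=2 -> {used} CPU cores")
```

Note that under this reading the number of CPU cores in use can exceed the value passed as cores.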
And now to the questions:
Does chain_threads sound confusing?
Is it obvious from the name what is going on?
Would chains = 4, cores = 4, threads = 2 be better?
Do you have any other naming suggestions?
HPC schedulers like SGE and PBS use global numbering consistently, while here we are using both global numbering and affinity numbering: chains=4 and cores=4 both indicate the total number of chains/cores, but threads=2 indicates core-thread binding. This is potentially confusing. threads_per_chain is better, but we’re still mixing two types of numbering. IMO chains=4, cores=4, threads=8 is more intuitive.
We should spend a moment on where this is all going. In the hopefully not so distant future we will have shared warmup features. Even if not, it will be an attractive feature to use 8 cores for 4 chains and let slower chains take advantage of resources freed by faster chains.
So maybe we adapt our current notion a bit. So far, when you specified chains=4 and cores=8, just 4 cores were used and the excess cores sat idle. Maybe this now changes to mean that we use 2 threads per chain. Whenever it’s not an even split, we could issue a warning message for now.
I read somewhere that threads per core is likely not that important for Stan, since we need a rather large cache for the computations? If that is true (i.e., the cores matter the most), it would imply that the only really important factor for a user would be the number of cores. So, if we run four chains (not uncommon…), shouldn’t this be done automagically, i.e., setting chains=4, cores=16 would mean that each core computes 25% of a chain, and thus we won’t need to use threads in most cases?
No… if your model uses reduce_sum and your CPU cores have sufficient cache, then you will get more speed in many cases (it still depends on the details of the model).
Ok, yeah, threads_per_chain is much better. I retract the chain_threads suggestion.
Agreed, that’s why I opened the thread. I would advocate for rstan going in the direction we settle on. But that is not my call to make.
So the current three ideas are
A: chains=4, cores=4, threads_per_chain=2
B: chains=4, cores=4, threads=8 => number of threads per chain is calculated as threads/cores
C: chains=4, cores=8 => number of threads per chain is calculated as cores/chains
What happens if the numbers are not divisible in B and C is a different story. Options are floor, ceil, warning, stop.
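To make the floor/ceil/warning/stop choices concrete, here is a small sketch. `split_evenly` and its `rounding` and `warn` arguments are hypothetical names, and clamping the result to at least one thread is my own assumption, not something decided in this thread.

```python
import warnings


def split_evenly(total, groups, rounding="floor", warn=True):
    """Derive threads per chain as total/groups.

    Option B: split_evenly(threads, cores)
    Option C: split_evenly(cores, chains)
    `rounding` is "floor", "ceil", or "stop" and only matters when the
    split is not even.
    """
    if total < 1 or groups < 1:
        raise ValueError("total and groups must be positive integers")
    if rounding not in ("floor", "ceil", "stop"):
        raise ValueError("rounding must be 'floor', 'ceil' or 'stop'")
    quotient, remainder = divmod(total, groups)
    if remainder == 0:
        return quotient
    if rounding == "stop":
        raise ValueError(f"{total} is not divisible by {groups}")
    if warn:
        warnings.warn(f"{total} is not divisible by {groups}; using {rounding}")
    # Assumption: never return fewer than one thread per chain.
    return quotient + 1 if rounding == "ceil" else max(1, quotient)


print(split_evenly(8, 4))  # option B: threads=8, cores=4
print(split_evenly(8, 4))  # option C: cores=8, chains=4
print(split_evenly(6, 4, rounding="ceil", warn=False))  # uneven split
```

With floor, some requested cores stay idle; with ceil, more threads are started than cores were requested, which is part of why the choice matters.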
We need to define the default behavior for when the user only specifies chains. As currently implemented in CmdStanPy, which only has chains and cores, this is the spec for cores:
cores – Number of processes to run in parallel. Must be an integer between 1 and the number of CPUs in the system. If None, then set automatically to chains but no more than total_cpu_count - 2
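Read literally, that spec amounts to something like the sketch below. This is not CmdStanPy’s actual code; `default_cores` is a hypothetical name, and the lower bound of 1 for machines with two or fewer CPUs is my assumption, since the spec does not say what happens there.

```python
import os


def default_cores(chains, cores=None, total_cpu_count=None):
    """Resolve `cores` following the spec quoted above (illustrative only)."""
    if total_cpu_count is None:
        total_cpu_count = os.cpu_count() or 1
    if cores is None:
        # "set automatically to chains but no more than total_cpu_count - 2"
        # Assumption: never go below 1 on small machines.
        return max(1, min(chains, total_cpu_count - 2))
    if not 1 <= cores <= total_cpu_count:
        raise ValueError(
            f"cores must be between 1 and {total_cpu_count}, got {cores}"
        )
    return cores


print(default_cores(chains=4, total_cpu_count=8))
print(default_cores(chains=4, total_cpu_count=4))
```

Whatever we pick for threads would need a default that composes with this one.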
so we need:
good user-facing doc for either option A or C
consistent implementation across the CmdStanX interfaces, decisions on defaults, warnings vs. errors
design for CmdStan3
Agreed that we need guidelines for cloud and cluster computing (SLURM anyone?). On the Columbia HPC cluster, one can request a node (single machine, 24 cores) or a number of cores, not necessarily on the same node.
I find it very confusing that if I define cores = 4, more than 4 cores would be used. Here it seems like there should be an option simultaneous_chains = 4 to make it sensible to run on eight cores. However, I would keep the cores = 4 option but really let it mean that at most 4 cores are used. Then chains = 4, cores = 4, threads_per_chain = 2 would run two chains simultaneously, and the running chains would each use two threads, using four cores if available.
To reiterate (I am repeating this again and again so we don’t lose track):
A: chains=4, cores=4, threads_per_chain=2
If available, threads_per_chain*cores CPU cores are used for sampling.
B: chains=4, cores=4, threads=8
The number of threads for a chain is calculated as threads/cores. If available, threads CPU cores are used for sampling. If the number is not divisible, warn the user (with floor or ceil for threads/cores) or stop execution.
C: chains=4, cores=8
The number of threads for a chain is calculated as cores/chains. If available, cores CPU cores are used for sampling. If the number is not divisible, warn the user (with floor or ceil for cores/chains) or stop execution.
D: chains=4, cores=4, threads_per_chain=2
Contrary to A, this means that the maximum number of CPU cores in use is specified by cores. In this example we would first run 2 chains, each with 2 threads, followed by the remaining two chains.
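For option D, the number of chains that can run at once falls out of cores and threads_per_chain. A minimal sketch, assuming fixed waves of chains and assuming that threads_per_chain larger than cores is an error (neither is settled in this thread); `plan_option_d` is a hypothetical name.

```python
def plan_option_d(chains, cores, threads_per_chain):
    """Return (simultaneous_chains, waves) when `cores` caps total CPU use."""
    if chains < 1 or cores < 1 or threads_per_chain < 1:
        raise ValueError("all arguments must be positive integers")
    if threads_per_chain > cores:
        # Assumption: refuse rather than silently lowering threads_per_chain.
        raise ValueError("threads_per_chain cannot exceed cores")
    simultaneous = min(chains, cores // threads_per_chain)
    ids = list(range(1, chains + 1))
    waves = [ids[i:i + simultaneous] for i in range(0, chains, simultaneous)]
    return simultaneous, waves


simultaneous, waves = plan_option_d(chains=4, cores=4, threads_per_chain=2)
print(simultaneous, waves)
```

For chains=4, cores=4, threads_per_chain=2 this gives two chains at a time, so at most 2*2 = 4 CPU cores are busy, whereas the same arguments under A would use 8.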
Of the current ones I prefer D. I like it because we do not have to deal with warnings/stops, and the maximum number of cores used is specified by cores, which seems more obvious than A. It works for our current approach (all chains get the same number of threads), and at first glance I feel like it’s also future-proof for if/when we get shared warmup or just support for multiple chains within a single executable.
That would allow “resource stealing” and chains with uneven numbers of threads. chains=4 , cores=5 , threads_per_chain=2 would in that context mean: use 5 cores, but not more than 2 threads per chain. Omitting the threads_per_chain would mean: do what you want with the 5 cores. This might be what @wds15 was looking for.
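Purely as a thought experiment for that future setup, the sketch below hands out cores to the chains that are still running, one at a time, up to an optional per-chain cap. Nothing like this exists yet; `allocate_threads` and the round-robin policy are assumptions for illustration only.

```python
def allocate_threads(active_chains, cores, max_threads_per_chain=None):
    """Distribute `cores` threads round-robin over the active chains.

    With max_threads_per_chain=None there is no per-chain cap
    ("do what you want with the cores").
    """
    allocation = [0] * active_chains
    remaining = cores
    progressed = True
    while remaining > 0 and progressed:
        progressed = False
        for i in range(active_chains):
            if remaining == 0:
                break
            capped = (
                max_threads_per_chain is not None
                and allocation[i] >= max_threads_per_chain
            )
            if not capped:
                allocation[i] += 1
                remaining -= 1
                progressed = True
    return allocation


print(allocate_threads(4, cores=5, max_threads_per_chain=2))  # all chains running
print(allocate_threads(2, cores=5, max_threads_per_chain=2))  # two chains finished
print(allocate_threads(2, cores=5))                           # no cap
```

Once faster chains finish, the same call with fewer active chains shows how the remaining chains could pick up the freed cores, up to the cap.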
Yeah, I view this from the perspective of having our Intel TBB threadpool. From this angle it makes sense to just say what the total resources are. Thus “cores” should specify the total number of cores being used, I think.
I like @avehtari’s suggestion, and I also wonder whether we would need a simultaneous_chains option as suggested.
The thing is, we need to say what the totals are (chains, CPUs) and what we want to run simultaneously. Maybe it’s better to have two additional options (threads_per_chain & simultaneous_chains) with sensible defaults, which gives users full control if they really want it.
(admittedly I would prefer num_chains & num_cores for the totals, and threads_per_chain/ cores_per_chain & simultaneous_chains for what we use as it executes… but that’s neglecting where we come from)
CmdStanPy is using Python’s subprocess module to spawn new processes; this is limited by cores, not threads. If we compile with STAN_THREADS=TRUE, then TBB threading handles threads, but I don’t see anywhere where you get to actually specify how many threads.
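One way a subprocess-based interface could pass a per-chain thread count is through the child process environment. The sketch below assumes the executable reads the STAN_NUM_THREADS environment variable (the variable asked about further down in this thread); whether that holds should be checked against the CmdStan version in use. A tiny Python child stands in for the compiled model, and both function names are hypothetical.

```python
import os
import subprocess
import sys


def env_with_threads(threads_per_chain):
    """Copy the current environment and set STAN_NUM_THREADS for the child."""
    env = os.environ.copy()
    env["STAN_NUM_THREADS"] = str(threads_per_chain)
    return env


def run_stand_in_chain(threads_per_chain):
    """Launch a stand-in child process (not a real Stan model) and return
    the STAN_NUM_THREADS value it sees."""
    cmd = [
        sys.executable,
        "-c",
        "import os; print(os.environ['STAN_NUM_THREADS'])",
    ]
    result = subprocess.run(
        cmd,
        env=env_with_threads(threads_per_chain),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(run_stand_in_chain(2))
```

Setting the variable per child, rather than globally, would leave room for chains with different thread counts later.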
How does threads_per_chain work in CmdStanR?
threads_per_chain seems to be ignored in the latest version of the sampling function (0.0.0.9004). Before cmdstanr was updated, set_num_threads() worked fine but now using that function throws an error. @wds15, do I need to set STAN_NUM_THREADS separately from threads_per_chain?