Help with naming threading argument

I also find chain_threads confusing.

We should spend a moment on where this is all going. In the hopefully not so distant future we will have shared warmup features. Even if not, it will be an attractive feature to use 8 cores for 4 chains and let slower chains take advantage of resources freed by faster chains.

So maybe we adapt our current notion a bit. So far, when you specified chains=4 and cores=8, only 4 cores were used and the excess cores sat idle. Maybe this now changes to mean that we use 2 threads per chain. Whenever it’s not an even split, we could issue a warning message for now.

I read somewhere that threads per core is likely not that important for Stan, since we need a rather large cache for the computations? If that is true (i.e., the cores matter the most), it would imply that the only really important factor for a user would be the number of cores. So, if we run four chains (not uncommon…), shouldn’t this be done automagically, i.e., setting chains=4, cores=16 would mean that each core computes 25% of a chain and thus we won’t need to use threads in most cases?

No… if your model uses reduce_sum and your CPU cores have sufficient cache, then you will get more speed in many cases (it still depends on the details of the model).

Ok, yeah, threads_per_chain is much better. I retract the chain_threads suggestion.

Agreed, it’s why I opened the thread. I would advocate for rstan going in the direction we settle on. But that is not my call to make.

So the current three ideas are

A: chains=4 , cores=4 , threads_per_chain=2
B: chains=4 , cores=4 , threads=8 => number of threads per chain is calculated as threads/cores
C: chains=4 , cores=8 => number of threads per chain is calculated as cores/chains

What happens if the numbers are not divisible in B and C is a different story. Options are floor, ceil, warning, stop.
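
The arithmetic behind options B and C, including the four rounding policies mentioned above, can be sketched in a few lines of Python. This is only an illustration: the function name threads_per_chain and the policy argument are hypothetical, and treating "warning" as "floor plus a warning" (never dropping below one thread) is an assumption, not something this thread has settled on.

```python
import warnings


def threads_per_chain(total, groups, policy="warn"):
    """Split `total` (threads in B, cores in C) over `groups` (cores in B, chains in C).

    policy is one of "floor", "ceil", "warn" (floor plus a warning) or "stop".
    """
    if total < 1 or groups < 1:
        raise ValueError("total and groups must be positive integers")
    quotient, remainder = divmod(total, groups)
    if remainder == 0:
        return quotient  # even split, nothing to decide
    floored = max(1, quotient)  # assumption: never go below one thread
    if policy == "floor":
        return floored
    if policy == "ceil":
        return quotient + 1
    if policy == "warn":
        warnings.warn(
            f"{total} is not divisible by {groups}; using {floored} thread(s) per chain"
        )
        return floored
    if policy == "stop":
        raise ValueError(f"{total} is not divisible by {groups}")
    raise ValueError(f"unknown policy: {policy!r}")


# Option B: chains=4, cores=4, threads=8 -> threads / cores
print(threads_per_chain(8, 4))
# Option C: chains=4, cores=8 -> cores / chains
print(threads_per_chain(8, 4))
```

Both example calls come out at 2 threads per chain; the policies only matter for uneven inputs such as 10 cores for 4 chains.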

Sebastian, how about the case when we run everything in the cloud, i.e., when we have a bunch of VCPUs?

processor	: 15
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Skylake, IBRS)
stepping	: 4
microcode	: 0x1
cpu MHz		: 2095.074
cache size	: 16384 KB
physical id	: 15
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 15
initial apicid	: 15
fpu		: yes
fpu_exception	: yes
cpuid level	: 13

option A: yes
option B: no!
option C: not sure.

we need to define default behavior for when the user only specifies chains. as currently implemented in CmdStanPy, which only has chains and cores, this is the spec for cores:

  • cores – Number of processes to run in parallel. Must be an integer between 1 and the number of CPUs in the system. If none then set automatically to chains but no more than total_cpu_count - 2
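
Read literally, that spec amounts to the following Python sketch. It is not CmdStanPy's actual code: the function name default_cores and the total_cpu_count parameter are hypothetical, and clamping the result to at least 1 on machines with one or two CPUs is an assumption the spec does not state.

```python
import os


def default_cores(chains, cores=None, total_cpu_count=None):
    """Resolve the number of parallel processes as described in the quoted spec."""
    if total_cpu_count is None:
        # os.cpu_count() may return None, so fall back to a single CPU
        total_cpu_count = os.cpu_count() or 1
    if cores is None:
        # "set automatically to chains but no more than total_cpu_count - 2"
        # Assumption: never return less than 1 on very small machines.
        return max(1, min(chains, total_cpu_count - 2))
    if not 1 <= cores <= total_cpu_count:
        raise ValueError(
            f"cores must be between 1 and {total_cpu_count}, got {cores}"
        )
    return cores


# 4 chains on a hypothetical 8-CPU machine, cores not specified
print(default_cores(chains=4, total_cpu_count=8))
```

On a 4-CPU machine the same call would resolve to 2, which is one reason the default needs to be documented clearly.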

so we need:

  • good user-facing doc for either option A or C
  • consistent implementation across the CmdStanX interfaces, decisions on defaults, warnings vs. errors
  • design for CmdStan3

agreed that we need guidelines for cloud and cluster computing (SLURM anyone?). on the Columbia HPC cluster, one can request a node (single machine, 24 cores) or a number of cores, not necessarily on the same node.

I find it very confusing that if I define cores = 4, more than 4 cores would be used. Here it seems like there should be an option simultaneous_chains = 4 to make it sensible to run on eight cores. However, I would keep the cores = 4 option but really let it mean that a maximum of 4 cores are used. Then chains = 4, cores = 4, threads_per_chain = 2 would run two chains simultaneously, and the running chains would each use two threads, using four cores if available.

Good point and a great idea.

To reiterate (I am repeating this again and again so we don’t lose track):

  • A: chains=4 , cores=4 , threads_per_chain=2

If available, threads_per_chain*cores CPU cores are used for sampling.

  • B: chains=4 , cores=4 , threads=8

The number of threads for a chain is calculated as threads/cores. If available, threads CPU cores are used for sampling. If the number is not divisible warn the user (with floor or ceil for threads/cores) or stop execution.

  • C: chains=4 , cores=8

The number of threads for a chain is calculated as cores/chains. If available, cores CPU cores are used for sampling. If the number is not divisible, warn the user (with floor or ceil for cores/chains) or stop execution.

  • D: chains=4 , cores=4 , threads_per_chain=2

Contrary to A, this means that the maximum number of CPU cores in use is specified by cores. In this example we would first run 2 chains, each with 2 threads, followed by the remaining two chains.
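
A small Python sketch of how option D turns the three arguments into an execution plan. The function name plan_option_d is hypothetical, and running chains in fixed batches is a simplification: a real scheduler would more likely start the next chain as soon as any running chain finishes.

```python
def plan_option_d(chains, cores, threads_per_chain):
    """Group chain ids into batches so that no batch uses more than `cores` threads."""
    if chains < 1 or cores < 1 or threads_per_chain < 1:
        raise ValueError("all arguments must be positive integers")
    if threads_per_chain > cores:
        raise ValueError("threads_per_chain cannot exceed cores")
    # How many chains fit into the core budget at once
    simultaneous = min(chains, cores // threads_per_chain)
    chain_ids = list(range(1, chains + 1))
    return [
        chain_ids[start:start + simultaneous]
        for start in range(0, chains, simultaneous)
    ]


# Option D example: chains=4, cores=4, threads_per_chain=2
for batch in plan_option_d(chains=4, cores=4, threads_per_chain=2):
    print(batch)
```

For the example above this yields two batches of two chains, so at most 4 cores are busy at any time.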

Of the current ones I prefer D. I like it because we do not have to deal with warnings/stops, and the maximum number of cores used is specified by cores, which seems more obvious than A. It works for our current approach (all chains get the same number of threads), and at first glance I feel like it’s also future-proof for if/when we get shared warmup or just support for multiple chains within a single executable.

That would allow “resource stealing” and chains with uneven numbers of threads. chains=4 , cores=5 , threads_per_chain=2 would in that context mean: use 5 cores, but not more than 2 threads per chain. Omitting the threads_per_chain would mean: do what you want with the 5 cores. This might be what @wds15 was looking for.

Yeah, I view this from the perspective of having our Intel TBB threadpool. From this angle it makes sense to just say what the total resources are. Thus “cores” should specify the total number of cores being used, I think.

I like @avehtari’s suggestion, and I wonder if we would also need a simultaneous_chains option as suggested?

The thing is, we need to say what the totals are (chains, CPUs) and what we want to run simultaneously. Maybe it’s better to have two additional options (threads_per_chain & simultaneous_chains) with sensible defaults, which gives users full control if they really want it.

(admittedly I would prefer num_chains & num_cores for the totals, and threads_per_chain/ cores_per_chain & simultaneous_chains for what we use as it executes… but that’s neglecting where we come from)

Let’s go with @avehtari’s suggestion then: chains=4 , cores=4 , threads_per_chain=2 meaning 4 cores will be used at maximum, with 2 threads per chain.

simultaneous_chains can be added later if we find it to be useful. It doesn’t really change the meaning of chains , cores or threads_per_chain.

@rok_cesnovar - how exactly can you limit threads per chain via CmdStan?

asking myself this after answering Shira’s question about running CmdStan here: Cmdstanr reduce sum case study, but: unused argument (threads = TRUE)

CmdStanPy is using Python’s subprocess module to spawn new processes - this is limited by cores, not threads. if we compile with STAN_THREADS=TRUE, then TBB threading handles threads, but I don’t see anywhere where you get to actually specify how many threads.
how does threads_per_chain work in CmdStanR?

You have to set the environment variable STAN_NUM_THREADS to the desired number of threads for the chain you start.
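
For an interface that launches each chain through Python’s subprocess module, setting the variable per chain could look roughly like the sketch below. The helper names chain_env and run_with_threads are hypothetical, and the child command is a stand-in that only echoes the variable; a real call would pass the compiled model executable and its arguments instead.

```python
import os
import subprocess
import sys


def chain_env(threads_per_chain, base_env=None):
    """Copy the environment and set STAN_NUM_THREADS for one chain's process."""
    env = dict(os.environ if base_env is None else base_env)
    env["STAN_NUM_THREADS"] = str(threads_per_chain)  # env values must be strings
    return env


def run_with_threads(command, threads_per_chain):
    """Run one chain's command with its own STAN_NUM_THREADS setting."""
    return subprocess.run(
        command,
        env=chain_env(threads_per_chain),
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in for a model executable: a child Python process that echoes the variable.
stand_in = [sys.executable, "-c", "import os; print(os.environ['STAN_NUM_THREADS'])"]
result = run_with_threads(stand_in, threads_per_chain=2)
print(result.stdout.strip())
```

Because the variable is set on a copy of the environment, the parent process and any other chains keep their own settings.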

EDIT: Adding a proper way of defining it via the services would be great, but so far the services are thread agnostic.

threads_per_chain seems to be ignored in the latest version of the sampling function (0.0.0.9004). Before cmdstanr was updated, set_num_threads() worked fine but now using that function throws an error. @wds15, do I need to set STAN_NUM_THREADS separately from threads_per_chain?

Can you post your call of $sample()? set_num_threads() was deprecated in favor of the new interface.

The following now only engages 4 cores:

fit <- m$sample(data = dat,
                adapt_delta = 0.8,
                max_treedepth = 10,
                chains = 4,
                parallel_chains = 4,
                iter_warmup = 500,
                iter_sampling = 1000,
                threads_per_chain = 15,
                seed = 123,
                refresh = 15
)

whereas in the previous version setting set_num_threads(15) engaged 60 cores (4 * 15).

You no longer need cores. You did not specify the number of chains, though.

It should be
chains = 4,
parallel_chains = 4,
threads_per_chain = 15
and no cores.

Can you try that? Will check in the morning if we introduced any bugs. It’s totally possible.

I edited my sample; I had accidentally typed it incorrectly in the forum.

Ok, that does look right. Will check tomorrow and get back to you. For now call

Sys.setenv("STAN_NUM_THREADS" = 15)

before the sample call and that should hopefully fix it.

That worked. Thanks for the quick workaround!

The fix was merged (cmdstanr 0.0.0.9005). Thanks again for the report!
