Yes, these work (True, None).
So now the problem is on macOS?
Tested the same model with PyStan3, and after fixing some small bugs (a large-data error) threading works as it should.
import stan

model_code = """
functions {
  real partial_sum(int[] slice_n_redcards,
                   int start, int end,
                   int[] n_games,
                   vector rating,
                   vector beta) {
    return binomial_logit_lpmf(slice_n_redcards |
                               n_games[start:end],
                               beta[1] + beta[2] * rating[start:end]);
  }
}
data {
  int<lower=0> N;
  int<lower=0> n_redcards[N];
  int<lower=0> n_games[N];
  vector[N] rating;
  int<lower=1> grainsize;
}
parameters {
  vector[2] beta;
}
model {
  beta[1] ~ normal(0, 10);
  beta[2] ~ normal(0, 1);
  target += reduce_sum(partial_sum, n_redcards, grainsize,
                       n_games, rating, beta);
}
"""

import pandas as pd

data = pd.read_csv("./RedcardData.csv")
data = data.dropna(subset=["rater1"])

stan_data = {
    "N": data.shape[0],
    "n_redcards": data["redCards"].values,
    "n_games": data["games"].values,
    "rating": data["rater1"].values,
    "grainsize": 1,
}

posterior = stan.build(model_code, data=stan_data)

import os

# tell the runtime how many threads reduce_sum may use
os.environ["STAN_NUM_THREADS"] = "4"

# one chain to make sure threading works
fit = posterior.sample(num_chains=1, num_samples=1000)
Just submitted a PR to fix this in CmdStanPy: https://github.com/stan-dev/cmdstanpy/pull/258
Logic: the sample method takes arguments chains, parallel_chains, and threads_per_chain.
A single chain will always run the requested number of threads, allowing the user to oversubscribe CPUs; but if parallel_chains * threads_per_chain > available CPUs, the number of parallel chains will be decremented until this condition is met, down to only running 1 chain at a time.
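That adjustment can be sketched roughly as follows (a hypothetical helper, not CmdStanPy's actual code; the cpus argument stands in for the detected CPU count):

```python
def adjust_parallel_chains(chains, parallel_chains, threads_per_chain, cpus):
    """Sketch of the PR's proposed logic: shrink parallel_chains until the
    total thread request fits the available CPUs, bottoming out at running
    one chain at a time."""
    # Running more simultaneous chains than there are chains makes no sense.
    parallel_chains = min(parallel_chains, chains)
    while parallel_chains > 1 and parallel_chains * threads_per_chain > cpus:
        parallel_chains -= 1
    return parallel_chains
```

On a 32-core workstation, adjust_parallel_chains(6, 6, 5, 32) keeps all 6 parallel chains; the same request on a 4-CPU laptop drops to a single chain at a time.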
This follows CmdStanR - there was a lot of discussion and work done to get the names and logic correct - I think that this PR follows the logic - @rok_cesnovar, @wds15, @bbbales2 - is this correct?
The logic in cmdstanr is that parallel_chains determines how many chains we run simultaneously.
The only edge case is if parallel_chains > chains, in which case we just set parallel_chains = chains, because running more parallel chains than chains makes no sense.
The actual number of cores used at any time is parallel_chains * threads_per_chain. If the user wishes to oversubscribe, we let them do it.
The logic before was that we defined chains and cores and then calculated parallel_chains = floor(cores / chains). But we got rid of the cores arg.
In my early tests, oversubscribing did not cause problems, but it seemed to slow computation, at least on the models I tried. I assume the reason is that the TBB scheduler gets a little behind the threading requests, or something like that, but I haven't dug deeply into how TBB works under the hood.
I think it's all about CPU cache. The cache is already full without oversubscription in most cases.
I've worked through the long discussions on Discourse (Help with naming threading argument), specifically Aki's comment (Help with naming threading argument - #10 by avehtari), which presumes a machine with 8 available CPUs. I'm not sure what cores means - maybe there's a difference between R and Python? CmdStanPy uses the concurrent.futures module, which handles the calls to the subprocess module.
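A minimal sketch of that pattern (a placeholder command, not CmdStanPy's actual runner): concurrent.futures caps how many chain subprocesses are alive at once.

```python
import concurrent.futures
import subprocess
import sys

def run_chain(chain_id):
    # Placeholder subprocess; a real run would invoke the compiled model binary.
    proc = subprocess.run(
        [sys.executable, "-c", f"print('chain {chain_id} done')"],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout.strip()

def run_chains(chains, parallel_chains):
    # At most parallel_chains subprocesses run simultaneously; each worker
    # thread just blocks on its subprocess, so threads (not processes) suffice.
    with concurrent.futures.ThreadPoolExecutor(max_workers=parallel_chains) as pool:
        return list(pool.map(run_chain, range(1, chains + 1)))
```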
The sample method used to have args chains and cores - we've renamed cores to parallel_chains. The old logic was that if parallel_chains is greater than the number of CPUs, set parallel_chains to the number of CPUs. The question is whether or not to take threads_per_chain into account as well.
From the discussions here and the CmdStanR GitHub issue, I'm a little unclear on what CmdStanR does - was the final decision that if the user specifies both parallel_chains and threads_per_chain, then that many chains are run in parallel, no matter the number of available CPUs?
This is somewhat dangerous, because code takes on a life of its own - if someone codes up an analysis and runs it on their 32-core workstation with chains=6, parallel_chains=6, threads_per_chain=5, all is fine. But if someone else has a laptop with an Intel dual-core processor, which really only has 2 CPUs (Intel's hyperthreading reports 4), and runs the script as written, what is the best thing to do?
The PR I've submitted tries to adjust down the number of parallel_chains. I'll use the reduce_sum case study and do some timing experiments on the various machines I have access to.
There is more discussion here: https://github.com/stan-dev/cmdstanr/pull/185
Starting with Improve threading interface by rok-cesnovar · Pull Request #185 · stan-dev/cmdstanr · GitHub, but the last name change was done after Sebastian's comment there.
Cores in rstan and cores in cmdstanr before the refactor both meant parallel_chains.
Yes.
That is true - that code will not be performant on the 2-core processor. The user can either decrease threads_per_chain or parallel_chains. What happens with parallel_chains = 1, threads_per_chain = 10 on a 2-core CPU? Do we not allow it? Do we override threads_per_chain (I don't like that)?
The better solution for this example would be to define threads_per_chain and parallel_chains based on the available cores - if the example authors care about the performance of the code on both 30-core and 2-core CPUs.
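One way an example script could do that (a hypothetical helper; note that os.cpu_count() reports logical CPUs, so hyperthreading inflates the count):

```python
import os

def portable_threading(requested_parallel, requested_threads, cpus=None):
    """Hypothetical sketch: scale a script's threading request down to the
    machine it actually runs on, so an analysis written for a 32-core
    workstation degrades gracefully on a dual-core laptop."""
    if cpus is None:
        # os.cpu_count() counts *logical* CPUs (hyperthreads included).
        cpus = os.cpu_count() or 1
    parallel_chains = max(1, min(requested_parallel, cpus))
    # Give each chain an even share of the remaining CPUs, never exceeding
    # what was asked for and never dropping below one thread.
    threads_per_chain = min(requested_threads, max(1, cpus // parallel_chains))
    return parallel_chains, threads_per_chain
```

With cpus=32 a request of (6, 5) is honored as-is; with cpus=4 the same request collapses to 4 parallel chains of 1 thread each.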
The opposite case, which won't work if we impose a limit, is parallel_chains * threads_per_chain > number of actual CPU cores. And sometimes we want that. Granted, it's not your everyday workflow, but neither is going from 30 to 2 cores.
The other argument is: do we want the interface to be "smart" about this case? CmdStan would not mind 6 chains with STAN_NUM_THREADS=5 on a two-core CPU.
Many thanks - I've come around to your point of view - we'll do what the user asks. If they specify the number of chains they want to run in parallel, we'll make it so.
That is allowed. The code prints a warning via the logger - I will make sure the language isn't too prescriptive, because we're not in a position to prescribe.
The CmdStanPy fix for specifying threads_per_chain to the sample method is now available on PyPI: https://pypi.org/project/cmdstanpy/
Does it mean there is no more need to define the following?

import os
os.environ["STAN_NUM_THREADS"] = "num_threads"
Yes, and there's doc too! https://cmdstanpy.readthedocs.io/en/latest/sample.html#example-high-level-parallelization-with-reduce-sum
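For reference, usage then looks roughly like this (a sketch, not copied from the docs; the Stan file name is a placeholder, and stan_data is the dict built earlier - the model must be compiled with STAN_THREADS for reduce_sum to thread):

```python
from cmdstanpy import CmdStanModel

# Placeholder file name; compile with threading support enabled.
model = CmdStanModel(stan_file="redcard_reduce_sum.stan",
                     cpp_options={"STAN_THREADS": True})

# No STAN_NUM_THREADS env var needed: pass threads_per_chain directly.
fit = model.sample(data=stan_data, chains=4,
                   parallel_chains=2, threads_per_chain=2)
```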
Thank you!