Hi Stan Community,
I’m trying to use reduce_sum in my Stan model to parallelize computations across multiple threads. The model runs without errors, but Activity Monitor on my Mac shows each chain using only 1 thread, even though I set threads_per_chain to 4 and STAN_NUM_THREADS=4. I’m using a MacBook Pro 2021 (M1 Pro, 16 GB RAM, 8 CPU cores) running macOS Sequoia 15.2, with CmdStan 2.32.1 and CmdStanPy 1.1.0.
Here is the setup and code I’m using:
- bernoulli_reduce_sum.stan
functions {
  // Log-likelihood of one slice of y; start and end are the 1-based
  // indices of the slice within the full array (unused here).
  real partial_sum(array[] int y_slice, int start, int end, real theta) {
    return bernoulli_lpmf(y_slice | theta);
  }
}
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
}
transformed data {
  int grainsize = 1;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(1, 1); // uniform prior on interval 0,1
  target += reduce_sum(partial_sum, y, grainsize, theta);
}
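To spell out what I expect reduce_sum to do with this model, here is a minimal Python sketch of the slicing idea. It is only an illustration under assumptions: `reduce_sum_sketch` and the Python `bernoulli_lpmf` are hypothetical stand-ins I wrote for this post, the loop is sequential, and it uses fixed-size chunks, whereas Stan's scheduler decides slice sizes at run time and evaluates the slices on different threads.

```python
import math


def bernoulli_lpmf(y_slice, theta):
    # Log-probability of a list of 0/1 outcomes given success probability theta.
    return sum(math.log(theta) if y == 1 else math.log1p(-theta) for y in y_slice)


def partial_sum(y_slice, start, end, theta):
    # Mirrors the Stan function: start and end are the 1-based bounds of the slice.
    return bernoulli_lpmf(y_slice, theta)


def reduce_sum_sketch(partial, y, grainsize, theta):
    # Hypothetical sequential stand-in: split y into chunks of at most
    # `grainsize` elements and add up the partial log-likelihoods.
    total = 0.0
    for lo in range(0, len(y), grainsize):
        hi = min(lo + grainsize, len(y))
        total += partial(y[lo:hi], lo + 1, hi, theta)
    return total


y = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
sliced = reduce_sum_sketch(partial_sum, y, 3, 0.2)
full = bernoulli_lpmf(y, 0.2)
```

The sliced total matches the log-likelihood computed in one pass (up to floating-point rounding), so the slicing itself should not change the result, only how the work is distributed.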
- Python Code
import os

from cmdstanpy import CmdStanModel

# Set the number of threads
os.environ['STAN_NUM_THREADS'] = '4'  # 4 threads per chain

# Paths to Stan model and data
stan_file = 'models/examples/bernoulli/bernoulli_reduce_sum.stan'
data_file = 'models/examples/bernoulli/bernoulli.data.json'

# Compile the model
model = CmdStanModel(stan_file=stan_file)
# I also tried compiling with STAN_THREADS=True, but then only one CPU core was used, with many threads.
# model = CmdStanModel(stan_file=stan_file, cpp_options={"STAN_THREADS": True}, compile="force")

# Run sampling with multi-threading
fit = model.sample(
    data=data_file,
    chains=4,             # Number of parallel chains
    parallel_chains=4,    # Each chain runs in its own process
    threads_per_chain=4,  # Number of threads per chain
)
The problem is that Activity Monitor shows only 4 cores in use. When I compile with STAN_THREADS=True, it uses only 1 core with 8 threads, and the run is slower than the original one. What I expected was all 8 cores in use, running 8 chains with 4 threads each. Is that actually possible?
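For reference, here is the arithmetic behind my question as a small Python sketch. `thread_budget` is a hypothetical helper I made up for this post, not part of CmdStanPy; it only compares the number of threads my settings request with the number of cores, and assumes every chain gets an equal share.

```python
def thread_budget(parallel_chains, threads_per_chain, cores):
    # Total threads requested if every chain runs at once with its full allotment.
    requested = parallel_chains * threads_per_chain
    return {
        "requested_threads": requested,
        "cores": cores,
        # More requested threads than cores means threads must share cores.
        "oversubscribed": requested > cores,
        # Largest per-chain thread count that still fits one thread per core.
        "max_threads_per_chain": max(1, cores // parallel_chains),
    }


# My settings: 4 parallel chains, 4 threads per chain, 8 cores.
budget = thread_budget(4, 4, 8)
print(budget)
```

With my settings this asks for 16 threads on 8 cores, so if I understand correctly, at most 2 threads per chain could each have a core to themselves.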
Thanks a lot!