Help with reduce_sum

Yes, these work (True, None).

So now the problem is on macOS?

Tested the same model with PyStan3 and after fixing some small bugs (large data error) threading works as it should.

import stan

model_code = """
functions {
  real partial_sum(int[] slice_n_redcards,
                   int start, int end,
                   int[] n_games,
                   vector rating,
                   vector beta) {
    return binomial_logit_lpmf(slice_n_redcards |
                               beta[1] + beta[2] * rating[start:end]);
data {
  int<lower=0> N;
  int<lower=0> n_redcards[N];
  int<lower=0> n_games[N];
  vector[N] rating;
  int<lower=1> grainsize;
parameters {
  vector[2] beta;
model {

  beta[1] ~ normal(0, 10);
  beta[2] ~ normal(0, 1);

  target += reduce_sum(partial_sum, n_redcards, grainsize,
                       n_games, rating, beta);
import pandas as pd
data = pd.read_csv("./RedcardData.csv")
data = data.dropna(subset=["rater1"])

stan_data = {
    "N" : data.shape[0],
    "n_redcards" : data["redCards"].values,
    "n_games" : data["games"].values,
    "rating" : data["rater1"].values,
    "grainsize" : 1,

posterior =, data=stan_data)

import os
os.environ["STAN_NUM_THREADS"] = "4"

# one chain to make sure threading works
fit = posterior.sample(num_chains=1, num_samples=1000)

just submitted a PR to fix this in CmdStanPy -

Logic: sample method takes arguments chains , parallel_chains , threads_per_chains .

A single chain will always run number of requested threads - allowing user to oversubscribe CPUs,
but if parallel_chains * threads_per_chains > CPUs available, number of parallel chains will
be decremented until this condition is met, until we’re only running 1 chain at a time.

This follows CmdStanR - there was a lot of discussion and work done to get the names and logic correct - I think that this PR follows the logic - @rok_cesnovar, @wds15, @bbbales2 - is this correct?

The logic in cmdstanr is that parallel_chains determines how many chains we run simultaneously.

The only edge case is if parallel_chains > chains in which case we just do

parallel_chains = chains

because parallel_chains > chains just makes no sense.

The actual cores used at each time is parallel_chains * threads_per_chains If the user wishes to oversubscribe, we let them do it.

The logic before was that we defined chains and cores and then calculated parallel_chains = floor(cores/chains). But we got rid of the cores arg.

In my early tests, oversubscribing did not cause problems, but seemed to slow computation, at least on the models I tried. I assume the reason is the tbb scheduler gets a little behind the threading requests, or something like that, but I haven’t dug deeply into how tbb works under the hood.

1 Like

I think it’s all about cpu cache. The cache is already full without oversubscription in most cases.

I’ve worked through the long discussions on Discourse - Help with naming threading argument - specifically Aki’s comment Help with naming threading argument which presume a machine with 8 available CPUs. I’m not sure what cores meand - maybe there’s a difference between R and Python? CmdStanPy uses a the concurrent.futures module which handles the calls to the subprocess module.

The sample method used to have args chains and cores - we’ve renamed cores to parallel_chains. The old logic was the if parallel_chains is greater than number of CPUs, set parallel_chains to number of CPUs. The question is whether or not to take threads_per_chain into account as well.

From the discussions here and the CmdStanR GitHub issue, I’m a little unclear what CmdStanR does - was the final decision that if user specifies both parallel_chains and threads_per_chain then that many chains are run in parallel, no matter what the number of available CPUs?

This is somewhat dangerous, because code takes on a life of its own - if someone codes up an analysis and runs it on their 32 core workstation with chains=6, parallel_chains=6, threads_per_chain=5, all is fine. If someone else with a laptop an Intel dual-core processor which really only has 2 CPUs (Intel’s hyperthreading reports that is has 4 CPUs), and runs the script as written, what is the best thing to do?

The PR I’ve submitted tries to adjust down the number of parallel_chains. I’ll use the reduce_sum case study and do some timing experiments on the various machines I have access to.


There is more discussion here:
Starting with but the last name change was done after Sebastian’s comment here

Cores in rstan and cores in cmdstanr before the refactor both meant parallel_chains.


That is true, that code will not be performant on the 2 core processor. The user can either decrease the threads_per_chain or parallel_chains. What happens with parallel_chains = 1, threads_per_chains = 10 on a 2 core CPU? Do we not allow it? Do we override threads_per_chain (I don’t like that)?

The better solution to this example would be that the example defines threads_per_chain and parallel_chain based on the available cores. If the example authors care about the performance of the code on 30 and 2 core CPUs.

The opposite case that won’t work if we impose a limit is if

parallel_chains*threads_per_chain > number of actual CPU cores

And sometimes we want that. Granted, not your everyday workflow, but neither is going from 30 to 2 cores.

The other argument is: do we want the interface to be “smart” about this case. Cmdstan would not mind 6 chains with STAN_NUM_THREADS=5 on a two core CPU.


many thanks - I’ve come around to your point of view - we’ll do what the user asks - if they specify the number of chains they want to run in parallel, we’ll make it so.

that is allowed. the code prints a warning via the logger - will make sure the language isn’t too prescriptive, because we’re not in a position to prescribe.

CmdStanPy fix for specifying threads_per_chain to sample method is now available on PyPi:


Does it mean there is no more need to define the following?

import os
os.environ["STAN_NUM_THREADS"] = "num_threads "

yes, and there’s doc too!

1 Like

Thank you!