Help with reduce_sum

ahartikainen · July 31, 2020, 6:36am

Yes, these work (True, None).

So now the problem is on macOS?

ahartikainen · July 31, 2020, 9:24am

Tested the same model with PyStan3 and after fixing some small bugs (large data error) threading works as it should.

import stan

model_code = """
functions {
  real partial_sum(int[] slice_n_redcards,
                   int start, int end,
                   int[] n_games,
                   vector rating,
                   vector beta) {
    return binomial_logit_lpmf(slice_n_redcards |
                               n_games[start:end],
                               beta[1] + beta[2] * rating[start:end]);
  }
}
data {
  int<lower=0> N;
  int<lower=0> n_redcards[N];
  int<lower=0> n_games[N];
  vector[N] rating;
  int<lower=1> grainsize;
}
parameters {
  vector[2] beta;
}
model {

  beta[1] ~ normal(0, 10);
  beta[2] ~ normal(0, 1);

  target += reduce_sum(partial_sum, n_redcards, grainsize,
                       n_games, rating, beta);
}
"""
import pandas as pd
data = pd.read_csv("./RedcardData.csv")
data = data.dropna(subset=["rater1"])

stan_data = {
    "N" : data.shape[0],
    "n_redcards" : data["redCards"].values,
    "n_games" : data["games"].values,
    "rating" : data["rater1"].values,
    "grainsize" : 1,
}

posterior = stan.build(model_code, data=stan_data)

import os
os.environ["STAN_NUM_THREADS"] = "4"

# one chain to make sure threading works
fit = posterior.sample(num_chains=1, num_samples=1000)

mitzimorris · July 31, 2020, 3:18pm

just submitted a PR to fix this in CmdStanPy - https://github.com/stan-dev/cmdstanpy/pull/258

Logic: sample method takes arguments chains , parallel_chains , threads_per_chains .

A single chain will always run number of requested threads - allowing user to oversubscribe CPUs,
but if parallel_chains * threads_per_chains > CPUs available, number of parallel chains will
be decremented until this condition is met, until we’re only running 1 chain at a time.

This follows CmdStanR - there was a lot of discussion and work done to get the names and logic correct - I think that this PR follows the logic - @rok_cesnovar, @wds15, @bbbales2 - is this correct?

rok_cesnovar · July 31, 2020, 3:54pm

The logic in cmdstanr is that parallel_chains determines how many chains we run simultaneously.

The only edge case is if parallel_chains > chains in which case we just do

parallel_chains = chains

because parallel_chains > chains just makes no sense.

The actual cores used at each time is parallel_chains * threads_per_chains If the user wishes to oversubscribe, we let them do it.

The logic before was that we defined chains and cores and then calculated parallel_chains = floor(cores/chains). But we got rid of the cores arg.

ssp3nc3r · July 31, 2020, 4:53pm

In my early tests, oversubscribing did not cause problems, but seemed to slow computation, at least on the models I tried. I assume the reason is the tbb scheduler gets a little behind the threading requests, or something like that, but I haven’t dug deeply into how tbb works under the hood.

wds15 · July 31, 2020, 5:16pm

I think it’s all about cpu cache. The cache is already full without oversubscription in most cases.

mitzimorris · July 31, 2020, 11:29pm

I’ve worked through the long discussions on Discourse - Help with naming threading argument - specifically Aki’s comment Help with naming threading argument - #10 by avehtari which presume a machine with 8 available CPUs. I’m not sure what cores meand - maybe there’s a difference between R and Python? CmdStanPy uses a the concurrent.futures module which handles the calls to the subprocess module.

The sample method used to have args chains and cores - we’ve renamed cores to parallel_chains. The old logic was the if parallel_chains is greater than number of CPUs, set parallel_chains to number of CPUs. The question is whether or not to take threads_per_chain into account as well.

From the discussions here and the CmdStanR GitHub issue, I’m a little unclear what CmdStanR does - was the final decision that if user specifies both parallel_chains and threads_per_chain then that many chains are run in parallel, no matter what the number of available CPUs?

This is somewhat dangerous, because code takes on a life of its own - if someone codes up an analysis and runs it on their 32 core workstation with chains=6, parallel_chains=6, threads_per_chain=5, all is fine. If someone else with a laptop an Intel dual-core processor which really only has 2 CPUs (Intel’s hyperthreading reports that is has 4 CPUs), and runs the script as written, what is the best thing to do?

The PR I’ve submitted tries to adjust down the number of parallel_chains. I’ll use the reduce_sum case study and do some timing experiments on the various machines I have access to.

rok_cesnovar · August 1, 2020, 11:26am

There is more discussion here: https://github.com/stan-dev/cmdstanr/pull/185
Starting with Improve threading interface by rok-cesnovar · Pull Request #185 · stan-dev/cmdstanr · GitHub but the last name change was done after Sebastian’s comment here

Cores in rstan and cores in cmdstanr before the refactor both meant parallel_chains.

Yes.

That is true, that code will not be performant on the 2 core processor. The user can either decrease the threads_per_chain or parallel_chains. What happens with parallel_chains = 1, threads_per_chains = 10 on a 2 core CPU? Do we not allow it? Do we override threads_per_chain (I don’t like that)?

The better solution to this example would be that the example defines threads_per_chain and parallel_chain based on the available cores. If the example authors care about the performance of the code on 30 and 2 core CPUs.

The opposite case that won’t work if we impose a limit is if

parallel_chains*threads_per_chain > number of actual CPU cores

And sometimes we want that. Granted, not your everyday workflow, but neither is going from 30 to 2 cores.

The other argument is: do we want the interface to be “smart” about this case. Cmdstan would not mind 6 chains with STAN_NUM_THREADS=5 on a two core CPU.

mitzimorris · August 1, 2020, 12:49pm

many thanks - I’ve come around to your point of view - we’ll do what the user asks - if they specify the number of chains they want to run in parallel, we’ll make it so.

that is allowed. the code prints a warning via the logger - will make sure the language isn’t too prescriptive, because we’re not in a position to prescribe.

mitzimorris · August 4, 2020, 3:40pm

CmdStanPy fix for specifying threads_per_chain to sample method is now available on PyPi: https://pypi.org/project/cmdstanpy/

nerpa · August 4, 2020, 3:43pm

Does it mean there is no more need to define the following?

import os
os.environ["STAN_NUM_THREADS"] = "num_threads "

mitzimorris · August 4, 2020, 3:56pm

yes, and there’s doc too! https://cmdstanpy.readthedocs.io/en/latest/sample.html#example-high-level-parallelization-with-reduce-sum

nerpa · August 4, 2020, 7:23pm

Thank you!

Topic		Replies	Views
Reduce_sum parallelisation issue Modeling cmdstanr , multivariate-normal	12	907	February 24, 2022
Model with reduce_sum takes too long Modeling	28	1281	December 28, 2020
Reduce_sum performance Modeling performance , paralellization	15	1218	September 28, 2020
Parallelization in cmdstanr: using reduce_sum() on a multinomial likelihood Modeling cmdstanr , multinomial-response	14	1070	January 9, 2024
Reduce_sum error Modeling	4	790	May 13, 2020

Help with reduce_sum

Related Topics