Cmdstan multi-thread: is it working?

Ezequiel_Alvarez · November 7, 2023, 6:48pm

I run CmdStan with multiple threads, but when I check in htop how much each CPU is being used, I see that only 4 or 8 are busy (see attached image). I don’t think it is really running multi-threaded. Is there anything else I need to do besides requesting it in the run command (see below)?

#!/usr/bin/bash
name="07-11-2023.13.10hs"
nsamples=2000
nwarmups=2000
nchains=4
nthreads=20
seed=0
adapt=0.87
max_depth=15
models/gp/gp1 sample num_samples=$nsamples num_warmup=$nwarmups num_chains=$nchains adapt delta=$adapt algorithm=hmc engine=nuts max_depth=$max_depth data file=models/gp/$name.json num_threads=$nthreads \
  output file=models/notebooks/results/"$name"-results.csv diagnostic_file=models/notebooks/results/"$name"-diagnostic.csv random seed=$seed

Operating System: Ubuntu 20
CmdStan Version: v2.33.1 (13 September 2023)

Thank you very much for this and for the outstanding job!! Ezequiel.

ahartikainen · November 7, 2023, 6:51pm

Hi, could you share an example of your Stan code?

Ezequiel_Alvarez · November 7, 2023, 8:16pm

Yes, sure! It is a bit long, but I’ll paste it verbatim so that nothing is left out. I think this is what you want. Thank you! Ezequiel

data {
  int<lower=1> m;  // steps in the discretization
  int<lower=1> N;  // data points
  array[N] int<lower=1, upper=m> score1;  // b-tagging score for jet#1
  array[N] int<lower=1, upper=m> score2;  // b-tagging score for jet#2
  array[N] int<lower=1, upper=m> score3;  // b-tagging score for jet#3
  array[N] int<lower=1, upper=m> score4;  // b-tagging score for jet#4
  vector[m-1] muj;                  // central value of prior b-tagging distribution for j-jets
  vector[m-1] mub;                  // central value of prior b-tagging distribution for b-jets
  real<lower=0> permutation_factor;
  real<lower=0> mu_sigma;
  real<lower=0> sigma_sigma;
  real<lower=0> mu_correlation;
  real<lower=0> sigma_correlation;
}

parameters {
  ordered[2] y5;   // this parameter enforces correct labelling and avoids label switching, because the 5th bin of class0 is always greater than that of class1
  vector[m-2] yj_remain;   // posterior samples of b-tagging distribution for j-jets
  vector[m-2] yb_remain;   // posterior samples of b-tagging distribution for b-jets
  simplex[3] theta; // mixture coefficients for the 3 classes: bbbb, bbjj (in any order) & jjjj
  real<lower=0> sigma;              // Covariance matrix parameter
  real<lower=0> correlation;        // Covariance matrix parameter
}

transformed parameters {
  vector[m-1]  yj;
  vector[m-1]  yb;

  yb[1:4]= yb_remain[1:4];
  yb[5] = y5[1];
  yb[6:m-1] = yb_remain[5:m-2];

  yj[1:4]= yj_remain[1:4];
  yj[5] = y5[2];
  yj[6:m-1] = yj_remain[5:m-2];

}

model {
  sigma ~ normal(mu_sigma, sigma_sigma);
  correlation ~ normal(mu_correlation, sigma_correlation);
  theta ~ dirichlet([1,1,1]);
  vector[3] lp;
  vector[6] lp2;
  matrix[m-1, m-1] K;
  for (i in 1:m-1)
     for (j in 1:m-1)
       K[i, j] = sigma * exp( - pow((abs(i-j)/(correlation)),2));
  yj ~ multi_normal(muj, K);
  yb ~ multi_normal(mub, K);
  for (Ni in 1:N)
     {
     lp2[1] = log(permutation_factor) + log_softmax(yj)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yb)[score4[Ni]];
     lp2[2] = log(permutation_factor) + log_softmax(yj)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yb)[score4[Ni]];
     lp2[3] = log(permutation_factor) + log_softmax(yj)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yj)[score4[Ni]];
     lp2[4] = log(permutation_factor) + log_softmax(yb)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yb)[score4[Ni]];
     lp2[5] = log(permutation_factor) + log_softmax(yb)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yj)[score4[Ni]];
     lp2[6] = log(permutation_factor) + log_softmax(yb)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yj)[score4[Ni]];

     lp[1] = log_softmax(yj)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yj)[score4[Ni]];
     lp[2] = log_sum_exp(lp2);
     lp[3] = log_softmax(yb)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yb)[score4[Ni]];

     target += log_mix(theta, lp);
     }
}

Edited by @jsocolar for syntax highlighting

WardBrian · November 7, 2023, 8:47pm

Because you are not using any within-chain parallelization, I would not expect the process to use more threads than there are chains. Since you’re requesting only 4 chains, only 4 out of the 20 threads you are requesting have anything to do.

Ezequiel_Alvarez · November 7, 2023, 8:49pm

Sorry, let me add a few things that could help:

It is a mixture model in 4 dimensions.
There are 2 unknown distributions (b and j), which are arbitrary but continuous; this is why we use multi_normal.
Each of the 4 dimensions can come from either of these 2 distributions.
There are 3 possible classes that determine how the 4 dimensions are sampled: jjjj, bbbb, or jjbb (in any order, hence the permutation factor).
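
Read together with the model block posted earlier in the thread, the list above corresponds to the per-event likelihood below. It is a transcription of the posted code, not an independent derivation. The notation is introduced here only for illustration: $p_{\mathrm{j}} = \mathrm{softmax}(y_j)$ and $p_{\mathrm{b}} = \mathrm{softmax}(y_b)$ are the two score distributions, $c$ is `permutation_factor`, and $A$ runs over the six ways of choosing which two of the four jets are j-jets. The indexing of $\theta$ follows the order used in the code (`lp[1]` is jjjj, `lp[2]` is the mixed class, `lp[3]` is bbbb).

$$
p(s_1,\dots,s_4) \;=\; \theta_1 \prod_{k=1}^{4} p_{\mathrm{j}}(s_k)
\;+\; \theta_2 \, c \sum_{\substack{A \subset \{1,2,3,4\} \\ |A| = 2}} \; \prod_{k \in A} p_{\mathrm{j}}(s_k) \prod_{k \notin A} p_{\mathrm{b}}(s_k)
\;+\; \theta_3 \prod_{k=1}^{4} p_{\mathrm{b}}(s_k)
$$

The model adds $\log p(s_1,\dots,s_4)$ to `target` once per event, so the log-likelihood is a plain sum over the $N$ events.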

Hopefully I helped with this clarification.

Thanks, Ezequiel.

Ezequiel_Alvarez · November 7, 2023, 9:14pm

Hi Brian,

I think you are correct! The link you sent me covers parallelization in general, and I’m wondering which part is within-chain parallelization. Is there no other way than working with the reduce_sum and map_rect options? Or is there something perhaps not as efficient, but more straightforward? (I mean an option that simply says to try to parallelize within chains…)

Thank you, Ezequiel.

jsocolar · November 8, 2023, 3:11am

It is trivial to parallelize any Stan model to use separate cores to run each chain. It is nontrivial to parallelize Stan models to use multiple cores for each chain (within-chain parallelization), which is the only way to see further speedups across more cores than the number of chains you are running.

Within-chain parallelization requires writing a Stan program that does some of its computation in parallel. The Stan functions to do that are reduce_sum and map_rect. The former is more straightforward to use.
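
For readers who land here with the same question, below is a rough sketch of what the reduce_sum route could look like for the model posted above. It is not the code that was eventually used in this thread, and it has not been run against the original data. It makes several assumptions: the function name `partial_sum` and the hard-coded `grainsize` are made up for illustration, `log(permutation_factor)` is added once after `log_sum_exp` instead of inside each of the six terms (algebraically the same), `log_softmax` is evaluated once per call instead of per event, and `rep_vector(1, 3)` stands in for `[1,1,1]`. Everything else is copied from the posted model.

```stan
functions {
  // Log-likelihood of events start..end. reduce_sum calls this once per
  // slice of score1 and adds up the results.
  real partial_sum(array[] int score1_slice, int start, int end,
                   array[] int score2, array[] int score3,
                   array[] int score4, vector log_pj, vector log_pb,
                   vector theta, real log_perm) {
    real total = 0;
    for (i in 1:(end - start + 1)) {
      int n = start + i - 1; // position in the unsliced arrays
      real j1 = log_pj[score1_slice[i]];
      real b1 = log_pb[score1_slice[i]];
      real j2 = log_pj[score2[n]];
      real b2 = log_pb[score2[n]];
      real j3 = log_pj[score3[n]];
      real b3 = log_pb[score3[n]];
      real j4 = log_pj[score4[n]];
      real b4 = log_pb[score4[n]];
      // the six orderings with two j-jets and two b-jets
      vector[6] lp2 = [j1 + j2 + b3 + b4, j1 + b2 + j3 + b4,
                       j1 + b2 + b3 + j4, b1 + j2 + j3 + b4,
                       b1 + j2 + b3 + j4, b1 + b2 + j3 + j4]';
      // classes in the order of the original model: jjjj, mixed, bbbb
      vector[3] lp = [j1 + j2 + j3 + j4,
                      log_perm + log_sum_exp(lp2),
                      b1 + b2 + b3 + b4]';
      total += log_mix(theta, lp);
    }
    return total;
  }
}
data {
  int<lower=1> m;
  int<lower=1> N;
  array[N] int<lower=1, upper=m> score1;
  array[N] int<lower=1, upper=m> score2;
  array[N] int<lower=1, upper=m> score3;
  array[N] int<lower=1, upper=m> score4;
  vector[m - 1] muj;
  vector[m - 1] mub;
  real<lower=0> permutation_factor;
  real<lower=0> mu_sigma;
  real<lower=0> sigma_sigma;
  real<lower=0> mu_correlation;
  real<lower=0> sigma_correlation;
}
transformed data {
  int grainsize = 1; // ASSUMPTION: 1 leaves the slice size to the scheduler
}
parameters {
  ordered[2] y5;
  vector[m - 2] yj_remain;
  vector[m - 2] yb_remain;
  simplex[3] theta;
  real<lower=0> sigma;
  real<lower=0> correlation;
}
transformed parameters {
  vector[m - 1] yj;
  vector[m - 1] yb;
  yb[1:4] = yb_remain[1:4];
  yb[5] = y5[1];
  yb[6:(m - 1)] = yb_remain[5:(m - 2)];
  yj[1:4] = yj_remain[1:4];
  yj[5] = y5[2];
  yj[6:(m - 1)] = yj_remain[5:(m - 2)];
}
model {
  matrix[m - 1, m - 1] K;
  for (i in 1:(m - 1)) {
    for (j in 1:(m - 1)) {
      K[i, j] = sigma * exp(-square(abs(i - j) / correlation));
    }
  }
  sigma ~ normal(mu_sigma, sigma_sigma);
  correlation ~ normal(mu_correlation, sigma_correlation);
  theta ~ dirichlet(rep_vector(1, 3));
  yj ~ multi_normal(muj, K);
  yb ~ multi_normal(mub, K);
  // the per-event loop of the original model, split across threads
  target += reduce_sum(partial_sum, score1, grainsize,
                       score2, score3, score4,
                       log_softmax(yj), log_softmax(yb), theta,
                       log(permutation_factor));
}
```

The only part that runs in parallel is the sum over events; the priors and the covariance matrix are still computed once per chain. For the extra threads to be used at all, the model also has to be compiled with `STAN_THREADS=true` (for example set in CmdStan’s `make/local`). How much speedup this gives depends on N and on the grainsize, so it is worth timing a few settings.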

Ezequiel_Alvarez · November 8, 2023, 4:44pm

Thank you for the comment and advice. I’ve been reading up on it; it looks somewhat difficult, but not impossible.

I also notice that a mixture model in particular takes extra effort, because it does not match the Bernoulli-logit example in the docs at all. I’ve found this example:

https://gist.github.com/ito4303/33bf2d192d121e257e25f97e6d48df73#file-test_code-stan

n-mixture_reduce_sum.R

# The simplest possible N-mixture model from Section 6.3 of
# Applied Hierarchical Modeling in Ecology.

library(cmdstanr)
set_cmdstan_path("/usr/local/cmdstan")

# generate simulated data

lambda <- 2.5 # mean abundance
p <- 0.4      # detection probability

This file has been truncated; see the gist for the full version.

test_code.stan

functions {
  /**
   * Return log probability of N-mixture model for a site
   * 
   * @param count    Count in a site
   * @param max_n    Maximum abundance
   * @param lambda   Mean abundance
   * @param p        Detection probability
   *
   * @return         Log probability

This file has been truncated; see the gist for the full version.

I’ll follow it and see what happens.

Best, thanks, Ezequiel.

Ezequiel_Alvarez · November 10, 2023, 2:24pm

I got it to work with reduce_sum; all threads are now working hard simultaneously! I love Stan!!

It is a 4-dimensional mixture model of 3 classes!

Thanks to everyone who helped me, great community!
