Cmdstan multi-thread: is it working?

I run cmdstan with multithreads, but when I examine htop how much each CPU is using, I see that only 4 or 8 are occupied (see attached image). I think that it is not really runing multi-thread. Is there anything else to do other than requiring it in the running command? (see below)

models/gp/gp1 sample num_samples=$nsamples num_warmup=$nwarmups num_chains=$nchains adapt delta=$adapt algorithm=hmc engine=nuts max_depth=$max_depth data file=models/gp/$name.json num_threads=$nthreads 
output file=models/notebooks/results/"$name"-results.csv diagnostic_file=models/notebooks/results/"$name"-diagnostic.csv random seed=$seed
  • Operating System: Ubuntu 20
  • CmdStan Version: v2.33.1 (13 September 2023)

Thank you very much for this and for outstanding job!! Ezequiel.

Hi, could you share an example of your Stan code?

1 Like

Yes sure! It is a bit long, but I’ll transcript it literal for the purposes of avoiding missing something. I think that this is what you want. Thank you! Ezequiel

data {
  int<lower=1> m;  // steps in the discretization
  int<lower=1> N;  // data points
  array[N] int<lower=1, upper=m> score1;  // b-tagging score for jet#1
  array[N] int<lower=1, upper=m> score2;  // b-tagging score for jet#2
  array[N] int<lower=1, upper=m> score3;  // b-tagging score for jet#3
  array[N] int<lower=1, upper=m> score4;  // b-tagging score for jet#4
  vector[m-1] muj;                  // central value of prior b-tagging distribution for j-jets
  vector[m-1] mub;                  // central value of prior b-tagging distribution for b-jets
  real<lower=0> permutation_factor;
  real<lower=0> mu_sigma;
  real<lower=0> sigma_sigma;
  real<lower=0> mu_correlation;
  real<lower=0> sigma_correlation;

parameters {
  ordered[2] y5;   // this parameter will reinforce correct labelling, avoiding label switch. Because the 5th bin class0 i s greater than class1 always
  vector[m-2] yj_remain;   // posterior samples of b-tagging distribution for j-jets
  vector[m-2] yb_remain;   // posterior samples of b-tagging distribution for b-jets
  simplex[3] theta; // misture coefficient for the 3 classes: bbbb, bbjj (in any order) & jjjj
  real<lower=0> sigma;              // Covariance matrix parameter
  real<lower=0> correlation;        // Covariance matrix parameter

transformed parameters {
  vector[m-1]  yj;
  vector[m-1]  yb;

  yb[1:4]= yb_remain[1:4];
  yb[5] = y5[1];
  yb[6:m-1] = yb_remain[5:m-2];

  yj[1:4]= yj_remain[1:4];
  yj[5] = y5[2];
  yj[6:m-1] = yj_remain[5:m-2];


model {
  sigma ~ normal(mu_sigma, sigma_sigma);
  correlation ~ normal(mu_correlation, sigma_correlation);
  theta ~ dirichlet([1,1,1]);
  vector[3] lp;
  vector[6] lp2;
  matrix[m-1, m-1] K;
  for (i in 1:m-1)
     for (j in 1:m-1)
       K[i, j] = sigma * exp( - pow((abs(i-j)/(correlation)),2));
  yj ~ multi_normal(muj, K);
  yb ~ multi_normal(mub, K);
  for (Ni in 1:N)
     lp2[1] = log(permutation_factor) + log_softmax(yj)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yb)[score4[Ni]];
     lp2[2] = log(permutation_factor) + log_softmax(yj)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yb)[score4[Ni]];
     lp2[3] = log(permutation_factor) + log_softmax(yj)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yj)[score4[Ni]];
     lp2[4] = log(permutation_factor) + log_softmax(yb)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yb)[score4[Ni]];
     lp2[5] = log(permutation_factor) + log_softmax(yb)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yj)[score4[Ni]];
     lp2[6] = log(permutation_factor) + log_softmax(yb)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yj)[score4[Ni]];

     lp[1] = log_softmax(yj)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yj)[score4[Ni]];
     lp[2] = log_sum_exp(lp2);
     lp[3] = log_softmax(yb)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yb)[score4[Ni]];

     target += log_mix(theta, lp);

Edited by @jsocolar for syntax highlighting

Because you are not using any within-chain parallelization, I would not expect the process to use more threads than there are chains. Since you’re requesting only 4 chains, only 4 out of the 20 threads you are requesting have anything to do.

1 Like

Sorry, let me add a few things that could help:

  • It is a mixture model in 4-dimensions
  • There are 2 unknown distributions (b and j) which are arbitrary, but continuous, and this is why we use multinormal
  • Each one of the 4 dimensions can come from any of these 2 distributions
  • There are 3 possible classes that state how the 4 dimensions are sampled: jjjj, bbbb, or jjbb (in any order, this is why the permutation)

Hopefully I helped with this clarification.

Thanks, Ezequiel.

Hi Brian,

I think that you are correct! The link you’ve sent me goes to general parallelization, I’m wondering which would be within-chain-parallelization. Is that there’s no other way that working with the Reduce-sum and Map-rect options? Or is there something maybe not to efficient, but more straightfoward? (I mean an option that simply says that try to parallelize within chains…)

Thank you, Ezequiel.

It is trivial to parallelize any Stan model to use separate cores to run each chain. It is nontrivial to parallelize Stan models to use multiple cores for each chain (within-chain parallelization), which is the only way to see further speedups across more cores than the number of chains you are running.

Within-chain parallelization requires writing a Stan program that does some of its computation in parallel. The Stan functions to do that are reduce_sum and map_rect. The former is more straightforward to use.

Thank you for the comment and advice. In fact, I’ve been reading and I see some difficulty, but OK, not impossible.

I also notice that a mixture-model in particular has the extra-effort that does not match at all the Bernoulli-logit example in the docs. I’ve found this example:

I’ll follow it and see what happens.

Best, thanks, Ezequiel.

I’ve made it to work with reduce_sum, all threads working hard and simultaneously now! I love STAN!!

It is a 4-dimensional mixture model of 3 classes!

Thanks guys all who helped me, great community!

1 Like