I run cmdstan with multithreads, but when I examine htop how much each CPU is using, I see that only 4 or 8 are occupied (see attached image). I think that it is not really runing multi-thread. Is there anything else to do other than requiring it in the running command? (see below)

Yes sure! It is a bit long, but I’ll transcript it literal for the purposes of avoiding missing something. I think that this is what you want. Thank you! Ezequiel

data {
int<lower=1> m; // steps in the discretization
int<lower=1> N; // data points
array[N] int<lower=1, upper=m> score1; // b-tagging score for jet#1
array[N] int<lower=1, upper=m> score2; // b-tagging score for jet#2
array[N] int<lower=1, upper=m> score3; // b-tagging score for jet#3
array[N] int<lower=1, upper=m> score4; // b-tagging score for jet#4
vector[m-1] muj; // central value of prior b-tagging distribution for j-jets
vector[m-1] mub; // central value of prior b-tagging distribution for b-jets
real<lower=0> permutation_factor;
real<lower=0> mu_sigma;
real<lower=0> sigma_sigma;
real<lower=0> mu_correlation;
real<lower=0> sigma_correlation;
}
parameters {
ordered[2] y5; // this parameter will reinforce correct labelling, avoiding label switch. Because the 5th bin class0 i s greater than class1 always
vector[m-2] yj_remain; // posterior samples of b-tagging distribution for j-jets
vector[m-2] yb_remain; // posterior samples of b-tagging distribution for b-jets
simplex[3] theta; // misture coefficient for the 3 classes: bbbb, bbjj (in any order) & jjjj
real<lower=0> sigma; // Covariance matrix parameter
real<lower=0> correlation; // Covariance matrix parameter
}
transformed parameters {
vector[m-1] yj;
vector[m-1] yb;
yb[1:4]= yb_remain[1:4];
yb[5] = y5[1];
yb[6:m-1] = yb_remain[5:m-2];
yj[1:4]= yj_remain[1:4];
yj[5] = y5[2];
yj[6:m-1] = yj_remain[5:m-2];
}
model {
sigma ~ normal(mu_sigma, sigma_sigma);
correlation ~ normal(mu_correlation, sigma_correlation);
theta ~ dirichlet([1,1,1]);
vector[3] lp;
vector[6] lp2;
matrix[m-1, m-1] K;
for (i in 1:m-1)
for (j in 1:m-1)
K[i, j] = sigma * exp( - pow((abs(i-j)/(correlation)),2));
yj ~ multi_normal(muj, K);
yb ~ multi_normal(mub, K);
for (Ni in 1:N)
{
lp2[1] = log(permutation_factor) + log_softmax(yj)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yb)[score4[Ni]];
lp2[2] = log(permutation_factor) + log_softmax(yj)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yb)[score4[Ni]];
lp2[3] = log(permutation_factor) + log_softmax(yj)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yj)[score4[Ni]];
lp2[4] = log(permutation_factor) + log_softmax(yb)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yb)[score4[Ni]];
lp2[5] = log(permutation_factor) + log_softmax(yb)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yj)[score4[Ni]];
lp2[6] = log(permutation_factor) + log_softmax(yb)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yj)[score4[Ni]];
lp[1] = log_softmax(yj)[score1[Ni]] + log_softmax(yj)[score2[Ni]] + log_softmax(yj)[score3[Ni]] + log_softmax(yj)[score4[Ni]];
lp[2] = log_sum_exp(lp2);
lp[3] = log_softmax(yb)[score1[Ni]] + log_softmax(yb)[score2[Ni]] + log_softmax(yb)[score3[Ni]] + log_softmax(yb)[score4[Ni]];
target += log_mix(theta, lp);
}
}

Because you are not using any within-chain parallelization, I would not expect the process to use more threads than there are chains. Since you’re requesting only 4 chains, only 4 out of the 20 threads you are requesting have anything to do.

I think that you are correct! The link you’ve sent me goes to general parallelization, I’m wondering which would be within-chain-parallelization. Is that there’s no other way that working with the Reduce-sum and Map-rect options? Or is there something maybe not to efficient, but more straightfoward? (I mean an option that simply says that try to parallelize within chains…)

It is trivial to parallelize any Stan model to use separate cores to run each chain. It is nontrivial to parallelize Stan models to use multiple cores for each chain (within-chain parallelization), which is the only way to see further speedups across more cores than the number of chains you are running.

Within-chain parallelization requires writing a Stan program that does some of its computation in parallel. The Stan functions to do that are reduce_sum and map_rect. The former is more straightforward to use.

Thank you for the comment and advice. In fact, I’ve been reading and I see some difficulty, but OK, not impossible.

I also notice that a mixture-model in particular has the extra-effort that does not match at all the Bernoulli-logit example in the docs. I’ve found this example: