**tl;dr** I’m trying to use multiple threads when estimating a multi-logit regression and all my tries are slower than the Stan Users Guide implemented model, using 3 threads per chain and confirming that the parallelization is working.

Hello, all!

After reading some discussions about “reduce-sum” implementation, I understood that the slicing approach is effective in parameters vector/matrix and, in cases where it is not applicable, data slice is acceptable.

Based on the codes presented for a probit model and on the Minimal Example instructions, I was (or I think I was) able to write a code that runs 3 threads per chain but is slower than the one without parallelization.

As I’m forcing sum-to-zero constraint in my parameters, the updated multi-logi regression model without parallelization is

```
data {
int K; // number of categories
int N; // sample size
int D; // design matrix column dimension (intercept + D-1 covariates)
array[N] int y;
matrix[N, D] x;
}
parameters {
matrix [D, K-1] beta_raw; // Coefficients for each category K (except the reference)
}
transformed parameters {
matrix [D, K] beta;
for(d in 1:D){
beta[d] = append_col(-sum(beta_raw[d]), beta_raw[d]);
}
}
model {
matrix[N, K] eta= x * beta;
to_vector(beta_raw) ~ normal(0, 5);
for (n in 1:N) {
target += categorical_logit_lpmf(y[n] | eta[n]');
}
}
```

The fastest version I was able to create for a parallelized model calculate/construct linear predictor \eta matrix in the `stan partial_sum`

function, according to what is shown in this answer:

```
functions{
real partial_sum_lpmf(array[] int Y_slice,
int start, int end,
matrix beta,
matrix x,
int K){
real lp = 0;
int N = (end-start+1);
matrix[N, K] eta_slice = x[start:end] * beta;
for(i in 1:N)
lp += categorical_logit_lupmf(Y_slice[i] | eta_slice[i]');
return lp;
}
}
data {
int K;
int N;
int D;
array[N] int y;
matrix[N, D] x;
}
parameters {
matrix [D, K-1] beta_raw; // Coefficients for each category K (except the reference)
}
transformed parameters {
matrix [D, K] beta;
for(d in 1:D){
beta[d] = append_col(-sum(beta_raw[d]), beta_raw[d]);
}
}
model {
to_vector(beta_raw) ~ normal(0, 5);
target += reduce_sum(partial_sum_lpmf, y, 1, beta, x, K);
}
```

I’m running them on a server, so GPU capacity and number of cores are not a problem, but the reduced model is ~60% slower than the non reduced:

The estimated parameters are equal, which indicates that the parallelized code is making the right calculations, at least:

I’ve tried other parallelization approaches, but they were worse :/

The set of parameters I used for all runnings was this (with objects updated according to the models):

The codes and data simulation files are attached.

All kinds of help, comments, or suggestions about how to make the parallelized code the fastest would be appreciated :)

Thanks in advance!

P.S.: As an additional question, I saw much discussion about how to interpret parameters when the identifiability problem is fixed by forcing one of the \beta vector to be zero. Does anyone know how to interpret, or recover the original parameters that generated data?

Thanks so much!

non.stan (629 Bytes)

reduce.stan (981 Bytes)

reduce2.stan (924 Bytes)

reduce3.stan (936 Bytes)

reduce4.stan (950 Bytes)

Data.R (1.8 KB)