Hello again. My struggle with within-chain parallelization using `reduce_sum` is unfortunately ramping up. I moved to CmdStanR (after CmdStanPy did not work with multithreading), but when I run this model using `reduce_sum` I have two problems:

(1) the model with `reduce_sum` takes **twice** as much time as the model without `reduce_sum`.

(2) even though I designed the model to have two threads and defined `threads_per_chain = 2`, I see that I have **3 running threads**. Maybe these problems are related.

I am really stuck…

```
model {
  .
  .
  .
  for (n in 1:N) {
    for (w in 1:W) {
      if (w == 1) {
        X[n, w] ~ normal(mu_prior_x, sigma_v);
      }
      else {
        X[n, w] ~ normal((A * X[n, w-1] + B * U[n, w-1]), Q);
      }
      theta_pr[n, w] ~ normal(C[1,] * X[n, w], sigma_r);
      int grainsize = 13;
      vector[tr_max] a;
      vector[tr_max] b;
      vector[tr_max] c;
      a[ :tr[n, w]] = data1[n, w, :tr[n, w]] ./ (1 + theta[n, w] * data2[n, w, :tr[n, w]]);
      b[ :tr[n, w]] = data3[n, w, :tr[n, w]];
      c[ :tr[n, w]] = a[ :tr[n, w]] - b[ :tr[n, w]];
      target += reduce_sum(partial_sum, choice[n, w, :tr[n, w]], grainsize, c[:tr[n, w]], theta[n, w]);
    }
  }
}
```
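
For context, `reduce_sum` expects `partial_sum` to take the slice of the first argument, the start and end indices of that slice, and then the shared arguments in the order they are passed in the call above. The actual function body is not shown in this post, so the following is only a minimal sketch of that shape: the Bernoulli-logit likelihood and the argument names are assumptions for illustration, written in the array syntax of CmdStan 2.23.

```
functions {
  // Hypothetical placeholder: the real likelihood is not shown in this post.
  real partial_sum(int[] choice_slice, int start, int end,
                   vector c, real theta) {
    // start and end index into the full shared vector c,
    // so c[start:end] lines up with choice_slice.
    // theta is accepted to match the call; it is unused in this placeholder.
    return bernoulli_logit_lpmf(choice_slice | c[start:end]);
  }
}
```

With this shape, each call to `reduce_sum` splits only the `tr[n, w]` trials of one `(n, w)` cell into chunks of roughly `grainsize` elements.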

CmdStan version 2.23.0

macOS Sierra

R version 3.3.2.