Operating System: macOS

Interface Version: 2.18.1

I’m trying to take advantage of within-chain parallelization and have tried the example map_rect function from the user guide:

```
functions {
  vector lr(vector beta, vector theta, real[] x, int[] y) {
    real lp = bernoulli_logit_lpmf(y | beta[1] + to_vector(x) * beta[2]);
    return [lp]';
  }
}
data {
  int y[12];
  real x[12];
}
transformed data {
  // K = 3 shards
  int ys[3, 4] = { y[1:4], y[5:8], y[9:12] };
  real xs[3, 4] = { x[1:4], x[5:8], x[9:12] };
  vector[0] theta[3];
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  target += sum(map_rect(lr, beta, theta, xs, ys));
}
```

With number of samples = 10,0000 and number of warmup = 2000, these are the reported sampling times:

STAN_NUM_THREADS = 1: 1.21 seconds

STAN_NUM_THREADS = 4: 8.51 seconds

With 1 thread, the Mac’s Activity Monitor shows 100% CPU utilization. With 4 threads, it shows ~160% utilization.

I’ve tried this in both CmdStan and RStan with similar results. Clearly I’m doing something wrong, but I’m not sure what.
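
One comparison that might help narrow this down is a baseline without map_rect at all. The following is a minimal sketch of the same logistic regression with the likelihood evaluated in a single vectorized call, assuming the same y and x data as above. No timing for it is reported here; it is only meant as a reference point for separating the cost of map_rect itself from the cost of threading.

```
data {
  int y[12];
  real x[12];
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  // Same likelihood as the sharded model, computed in one call
  // instead of three map_rect shards of four observations each.
  y ~ bernoulli_logit(beta[1] + to_vector(x) * beta[2]);
}
```

If this version runs in roughly the same time as the 1-thread map_rect run, the slowdown at 4 threads would point at threading overhead on very small shards rather than at the model itself.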