I’ve noticed some peculiar behavior for a relatively simple model whose likelihood is computed via `reduce_sum`: the larger the dataset, the lower the computational resource usage, both on the GPU via OpenCL and on the CPU via TBB. In fact, for a very large dataset, performance with the GPU is even worse than without any parallelization. This is the opposite of what I expected, since larger chunks of data should keep individual cores busy for longer. Also, the copy load to and from the GPU shown in the Windows task manager never exceeds 1%, so data transfer hardly seems to be the bottleneck. The problem does not appear to be tied to a specific hardware configuration: my personal Windows machine and two large, entirely separate Linux machines all exhibit the same behavior.

Can this really all be attributed to the overhead caused by copying data back and forth between the master and its worker threads?

Is this a common behavior, and if so, is there any way to improve it?

Is there any Stan-specific guidance on improving parallelization efficiency beyond what the official documentation covers?

For me, this poses a major obstacle, because for large datasets (100k observations, 10 variables) the sampler effectively runs on a single CPU core per chain.
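To make the question concrete, my mental model of what `reduce_sum` does is roughly the following. This is a serial Python sketch of fixed-grainsize slicing; the real scheduler (TBB) partitions the range recursively and adaptively, so actual chunk sizes vary:

```python
# Sketch of how reduce_sum partitions its sliced (first) argument.
# Fixed-grainsize split for illustration only; Stan/TBB splits the
# index range recursively and balances chunks across threads.

def partition(n_obs, grainsize):
    """Yield (start, end) pairs, 1-based inclusive, covering 1..n_obs."""
    start = 1
    while start <= n_obs:
        end = min(start + grainsize - 1, n_obs)
        yield (start, end)
        start = end + 1

def reduce_sum_sketch(partial_sum, data, grainsize):
    """Serial stand-in: accumulate partial_sum over each slice of `data`."""
    n = len(data)
    return sum(
        partial_sum(data[start - 1:end], start, end)
        for start, end in partition(n, grainsize)
    )

# Toy partial sum: just a plain sum over the slice.
total = reduce_sum_sketch(lambda sl, s, e: sum(sl), list(range(1, 101)), 7)
```

If this picture is right, a larger dataset at fixed grainsize should simply produce more independent chunks, which is why the drop in core utilization surprises me.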

For reference, I’ve simply rewritten the multivariate probit model from the documentation to use `reduce_sum` for the likelihood:

```stan
functions {
  int sum2d(array[,] int a) {
    int s = 0;
    for (i in 1 : size(a)) {
      s += sum(a[i]);
    }
    return s;
  }
  
  real partial_sum_lpdf(array[] matrix comb, int start, int end,
                        vector beta, matrix lchol, int D) {
    array[end - start + 1] vector[D] xb;
    for (i in 1 : (end - start + 1)) {
      // column 1 of comb holds z; columns 2..(D + 2) hold the predictors
      xb[i] = comb[i, : , 2 : (D + 2)] * beta;
    }
    return multi_normal_cholesky_lupdf(comb[ : , : , 1] | xb, lchol);
  }
}
data {
  int<lower=1> D;
  int<lower=1> N;
  array[N, D] int<lower=0, upper=1> y;
  // array[N] matrix[D, D + K] x;
  matrix[N * D, D + 1] x;
  int M; // grainsize for reduce_sum
}
transformed data {
  int<lower=0> N_pos;
  array[sum2d(y)] int<lower=1, upper=N> n_pos;
  array[size(n_pos)] int<lower=1, upper=D> d_pos;
  int<lower=0> N_neg;
  array[(N * D) - size(n_pos)] int<lower=1, upper=N> n_neg;
  array[size(n_neg)] int<lower=1, upper=D> d_neg;
  N_pos = size(n_pos);
  N_neg = size(n_neg);
  {
    int i = 1;
    int j = 1;
    for (n in 1 : N) {
      for (d in 1 : D) {
        if (y[n, d] == 1) {
          n_pos[i] = n;
          d_pos[i] = d;
          i += 1;
        } else {
          n_neg[j] = n;
          d_neg[j] = d;
          j += 1;
        }
      }
    }
  }
}
parameters {
  vector[D + 1] beta;
  cholesky_factor_corr[D] L_Omega;
  vector<lower=0>[N_pos] z_pos;
  vector<upper=0>[N_neg] z_neg;
}
transformed parameters {
  array[N] vector[D] z;
  profile("utilparsepos") {
    for (n in 1 : N_pos) {
      z[n_pos[n], d_pos[n]] = z_pos[n];
    }
  }
  profile("utilparseneg") {
    for (n in 1 : N_neg) {
      z[n_neg[n], d_neg[n]] = z_neg[n];
    }
  }
  array[N] matrix[D, D + 2] comb;
  profile("combine") {
    for (i in 1 : N) {
      comb[i] = append_col(z[i], x[((i - 1) * D + 1) : (i * D), : ]);
    }
  }
}
model {
  profile("params") {
    L_Omega ~ lkj_corr_cholesky(10);
    beta ~ normal(-0.5, 0.5);
  }
  profile("util") {
    target += reduce_sum(partial_sum_lupdf, comb, M, beta, L_Omega, D);
  }
}
generated quantities {
  corr_matrix[D] Omega = multiply_lower_tri_self_transpose(L_Omega);
}
```
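As a rough sanity check on the copy question: the sliced argument `comb` dominates the data volume per chunk, while the shared arguments (`beta`, `L_Omega`) that `reduce_sum` copies for each partial-sum evaluation are tiny. A back-of-envelope calculation in Python, assuming `D = 10` as in my 10-variable setting and 8-byte doubles:

```python
# Back-of-envelope data volumes per reduce_sum chunk, assuming
# D = 10 outcomes (the 10-variable case above) and 8-byte doubles.

BYTES_PER_DOUBLE = 8

def chunk_bytes(grainsize, D):
    """Sliced argument: `grainsize` matrices of size D x (D + 2)."""
    return grainsize * D * (D + 2) * BYTES_PER_DOUBLE

def shared_bytes(D):
    """Shared arguments copied per chunk: beta (D + 1) plus L_Omega (D * D)."""
    return (D + 1 + D * D) * BYTES_PER_DOUBLE

D = 10
print(chunk_bytes(1000, D))  # sliced data per chunk at grainsize 1000
print(shared_bytes(D))       # shared-argument copy per chunk
```

If that arithmetic is right, the shared-argument copies come to well under a kilobyte per chunk, so plain data movement seems unlikely to explain the slowdown unless the per-chunk autodiff bookkeeping itself is what dominates.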