To try out MPI support, I’m using the following toy model (cobbled together from this thread: Linear, parallell regression):
functions {
  // Per-shard log-likelihood for map_rect: unpack (sigma, beta) from the
  // shared parameter vector and (y, X) from the packed real data.
  vector bl_glm(vector sigma_beta, vector theta, real[] xs, int[] xi) {
    int J = xi[1];                          // observations in this shard
    int K = xi[2];                          // number of predictors
    real lp;
    real sigma = sigma_beta[1];
    vector[K] beta = sigma_beta[2:(K + 1)];
    vector[J] y = to_vector(xs[1:J]);
    lp = normal_lpdf(y | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);
    return [lp]';
  }
}
data {
  int N;
  int K;
  int shards;
  vector[N] y;
  matrix[N, K] X;
}
transformed data {
  vector[0] theta[shards];              // no shard-specific parameters
  int<lower=0> J = N / shards;          // observations per shard
  real x_r[shards, J * (K + 1)];        // packed [y, X] block for each shard
  int x_i[shards, 2];
  {
    int pos = 1;
    for (k in 1:shards) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(append_col(y[pos:end], X[pos:end, ]));
      x_i[k, 1] = J;
      x_i[k, 2] = K;
      pos += J;                         // advance to the next shard's rows
    }
  }
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 2);
  sigma ~ normal(0, 2);
  // map_rect returns one log-likelihood contribution per shard; sum them up.
  target += sum(map_rect(bl_glm, append_row(sigma, beta), theta, x_r, x_i));
}
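To make the packing concrete: with N = 1000, K = 50, and 5 shards, J = N / shards = 200, so each x_r[k] holds J * (K + 1) = 10,200 reals, namely the 200 values of y[pos:end] followed by the 200 x 50 block of X[pos:end, ] in column-major order (to_array_1d flattens a matrix column by column, and to_matrix inside bl_glm fills column-major as well); x_i[k] just carries the two integers J and K.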
The data-generating process is just y = Xβ + ε with ε ~ N(0, 1), where X has 1000 observations and 50 predictors. With one shard and a single core, 2000 iterations take 1m50s (called with mpirun -np 1 ...).
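For reference, data of this form can be simulated with something like the following NumPy sketch (the standard-normal X and β, the seed, and the JSON output file are just illustrative choices, not something the model above depends on):

import json
import numpy as np

# Simulate y = X * beta + N(0, 1) noise with N = 1000 rows and K = 50 predictors.
rng = np.random.default_rng(1234)         # arbitrary seed
N, K, shards = 1000, 50, 5                # shards set to match one of the runs below
beta_true = rng.normal(0.0, 1.0, size=K)  # assumed standard-normal coefficients
X = rng.normal(0.0, 1.0, size=(N, K))     # assumed standard-normal predictors
y = X @ beta_true + rng.normal(0.0, 1.0, size=N)

# Dump the data in JSON form, which CmdStan can read as a data file.
data = {"N": N, "K": K, "shards": shards, "y": y.tolist(), "X": X.tolist()}
with open("mpi_test.data.json", "w") as f:
    json.dump(data, f)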
I realize this problem isn’t really meant for parallelization, but with 5 shards and 5 cores (mpirun -np 5 ...), I see a nice speedup, with a runtime of around 40 seconds. What surprises me is that if I increase the number of shards/cores further, performance degrades drastically: a bit over an hour for 50 shards/50 cores, and almost 3 hours for 100 cores.
There is no specific problem I’m trying to solve; I’m just interested in what’s going on behind the scenes. I expected the communication overhead to be roughly constant, with similar runtimes for 50 and 100 cores. Does this have something to do with how the automatic differentiation works?