# MPI shard scaling

#1

To try out MPI support, I’m using the following toy model (cobbled together from this thread: Linear, parallell regression):

```stan
functions {
  vector bl_glm(vector sigma_beta, vector theta, real[] xs, int[] xi) {
    int J = xi[1];
    int K = xi[2];
    real lp;
    real sigma = sigma_beta[1];
    vector[K] beta = sigma_beta[2:(K + 1)];
    vector[J] y = to_vector(xs[1:J]);

    lp = normal_lpdf(y | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);

    return [lp]';
  }
}

data {
  int N;
  int K;
  int shards;

  vector[N] y;
  matrix[N, K] X;
}

transformed data {
  vector[0] theta[shards];
  int<lower=0> J = N / shards; // observations per shard

  real x_r[shards, J * (K + 1)];
  int x_i[shards, 2];

  {
    int pos = 1;
    for (k in 1:shards) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(append_col(y[pos:end], X[pos:end, ]));
      x_i[k, 1] = J;
      x_i[k, 2] = K;
      pos = end + 1; // advance to the next shard's block of observations
    }
  }
}

parameters {
  vector[K] beta;
  real<lower=0> sigma;
}

model {
  beta ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm, append_row(sigma, beta), theta, x_r, x_i));
}
```
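The layout of each `x_r[k]` matters for the unpacking inside `bl_glm`: Stan's `to_array_1d` flattens a matrix in column-major order, so the first `J` entries of a shard's array are the responses and the remaining `J * K` entries are the predictor columns, one column at a time. A minimal Python sketch of that packing (plain lists, toy sizes of my own choosing) to make the layout explicit:

```python
# Sketch of the transformed-data packing: each shard's real array holds
# [y_chunk, X_chunk column 1, X_chunk column 2, ...] in column-major order,
# mirroring to_array_1d(append_col(y, X)) in the Stan program above.
def pack_shards(y, X, shards):
    N, K = len(y), len(X[0])
    J = N // shards  # observations per shard
    x_r = []
    for k in range(shards):
        lo, hi = k * J, (k + 1) * J
        chunk = y[lo:hi]                      # first J entries: responses
        for col in range(K):                  # then each predictor column in turn
            chunk = chunk + [X[i][col] for i in range(lo, hi)]
        x_r.append(chunk)
    return x_r

# Toy example: N = 4, K = 2, 2 shards.
y = [1.0, 2.0, 3.0, 4.0]
X = [[10.0, 20.0], [11.0, 21.0], [12.0, 22.0], [13.0, 23.0]]
print(pack_shards(y, X, 2))
# shard 0 gets rows 1-2, shard 1 gets rows 3-4
```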


The data generating process is just $y = X\beta + \varepsilon$ with $\varepsilon \sim N(0, 1)$, where X has 1000 observations and 50 predictors. With one shard and a single core, 2000 iterations take 1m50s (called with `mpirun -np 1 ...`).
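For reference, a Python sketch of that data-generating process (the distribution of the true `beta` and of `X` is not stated in the post, so the `gauss` choices below are my own assumption):

```python
import random

# Simulate y = X * beta + noise with unit-variance Gaussian noise,
# 1000 observations and 50 predictors, as described above.
# Assumed: beta ~ N(0, 2) (matching the model's prior) and X entries ~ N(0, 1).
random.seed(1)
N, K = 1000, 50
beta = [random.gauss(0, 2) for _ in range(K)]
X = [[random.gauss(0, 1) for _ in range(K)] for _ in range(N)]
y = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) + random.gauss(0, 1)
     for row in X]
print(len(y), len(X), len(X[0]))
```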

I realize the problem is not meant for parallelization, but with 5 shards and 5 cores (`mpirun -np 5 ...`), I see a nice speedup with a runtime of around 40 seconds. What surprises me is that if I further increase the number of shards/cores, performance degrades drastically, with a runtime of a bit over an hour for 50 shards/50 cores and almost three hours for 100 cores.

There is no specific problem I’m trying to solve; I’m just interested in what’s going on behind the scenes. I expected the communication overhead to be roughly constant, with a similar runtime for 50 and 100 cores. Does this have something to do with how the automatic differentiation works?

#2

Likely 5/5 puts you at the sweet spot of load balance; after that, latency becomes noticeable.

> What surprises me is that if I further increase the number of shards/cores, the performance reduces drastically, with a runtime of a bit over an hour for 50 shards/50 cores and almost 3 hours for 100 cores.

I wouldn’t say the performance hit is “drastic”, as 5 vs. 50 cores is a big jump. Maybe more data points, such as 10/10 and 20/20, would paint a clearer picture.
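The latency argument can be made concrete with a toy cost model: total runtime ≈ work / n + overhead × n, where the first term is the perfectly parallel compute and the second models per-shard communication cost growing with the number of shards. A sketch of that trade-off (the constants are invented for illustration, not measured, and real MPI overhead need not be linear in n):

```python
# Toy cost model: runtime(n) = work / n + latency * n.
# Compute shrinks with more cores while per-shard communication
# overhead grows, so there is a sweet spot in between.
def runtime(n, work=100.0, latency=4.0):
    return work / n + latency * n

times = {n: runtime(n) for n in (1, 5, 10, 20, 50)}
best = min(times, key=times.get)
print(times, best)
```

With these (made-up) constants the minimum falls at n = 5, qualitatively matching the observed sweet spot before overhead dominates.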

#3

Assignments like this:

```stan
vector[K] beta = sigma_beta[2:(K + 1)];
vector[J] y = to_vector(xs[1:J]);
lp = normal_lpdf(y | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);
```


will turn the `y` vector into `var`s internally, which you want to avoid since this is data only. So prefer to write

```stan
lp = normal_lpdf(xs[1:J] | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);
```


Similar comments apply to the rest (be careful when handling data).