MPI shard scaling

#1

To try out MPI support, I’m using the following toy model (cobbled together from this thread: Linear, parallell regression):

functions {
    vector bl_glm(vector sigma_beta, vector theta, real[] xs, int[] xi) {
        int J = xi[1];
        int K = xi[2];
        real lp;
        real sigma = sigma_beta[1];
        vector[K] beta = sigma_beta[2:(K + 1)];
        vector[J] y = to_vector(xs[1:J]);

        lp = normal_lpdf(y | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);
        
        return [lp]';
    }
}

data {
    int N;
    int K;
    int shards;

    vector[N] y;
    matrix[N, K] X;
}

transformed data {
    vector[0] theta[shards];
    int<lower=0> J = N / shards; // observations per shard (assumes N is divisible by shards)

    real x_r[shards, J * (K + 1)];
    int x_i[shards, 2];
    
    {
        int pos = 1;
        for (k in 1:shards) {
            int end = pos + J - 1;
            x_r[k] = to_array_1d(append_col(y[pos:end], X[pos:end]));
            x_i[k, 1] = J;
            x_i[k, 2] = K;
            pos = end + 1; // advance to the next shard's block of rows
        }
    }
}

parameters {
    vector[K] beta;
    real<lower=0> sigma;
}

model {
    beta ~ normal(0, 2);
    sigma ~ normal(0, 2);
    target += sum(map_rect(bl_glm, append_row(sigma, beta), theta, x_r, x_i));
}

The data-generating process is simply y = X\beta + \epsilon with \epsilon ~ N(0, 1), where X has 1000 observations and 50 predictors. With one shard and a single core, 2000 iterations take 1m50s (called with mpirun -np 1 ...).
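
For concreteness, data of this form can be simulated with a small standalone Stan program run with algorithm=fixed_param (a minimal sketch only; the standard-normal predictors and coefficients are assumptions, not necessarily the exact setup behind the timings below):

data {
    int N;
    int K;
}
generated quantities {
    vector[K] beta_true;
    matrix[N, K] X;
    vector[N] y;

    // draw "true" coefficients and standard-normal predictors,
    // then add unit-variance Gaussian noise to the linear predictor
    for (k in 1:K)
        beta_true[k] = normal_rng(0, 1);
    for (n in 1:N) {
        for (k in 1:K)
            X[n, k] = normal_rng(0, 1);
        y[n] = normal_rng(dot_product(X[n], beta_true), 1);
    }
}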

I realize the problem is not really meant for parallelization, but with 5 shards and 5 cores (mpirun -np 5 ...) I see a nice speedup, with a runtime of around 40 seconds. What surprises me is that if I increase the number of shards/cores further, performance degrades drastically: a bit over an hour for 50 shards/50 cores and almost 3 hours for 100 cores.

There is no specific problem I’m trying to solve; I’m just interested in what’s going on behind the scenes. I expected the communication overhead to be constant, with a similar runtime for 50 and 100 cores. Does this have something to do with how the automatic differentiation works?

#2

Likely 5 shards/5 cores puts you at the sweet spot of load balancing; after that, communication latency becomes noticeable.

What surprises me is that if I increase the number of shards/cores further, performance degrades drastically: a bit over an hour for 50 shards/50 cores and almost 3 hours for 100 cores.

I wouldn’t say the performance hit is “drastic”, since going from 5 to 50 cores is itself a big jump. More data points, such as 10/10 and 20/20, would probably draw a clearer picture.
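
As a back-of-the-envelope picture (illustrative only; c_{work} and c_{comm} are hypothetical constants, not measured quantities): the useful work per shard shrinks with the number of shards S, while the cost of distributing the parameters and gathering the results is paid once per worker, so roughly

T(S) \approx c_{work} N / S + c_{comm} S,

which has a minimum at some moderate S and grows again once the second term dominates. With N = 1000, 50 shards also means only J = 20 rows per shard, so there is very little computation left to hide the communication behind.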

#3

Assignments like this:

vector[K] beta = sigma_beta[2:(K + 1)];
vector[J] y = to_vector(xs[1:J]);
lp = normal_lpdf(y | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);

will turn the y vector into vars internally, which you want to avoid since this is data only. So prefer to write

lp = normal_lpdf(xs[1:J] | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);

Similar comments apply to the rest (be careful with how the data is handled).
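
Putting that together, a version of the worker function that keeps all of the data inside xs might look like this (a sketch only, untested; nothing but sigma and beta comes from the parameter vector):

functions {
    vector bl_glm(vector sigma_beta, vector theta, real[] xs, int[] xi) {
        int J = xi[1];
        int K = xi[2];
        real sigma = sigma_beta[1];
        vector[K] beta = sigma_beta[2:(K + 1)];
        // xs[1:J] holds y and xs[(J + 1):(J * (K + 1))] holds X; slicing xs
        // directly keeps both as data instead of copying them into vars
        real lp = normal_lpdf(xs[1:J] | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);
        return [lp]';
    }
}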