MPI shard scaling

To try out MPI support, I’m using the following toy model (cobbled together from this thread: Linear, parallel regression):

functions {
    vector bl_glm(vector sigma_beta, vector theta, real[] xs, int[] xi) {
        int J = xi[1];
        int K = xi[2];
        real lp;
        real sigma = sigma_beta[1];
        vector[K] beta = sigma_beta[2:(K + 1)];
        vector[J] y = to_vector(xs[1:J]);

        lp = normal_lpdf(y | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);
        
        return [lp]';
    }
}

data {
    int N;
    int K;
    int shards;

    vector[N] y;
    matrix[N, K] X;
}

transformed data {
    vector[0] theta[shards];
    int<lower=0> J = N / shards; // observations per shard

    real x_r[shards, J * (K + 1)];
    int x_i[shards, 2];
    
    {
        int pos = 1;
        for (k in 1:shards) {
            int end = pos + J - 1;
            x_r[k] = to_array_1d(append_col(y[pos:end], X[pos:end]));
            x_i[k, 1] = J;
            x_i[k, 2] = K;
            pos = end + 1; // advance to the next shard's block of rows
        }
    }
}

parameters {
    vector[K] beta;
    real<lower=0> sigma;
}

model {
    beta ~ normal(0, 2);
    sigma ~ normal(0, 2);
    target += sum(map_rect(bl_glm, append_row(sigma, beta), theta, x_r, x_i));
}

The data-generating process is just y = Xβ + ε with ε ~ N(0, 1), where X has 1000 observations and 50 predictors. With one shard and a single core, 2000 iterations take 1m50s (called with mpirun -np 1 ...).
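
For reference, a minimal, illustrative sketch of one way to simulate such data directly in Stan (run with algorithm=fixed_param); the standard-normal choices for X and beta here are just assumptions, not necessarily what I used:

data {
    int N;
    int K;
}
generated quantities {
    matrix[N, K] X;   // simulated design matrix
    vector[K] beta;   // "true" coefficients
    vector[N] y;      // simulated responses
    for (k in 1:K)
        beta[k] = normal_rng(0, 1);
    for (n in 1:N) {
        for (k in 1:K)
            X[n, k] = normal_rng(0, 1);
        y[n] = normal_rng(X[n] * beta, 1);   // y = X * beta + N(0, 1) noise
    }
}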

I realize the problem is not meant for parallelization, but with 5 shards and 5 cores (mpirun -np 5 ...), I see a nice speedup, with a runtime of around 40 seconds. What surprises me is that if I increase the number of shards/cores further, the performance degrades drastically, with a runtime of a bit over an hour for 50 shards/50 cores and almost 3 hours for 100 cores.

There is no specific problem I’m trying to solve; I’m just interested in what’s going on behind the scenes. I expected the communication overhead to be roughly constant, with a similar runtime for 50 and 100 cores. Does this have something to do with how the automatic differentiation works?

Likely 5 shards/5 cores puts you at the sweet spot of load balance; after that, latency becomes noticeable.

What surprises me is that if I increase the number of shards/cores further, the performance degrades drastically, with a runtime of a bit over an hour for 50 shards/50 cores and almost 3 hours for 100 cores.

I wouldn’t say the performance hit is “drastic”; 5 vs. 50 cores is a big difference. Maybe more data points, such as 10/10 and 20/20, would paint a clearer picture.

Assignments like this:

vector[K] beta = sigma_beta[2:(K + 1)];
vector[J] y = to_vector(xs[1:J]);
lp = normal_lpdf(y | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);

will turn the y vector into vars internally, which you want to avoid since this is data only. So prefer to write:

lp = normal_lpdf(xs[1:J] | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);

Similar comments apply to the rest (be careful in the handling of data).
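
Putting those pieces together, a sketch of how the worker function could look with the data left as data (untested, just to illustrate the idea):

vector bl_glm(vector sigma_beta, vector theta, real[] xs, int[] xi) {
    int J = xi[1];
    int K = xi[2];
    real sigma = sigma_beta[1];
    vector[K] beta = sigma_beta[2:(K + 1)];
    // slice the packed data directly inside the lpdf call, so y and X
    // stay data instead of being copied into autodiff variables
    real lp = normal_lpdf(xs[1:J] | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);
    return [lp]';
}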


Thank you all for your replies!
I finally understood your point on handling the data.

The solution to my initial problem was pretty dumb: it was just an MPI configuration error that caused all processes to spawn on a single node. (If anyone has the same problem, you can check for it by running mpirun -np 20 hostname and seeing whether all the processes report the same host.)

Fixing that, I now get much more reasonable runtimes:
comparison_plot.pdf (6.2 KB)

This is for the Warfarin model here: Map_rect threading.
nxalien2 is a single machine with an Intel Xeon E5-2630 v4 (10 cores, 2 threads per core);
the pool machines are pretty old Intel 3770s (4 cores, 2 threads per core), and the superpool machines are newer Intel 8700s (6 cores, 2 threads per core).

Out of interest, what is the setup you used for your MPI benchmark in the link above?

Again, thank you all for your kind help
Daniel

We are using here these CPUs:

model name	: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping	: 4
cpu MHz		: 2500.000
cache size	: 25600 KB

Comparing my results (just 0.5 minutes on 16 cores) with yours suggests that we have a better setup here. I am really not an expert on this, but maybe you can tweak your MPI performance with a better configuration? Again, not an expert here, but you should ensure that shared memory is used on a local machine. For most of my benchmarks I am using OpenMPI installed on our cluster.

In any case, it looks as if this is becoming useful for you, and I am glad my comments sank in and helped.


In any case, it looks as if this is becoming useful for you, and I am glad my comments sank in and helped.

Just let me say that the possibility to parallelize a single chain is super super awesome!

I’m pretty sure this has something to do with our MPI configuration. If I find some spare time, I’m going to spin up an AWS instance and see how it scales there.
I actually look forward to building a little modelling farm in my basement to remotely send my models to; that sounds like fun.