# Parallelising matrix multiplications

I’m fitting multiple latent squared-exponential GPs on a shared regular grid as follows (the actual model also has covariates and uses `poisson_log_glm`, but that doesn’t matter here):

``````stan
data {
  int K; // number of groups
  int M; // grid size
  int N; // number of observations

  array[N] int                   y; // response
  array[N] int<lower=1, upper=M> x; // grid point
  array[N] int<lower=1, upper=K> g; // group
}

parameters {
  real<lower=0.0>                sigma;  // GP marginal std dev
  real<lower=0.0>                lambda; // GP length-scale
  array[K] vector[M]             eta;
}

transformed parameters {
  array[K] vector[M]             gamma; // GP realizations

  {
    matrix[M, M] C; // covariance matrix (C, not K: the name K is taken by the data size)
    matrix[M, M] L; // its Cholesky factor

    C = gp_exp_quad_cov(linspaced_array(M, 1.0, 1.0*M), sigma, lambda);
    for (i in 1:M){
      C[i, i] += 1e-8; // jitter for numerical positive-definiteness
    }
    L = cholesky_decompose(C);

    // *** the dominant bit
    for (i in 1:K){
      gamma[i] = L*eta[i];
    }
    // ***
  }
}

model {
  vector[N] lp;

  for (i in 1:N){
    lp[i] = gamma[g[i], x[i]];
  }

  sigma ~ std_normal();
  lambda ~ inv_gamma(2, 50);

  for (i in 1:K){
    eta[i] ~ std_normal();
  }

  y ~ poisson_log(lp);
}
``````
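For intuition, the non-centred GP construction in the transformed parameters block corresponds to the following NumPy sketch (dimensions and seed are illustrative, not from the model):

```python
import numpy as np

M, n_groups = 100, 5            # grid size and number of groups (illustrative)
sigma, lam = 1.0, 50.0          # GP marginal std dev and length-scale
grid = np.arange(1, M + 1, dtype=float)

# squared-exponential (exp_quad) covariance on the grid
C = sigma**2 * np.exp(-((grid[:, None] - grid[None, :]) ** 2) / (2 * lam**2))
C += 1e-8 * np.eye(M)           # jitter so the Cholesky factorisation succeeds
L = np.linalg.cholesky(C)

# non-centred parameterisation: gamma[k] = L @ eta[k] has covariance C
rng = np.random.default_rng(0)
eta = rng.standard_normal((n_groups, M))
gamma = eta @ L.T               # row k equals L @ eta[k]
```

Each row of `gamma` is a GP draw with covariance `C` whenever the corresponding row of `eta` is standard normal, which is exactly why `eta` gets a `std_normal()` prior.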

I’m testing this with N ~ 60’000, K ~ 500 and M ~ 100. Profiling indicates that runtime is dominated by the `L*eta` multiplications (marked in the code), and wall-clock times are such that I cannot scale to my full data. My chains are short with good treedepths, and I implemented the multiplication with a loop (not shown) to take advantage of `L` being lower-triangular, but it’s still too slow.

Is there anything left I can do to speed this up? In particular, can these multiplications be parallelized?

Have you tried replacing the arrays of vectors with matrices, and the loop over matrix-vector multiplies with a single matrix-matrix multiply? I could be missing something, but it seems like a reasonable thing to do for this model.

This won’t add any parallelism or anything, but it should really shrink down the expression graph needed for autodiff.
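The equivalence is easy to check numerically; here is a NumPy sketch (the arithmetic is identical, so the values match exactly up to floating point, but in Stan the single matrix-matrix node leaves far fewer autodiff nodes than K matrix-vector nodes):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 100, 500
L = np.tril(rng.standard_normal((M, M)))   # stand-in for the Cholesky factor
eta = rng.standard_normal((M, K))          # column k plays the role of eta[k]

# loop of matrix-vector products, as in the original model
gamma_loop = np.stack([L @ eta[:, k] for k in range(K)], axis=1)

# single matrix-matrix product, as suggested
gamma_mm = L @ eta

assert np.allclose(gamma_loop, gamma_mm)
```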


I can certainly try! But I thought that what’s slow here is literally the multiplications (the relevant profile statement wrapped just the loop), and switching to a matrix-matrix design won’t change their number. Am I misunderstanding something here?

For large matrices and operations, try the `--O1` stanc flag, which enables more efficient handling of matrix operations. Additionally, if you have access to a discrete GPU, I’d recommend trying it out here, since we have GPU-accelerated matrix multiplication.

There are also a number of other tools and approaches for optimising Stan’s speed; @avehtari has put together a great introduction and summary here: Options for improving Stan sampling speed – The Stan Blog


Thanks Andrew and Brian for your suggestions. I rewrote the model to use matrices, but it didn’t help noticeably. My understanding is that the struct-of-arrays (SoA) representation enabled by the stanc `--O1` flag doesn’t help if the code isn’t fully vectorised (and unfortunately I have to index to build `lp`), and in any case I already had it on. I will have no GPU in production. Switching to multi-threaded OpenBLAS as described by @avehtari (many thanks for that writeup) provided a healthy speedup though, and I can now draw 400 samples (250 warmup + 150 sampling) from the full model in ~24h, which is acceptable (and I can probably do better by reusing the mass matrix). The code is now as follows:

``````stan
data {
  int K; // number of groups
  int M; // grid size
  int N; // number of observations

  array[N] int                   y; // response
  array[N] int<lower=1, upper=M> x; // grid point
  array[N] int<lower=1, upper=K> g; // group
}

parameters {
  real<lower=0.0>                sigma;  // GP marginal std dev
  real<lower=0.0>                lambda; // GP length-scale
  vector[M*K]                    eta;
}

transformed parameters {
  matrix[M, K]                   gamma; // GP realizations

  {
    matrix[M, M] C; // covariance matrix
    matrix[M, M] L; // its Cholesky factor

    // profile: covariance
    C = gp_exp_quad_cov(linspaced_array(M, 1.0, 1.0*M), sigma, lambda);
    for (i in 1:M){
      C[i, i] += 1e-8;
    }

    // profile: cholesky
    L = cholesky_decompose(C);

    // profile: transform
    gamma = L*to_matrix(eta, M, K);
  }
}

model {
  vector[N] lp;

  // profile: intercept
  for (i in 1:N){
    lp[i] = gamma[x[i], g[i]]; // gamma is M x K, so the grid index comes first
  }

  // profile: prior-base
  sigma ~ std_normal();
  lambda ~ inv_gamma(2, 50);

  // profile: prior-eta
  eta ~ std_normal();

  // profile: glm
  y ~ poisson_log(lp);
}
``````

The full model has K ~ 20’000, M ~ 100 and N ~ 1.8 million. The profiler says:

``````
  name      thread_id total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls no_autodiff_calls
<chr>         <dbl>      <dbl>        <dbl>        <dbl>       <dbl>          <dbl>          <dbl>             <dbl>
1 cholesky    1.40e14     136.         78.8         57.2       1.30e 5      724537352         130172               151
2 covarian…   1.40e14      25.8        23.5          2.24      1.38e 7      724407180         130172               151
3 glm         1.40e14   18040.      16652.        1388.        1.30e 5              0         130172                 1
4 intercept   1.40e14   24288.      18130         6159.        2.41e11              0         130172                 1
5 prior-ba…   1.40e14       1.09        0.913        0.181     5.21e 5              0         130172                 1
6 prior-eta   1.40e14    3329.       2253.        1075.        1.30e 5              0         130172                 1
7 transform   1.40e14   13961.       8529.        5432.        1.30e 5   223718806080         130172               151
``````

so now the runtime is dominated by the construction of the linear predictor. That loop could in principle run over the number of distinct group-gridpoint combinations (~250’000) instead, with `lp` constructed in slices using `rep_vector` – do you think this would help?

Let me also say a big thank you for these tools, including to @rok_cesnovar et al. for the profiler. It’s a game-changer to be able to do rigorous Bayesian inference at this scale.


Suggestion: instead of constructing `lp` in a loop, flatten `gamma` using `to_vector`, then subset it to match `y` using an indexing array. And finally, don’t store the result of those operations in a variable; use the expression in place in your likelihood statement.

btw, `lp` is often used in folks’ Stan code, and in the examples in the docs, as the name of a variable collecting explicit computation of the log-probability, so you might want to use something more verbose for your linear predictor to avoid confusion


Do you mean something like

``````stan
y ~ poisson_log(to_vector(gamma)[idxs]);
``````

where `idxs` is an integer array that picks out the correct `gamma` for every `y`, and could be computed once in `transformed data`?
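For reference, Stan’s `to_vector` flattens a matrix in column-major order, so for the `M x K` matrix `gamma` the 1-based index array would be `idxs[i] = (g[i] - 1) * M + x[i]`. A NumPy sketch of that index arithmetic (shifted to 0-based indexing at the end):

```python
import numpy as np

rng = np.random.default_rng(2)
M, K, N = 100, 500, 1000
gamma = rng.standard_normal((M, K))
x = rng.integers(1, M + 1, size=N)   # grid point, 1-based as in Stan
g = rng.integers(1, K + 1, size=N)   # group, 1-based as in Stan

# column-major flat position of gamma[x[i], g[i]], 1-based
idxs = (g - 1) * M + x

flat = gamma.flatten(order="F")      # column-major flatten, like to_vector
lp = flat[idxs - 1]                  # shift to 0-based for NumPy

assert np.allclose(lp, gamma[x - 1, g - 1])
```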


Yup!

Excellent – will try this in the next round of optimisations. But could you break down for me where the expected speedup comes from? As far as I understand, the indexing remains, so is it just from avoiding the copying of elements one-by-one to a new variable?

One index operation instead of two, removal of a loop, removal of a variable declaration (which can affect auto-diff)


Another idea is to use `rows_dot_product` or `columns_dot_product`. Declare the parameter as `matrix[M, K] eta` with the sampling statement `to_vector(eta) ~ std_normal();`, then use

`y ~ poisson_log(rows_dot_product(L[x], eta[ , g]'));`

(note the `L[x]`: row `i` of `L[x]` is row `x[i]` of `L`, so each dot product reproduces the corresponding entry of `L * eta`). `rows_dot_product` is capable of using the GPU, so the shift should gain some performance benefit.
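The identity behind this suggestion can be checked in NumPy (0-based indices here; `L[x]` pairs row `x[i]` of `L` with column `g[i]` of `eta`, so only the `N` needed dot products are formed instead of the full `M x K` matrix `gamma`):

```python
import numpy as np

rng = np.random.default_rng(3)
M, K, N = 100, 500, 1000
L = np.tril(rng.standard_normal((M, M)))    # stand-in for the Cholesky factor
eta = rng.standard_normal((M, K))
x = rng.integers(0, M, size=N)              # grid point per observation
g = rng.integers(0, K, size=N)              # group per observation

# rows_dot_product(L[x], eta[, g]'): row i of each operand is
# row x[i] of L and column g[i] of eta, respectively
lp = np.einsum("ij,ij->i", L[x], eta[:, g].T)

# same values as building the full gamma = L @ eta and gathering
gamma = L @ eta
assert np.allclose(lp, gamma[x, g])
```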
