Parallelising matrix multiplications

I’m fitting multiple latent squared-exponential GPs on a shared regular grid as follows (the actual model also has covariates and uses poisson_log_glm, but those details don’t matter here):

data {
  int K; // number of groups
  int M; // grid size
  int N; // number of observations
  
  array[N] int                   y; // response
  array[N] int<lower=1, upper=M> x; // grid point
  array[N] int<lower=1, upper=K> g; // group
}

parameters {
  real<lower=0.0>                sigma;  // GP marginal std dev
  real<lower=0.0>                lambda; // GP length-scale
  array[K] vector[M]             eta;
}

transformed parameters {
  array[K] vector[M]             gamma; // GP realizations

  {
    matrix[M, M] C; // covariance matrix
    matrix[M, M] L; // its Cholesky factor

    C = gp_exp_quad_cov(linspaced_array(M, 1.0, 1.0*M), sigma, lambda);
    for (i in 1:M){
      C[i, i] += 1e-8;
    }
    L = cholesky_decompose(C);

    // *** the dominant bit 
    for (i in 1:K){
        gamma[i] = L*eta[i];
    }
    // ***  
  }
}

model {
  vector[N] lp;
  
  for (i in 1:N){
    lp[i] = gamma[g[i], x[i]];
  }
  
  sigma ~ std_normal();
  lambda ~ inv_gamma(2, 50);

  for (i in 1:K){
    eta[i] ~ std_normal();
  }

  y ~ poisson_log(lp);
}

I’m testing this with N ~ 60’000, K ~ 500 and M ~ 100. Profiling indicates that runtime is dominated by the L*eta multiplications (marked in the code), and wall-clock times are such that I cannot scale it to my full data. I’m running short chains, treedepths look fine, and I have also implemented the multiplication with a loop (not shown) to take advantage of L being lower-triangular, but it’s still too slow.
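
For reference, the lower-triangular version I mean is essentially this (a rough sketch, not my exact code):

for (k in 1:K) {
  for (i in 1:M) {
    // L is lower-triangular, so row i only touches the first i entries of eta[k]
    gamma[k, i] = dot_product(L[i, 1:i], eta[k, 1:i]);
  }
}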

Is there anything left I can do to speed this up? In particular, can these multiplications be parallelized?

Have you tried replacing the arrays of vectors with matrices and replacing the loop over matrix-vector multiplies with one matrix-matrix multiply? I could be missing something, but it seems like a reasonable thing to do for this model.

This won’t add any parallelization or anything, but it should really shrink down the expression graph needed for autodiff.
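
Something along these lines (a sketch only, with eta declared as matrix[M, K] instead of array[K] vector[M]):

// one matrix-matrix product replaces the K matrix-vector products
matrix[M, K] gamma = L * eta;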

I can certainly try! But I thought that what’s slow here is literally the multiplications (the relevant profile statement wrapped just the loop), and switching to a matrix-matrix design won’t change their number. Am I misunderstanding something here?

For models dominated by large matrix operations, try compiling with the --O1 stanc flag; it enables much more efficient handling of those operations. Additionally, if you have access to a discrete GPU, I’d recommend trying that out here, since we have GPU-accelerated matrix multiplication.

There are also a number of other tools/approaches for optimising Stan’s speed; @avehtari has put together a great introduction and summary here: Options for improving Stan sampling speed – The Stan Blog

Thanks Andrew and Brian for your suggestions. I rewrote the model to use matrices, but it didn’t help noticeably. My understanding is that the struct-of-arrays optimisation provided by the stanc -O1 flag doesn’t help if the code isn’t fully vectorized (and unfortunately I have to index to construct lp), and in any case I already had it switched on. I will have no GPU in production. Switching to multi-threaded OpenBLAS as described by @avehtari (many thanks for that writeup) did provide a healthy speedup, though, and I can now get 400 iterations (250 warmup + 150 sampling) from the full model in ~24h, which is acceptable (and I can probably do better by reusing the mass matrix). The code is now as follows:

data {
  int K; // number of groups
  int M; // grid size
  int N; // number of observations
  
  array[N] int                   y; // response
  array[N] int<lower=1, upper=M> x; // grid point
  array[N] int<lower=1, upper=K> g; // group
}

parameters {
  real<lower=0.0>                sigma;  // GP marginal std dev
  real<lower=0.0>                lambda; // GP length-scale
  vector[M*K]                    eta;
}

transformed parameters {
  matrix[M, K]                   gamma; // GP realizations

  {
    matrix[M, M] C; // covariance matrix
    matrix[M, M] L; // its Cholesky factor

    // profile: covariance
    C = gp_exp_quad_cov(linspaced_array(M, 1.0, 1.0*M), sigma, lambda);
    for (i in 1:M){
      C[i, i] += 1e-8;
    }  

    // profile: cholesky
    L = cholesky_decompose(C);

    // profile: transform   
    gamma = L*to_matrix(eta, M, K);
  }
}

model {
  vector[N] lp;
  
  // profile: intercept
  for (i in 1:N){
    lp[i] = gamma[x[i], g[i]]; // gamma is M x K: rows index grid points, columns index groups
  }
  
  // profile: prior-base
  sigma ~ std_normal();
  lambda ~ inv_gamma(2, 50);

  // profile: prior-eta
  eta ~ std_normal();

  // profile: glm
  y ~ poisson_log(lp);
}

The full model has K ~ 20’000, M ~ 100 and N ~ 1.8 million. The profiler says:

  name       thread_id total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1 cholesky     1.40e14      136.         78.8         57.2      1.30e5       724537352         130172               151
2 covariance   1.40e14       25.8        23.5          2.24     1.38e7       724407180         130172               151
3 glm          1.40e14    18040.      16652.        1388.       1.30e5               0         130172                 1
4 intercept    1.40e14    24288.      18130         6159.       2.41e11              0         130172                 1
5 prior-base   1.40e14        1.09        0.913        0.181    5.21e5               0         130172                 1
6 prior-eta    1.40e14     3329.       2253.        1075.       1.30e5               0         130172                 1
7 transform    1.40e14    13961.       8529.        5432.       1.30e5    223718806080         130172               151

so now the runtime is dominated by the construction of the linear predictor. That loop could in principle run over the distinct group-gridpoint combinations (~250’000 of them) instead, with lp built up in slices using rep_vector – do you think that would help?
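
What I have in mind is roughly this (a sketch; n_cells, cell_start, cell_end, cell_x and cell_g are hypothetical helpers I’d precompute in transformed data, assuming the observations are sorted so that each group-gridpoint cell is contiguous):

for (j in 1:n_cells) {
  int len = cell_end[j] - cell_start[j] + 1;
  // every observation in this cell shares the same linear-predictor value
  lp[cell_start[j]:cell_end[j]] = rep_vector(gamma[cell_x[j], cell_g[j]], len);
}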

Let me also say a big thank you for these tools, including to @rok_cesnovar et al. for the profiler. It’s a game-changer to be able to do rigorous Bayesian inference at this scale.

Suggestion: instead of constructing lp in a loop, flatten gamma using to_vector, then subset/reorder it to match y using an indexing array. And finally, don’t store the result of those operations in a variable but instead use the operations in-place in your likelihood expression.

btw, “lp” is often used in folks’ Stan code, and in the examples in the docs, as the name of a variable collecting an explicit computation of the log probability, so you might want to use something more descriptive for your linear predictor to avoid confusion.

Do you mean something like

y ~ poisson_log(to_vector(gamma)[idxs]);

where idxs is an integer array that picks out the correct gamma for every y, and could be computed once in transformed data?
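
Concretely, something like this (a sketch; it relies on to_vector flattening the M x K matrix column-major, so gamma[x, g] lands at position (g - 1) * M + x):

transformed data {
  array[N] int<lower=1, upper=M*K> idxs;
  for (i in 1:N) {
    idxs[i] = (g[i] - 1) * M + x[i]; // column-major position of gamma[x[i], g[i]]
  }
}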

Yup!

Excellent – will try this in the next round of optimisations. But could you break down for me where the expected speedup comes from? As far as I understand, the indexing remains, so is it just from avoiding the copying of elements one-by-one to a new variable?

One index operation instead of two, removal of a loop, and removal of a variable declaration (which can affect autodiff).

Another idea is to use rows_dot_product or columns_dot_product. Declare the parameter eta as matrix[M, K] eta with sampling statement

to_vector(eta) ~ std_normal();

and then use

y ~ poisson_log(rows_dot_product(L[x], eta[ , g]'));

where L[x] picks out, for each observation, the row of L corresponding to its grid point. rows_dot_product is capable of running on the GPU, so the switch should give some additional performance benefit.
