I’m fitting multiple latent squared-exponential GPs on a shared regular grid as follows (the actual model also has covariates and uses `poisson_log_glm`, but it doesn’t matter):

```
data {
  int K;                               // number of groups
  int M;                               // grid size
  int N;                               // number of observations
  array[N] int y;                      // response
  array[N] int<lower=1, upper=M> x;    // grid point
  array[N] int<lower=1, upper=K> g;    // group
}
parameters {
  real<lower=0.0> sigma;               // GP marginal std dev
  real<lower=0.0> lambda;              // GP length-scale
  array[K] vector[M] eta;
}
transformed parameters {
  array[K] vector[M] gamma;            // GP realizations
  {
    matrix[M, M] C;                    // covariance matrix (named C, not K, to avoid shadowing data K)
    matrix[M, M] L;                    // its Cholesky factor
    C = gp_exp_quad_cov(linspaced_array(M, 1.0, 1.0 * M), sigma, lambda);
    for (i in 1:M) {
      C[i, i] += 1e-8;                 // jitter for numerical stability
    }
    L = cholesky_decompose(C);
    // *** the dominant bit
    for (i in 1:K) {
      gamma[i] = L * eta[i];
    }
    // ***
  }
}
model {
  vector[N] lp;
  for (i in 1:N) {
    lp[i] = gamma[g[i], x[i]];
  }
  sigma ~ std_normal();
  lambda ~ inv_gamma(2, 50);
  for (i in 1:K) {
    eta[i] ~ std_normal();
  }
  y ~ poisson_log(lp);
}
```

I’m testing this with N ~ 60’000, K ~ 500 and M ~ 100. Profiling indicates that runtime is dominated by the L*eta multiplications (marked in the code), and wall-clock times are such that I cannot scale it to my full data. I’ve run short chains with good treedepths, and I implemented the multiplication with a loop (not shown) to take advantage of L being lower-triangular, but it’s still too slow.

Is there anything left I can do to speed this up? In particular, can these multiplications be parallelized?

Have you tried replacing the arrays of vectors with matrices, and the loop over matrix-vector multiplies with one matrix-matrix multiply? I could be missing something, but it seems like a reasonable thing to do for this model.

This won’t add any parallelization or anything, but it should really shrink down the expression graph needed for autodiff
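To see the claim concretely: stacking the `eta` vectors as columns of a matrix `E` turns the K matrix-vector products into one matrix-matrix product, column by column. A toy check in pure Python (illustrative sizes only, not the model’s real M and K; this is a sketch of the algebra, not Stan code):

```python
# Check that one L @ E reproduces the K separate L @ eta_k products,
# where the eta_k are stacked as the columns of E.

def matvec(L, v):
    # plain matrix-vector product
    return [sum(L[i][j] * v[j] for j in range(len(v))) for i in range(len(L))]

def matmat(L, E):
    # plain matrix-matrix product; column c of the result is L @ (column c of E)
    M, K = len(E), len(E[0])
    return [[sum(L[i][j] * E[j][c] for j in range(M)) for c in range(K)]
            for i in range(len(L))]

L = [[1.0, 0.0, 0.0],
     [0.5, 2.0, 0.0],
     [0.3, 0.1, 1.5]]                           # lower-triangular, Cholesky-like
etas = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]      # K = 2 latent vectors

loop_result = [matvec(L, eta) for eta in etas]  # the current per-group loop
E = [[etas[k][i] for k in range(2)] for i in range(3)]  # etas as columns
stacked = matmat(L, E)                          # one matrix-matrix multiply
columns = [[stacked[i][k] for i in range(3)] for k in range(2)]
assert columns == loop_result
```

The arithmetic (and its count) is identical; the win in Stan comes from the autodiff side, as noted above.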


I can certainly try! But I thought that what’s slow here is literally the multiplications (the relevant profile statement wrapped just the loop), and switching to a matrix-matrix design won’t change their number. Am I misunderstanding something here?

For large matrices and operations, try using the `--O1` stanc flag; it allows for much more efficient handling. Additionally, if you have access to a discrete GPU, I’d recommend trying that out here, since we have GPU-accelerated matrix multiplication.

There are also a number of other tools/approaches for optimising Stan’s speed; @avehtari has put together a great introduction and summary here: Options for improving Stan sampling speed – The Stan Blog


Thanks Andrew and Brian for your suggestions. I rewrote the model to use matrices, but it didn’t help noticeably. My understanding is that the array-of-structs feature provided by the stanc `-O1` flag doesn’t help if the code isn’t fully vectorized (and unfortunately I have to index to get `lp`), and in any case I already had it on. I will have no GPU in production. Switching to multi-threaded OpenBLAS as described by @avehtari (many thanks for that writeup) provided a healthy speedup though, and I can now get 400 samples (250 warmup + 150 sampling) from the full model in ~24h, which is acceptable (and I can probably do better by reusing the mass matrix). The code is now as follows:

```
data {
  int K;                               // number of groups
  int M;                               // grid size
  int N;                               // number of observations
  array[N] int y;                      // response
  array[N] int<lower=1, upper=M> x;    // grid point
  array[N] int<lower=1, upper=K> g;    // group
}
parameters {
  real<lower=0.0> sigma;               // GP marginal std dev
  real<lower=0.0> lambda;              // GP length-scale
  vector[M * K] eta;
}
transformed parameters {
  matrix[M, K] gamma;                  // GP realizations, grid points as rows
  {
    matrix[M, M] C;                    // covariance matrix
    matrix[M, M] L;                    // its Cholesky factor
    // profile: covariance
    C = gp_exp_quad_cov(linspaced_array(M, 1.0, 1.0 * M), sigma, lambda);
    for (i in 1:M) {
      C[i, i] += 1e-8;                 // jitter for numerical stability
    }
    // profile: cholesky
    L = cholesky_decompose(C);
    // profile: transform
    gamma = L * to_matrix(eta, M, K);
  }
}
model {
  vector[N] lp;
  // profile: intercept
  for (i in 1:N) {
    lp[i] = gamma[x[i], g[i]];         // gamma is now M x K, so index by (x, g)
  }
  // profile: prior-base
  sigma ~ std_normal();
  lambda ~ inv_gamma(2, 50);
  // profile: prior-eta
  eta ~ std_normal();
  // profile: glm
  y ~ poisson_log(lp);
}
```

The full model has K ~ 20’000, M ~ 100 and N ~ 1.8 million. The profiler says:

```
  name        thread_id  total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1 cholesky      1.40e14      136.          78.8         57.2      1.30e5       724537352         130172               151
2 covariance    1.40e14       25.8         23.5          2.24     1.38e7       724407180         130172               151
3 glm           1.40e14    18040.       16652.        1388.       1.30e5               0         130172                 1
4 intercept     1.40e14    24288.       18130.        6159.       2.41e11              0         130172                 1
5 prior-base    1.40e14        1.09         0.913        0.181    5.21e5               0         130172                 1
6 prior-eta     1.40e14     3329.        2253.        1075.       1.30e5               0         130172                 1
7 transform     1.40e14    13961.        8529.        5432.       1.30e5    223718806080         130172               151
```

so now the runtime is dominated by the construction of the linear predictor. That loop could in principle be over the number of distinct group–gridpoint combinations (~250’000) and `lp` constructed in slices with `rep_vector` – do you think this will help?
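To make the slicing idea concrete, here is a pure-Python sketch with toy data (the `rep_vector` call is mimicked by list repetition; the group/gridpoint values are made up for illustration):

```python
# Instead of one lp entry per observation, loop over distinct
# (group, gridpoint) combinations and repeat the shared gamma value
# for all observations in that combination.
from collections import Counter

g = [1, 1, 2, 2, 2, 1]          # group index per observation
x = [3, 3, 1, 1, 2, 3]          # grid point per observation
gamma = {(1, 3): 0.7, (2, 1): -0.2, (2, 2): 1.1}  # toy GP values

# per-observation construction (the current loop)
lp_loop = [gamma[(gi, xi)] for gi, xi in zip(g, x)]

# per-combination construction (the rep_vector idea); note it only matches
# lp_loop elementwise if the observations are sorted by combination
counts = Counter(zip(g, x))
lp_sliced = []
for combo, n in counts.items():
    lp_sliced.extend([gamma[combo]] * n)   # ~ rep_vector(gamma[combo], n)

assert sorted(lp_loop) == sorted(lp_sliced)
```

The loop shrinks from N iterations to the number of distinct combinations, at the cost of requiring the data (or the likelihood terms) to be ordered by combination.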

Let me also say a big thank you for these tools, including to @rok_cesnovar et al. for the profiler. It’s a game-changer to be able to do rigorous Bayesian inference at this scale.


Suggestion: instead of constructing `lp` in a loop, flatten `gamma` using `to_vector`, then sort/subset it to match `y` using an indexing array. And finally, don’t store the result of those operations in a variable; instead, use the operations in place in your likelihood expression.

btw, “lp” is often used in folks’ Stan code, and in the examples in the docs, as the name of a variable collecting explicit computation of the log probability, so you might want to use something more verbose for your linear predictor to avoid confusion


Do you mean something like

```
y ~ poisson_log(to_vector(gamma)[idxs]);
```

where `idxs` is an integer array that picks out the correct `gamma` for every `y`, and could be computed once in `transformed data`?
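(Assuming `gamma` is `matrix[M, K]` with grid points as rows, `to_vector` reads the matrix in column-major order, so the 1-based flat index for observation i would be `(g[i] - 1) * M + x[i]`. A pure-Python toy check of that mapping, with made-up values:)

```python
# Column-major flattening of an M x K matrix, and per-observation lookup
# via a precomputed index array.
M, K = 3, 2
gamma = [[10.0, 40.0],
         [20.0, 50.0],
         [30.0, 60.0]]          # gamma[x-1][g-1], toy values

flat = [gamma[i][j] for j in range(K) for i in range(M)]  # column-major

x = [1, 3, 2]                   # grid point per observation (1-based)
g = [2, 1, 2]                   # group per observation (1-based)
idxs = [(gi - 1) * M + xi for xi, gi in zip(x, g)]        # 1-based flat index

picked = [flat[i - 1] for i in idxs]
assert picked == [gamma[xi - 1][gi - 1] for xi, gi in zip(x, g)]
```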


Excellent – will try this in the next round of optimisations. But could you break down for me where the expected speedup comes from? As far as I understand, the indexing remains, so is it just from avoiding copying the elements one by one into a new variable?

One index operation instead of two, removal of the loop, and removal of a variable declaration (which can affect autodiff).


Another idea is to use `rows_dot_product` or `columns_dot_product`. Define parameter `eta` as `matrix[M, K] eta` with sampling statement

`to_vector(eta) ~ std_normal();`

Then use

`y ~ poisson_log(rows_dot_product(L, eta[ , g]'));`

`rows_dot_product` is capable of using the GPU, so the shift should gain some performance benefits.
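To spell out the `rows_dot_product` semantics in a pure-Python sketch (with one assumption flagged: for N observations the first argument presumably needs its rows selected per observation, e.g. `L[x]`, since bare `L` has only M rows):

```python
# rows_dot_product(A, B)[i] is the dot product of row i of A with row i of B.
# Selecting L's rows by x and eta's columns by g gives, per observation i,
# dot(L[x[i], :], eta[:, g[i]]) == (L @ eta)[x[i], g[i]] == gamma[x[i], g[i]].

def rows_dot_product(A, B):
    return [sum(a * b for a, b in zip(ra, rb)) for ra, rb in zip(A, B)]

L = [[1.0, 0.0], [0.5, 2.0]]            # M = 2, Cholesky-like factor
eta = [[1.0, 3.0], [2.0, -1.0]]         # M x K, K = 2 groups
x = [2, 1, 2]                           # grid point per observation
g = [1, 2, 2]                           # group per observation

A = [L[xi - 1] for xi in x]                             # L rows per observation
B = [[eta[m][gi - 1] for m in range(2)] for gi in g]    # eta[ , g]' rows
per_obs = rows_dot_product(A, B)

# reference: gamma = L @ eta, then index
gamma = [[sum(L[i][m] * eta[m][k] for m in range(2)) for k in range(2)]
         for i in range(2)]
assert per_obs == [gamma[xi - 1][gi - 1] for xi, gi in zip(x, g)]
```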
