I’m implementing the weighted decomposition kernel. The implementation works, but it’s ~10x slower than a numba’d python/numpy implementation. What can I do to speed it up?

The weighted decomposition kernel is:

k(s, s') = \sum_{i=1}^L \left(S(s_i, s'_i) \sum_{l\in \text{nbs}(i)} S(s_l, s'_l) \right)

Where S is a substitution matrix.

```
functions {
real wdk(int[] x1, int[] x2, int[,] adj, int L, matrix S){
vector[L] K;
real subs[L + 1];
subs[L + 1] = 0;
for (k in 1:L){
subs[k] = S[x1[k], x2[k]];
}
for (k in 1:L){
K[k] = subs[k] * sum(subs[adj[k]]);
}
return sum(K);
}
matrix wd_cov(int[,] X1, int[,] X2, int[,] adj, int L, matrix S){
int n1 = size(X1);
int n2 = size(X2);
int x1[L];
int x2[L];
matrix[n1, n2] K;
vector[n1] K1;
row_vector[n2] K2;
for (i in 1:n1){
for (j in 1:n2){
K[i, j] = wdk(X1[i], X2[j], adj, L, S);
}
}
for (i in 1:n1){
K1[i] = wdk(X1[i], X1[i], adj, L, S);
}
for (i in 1:n2){
K2[i] = wdk(X2[i], X2[i], adj, L, S);
}
K1 = sqrt(K1);
K2 = sqrt(K2);
for (i in 1:n2){
K[:, i] = K[:, i] ./ K1;
}
for (i in 1:n1){
K[i] = K[i] ./ K2;
}
return K;
}
}
data {
int<lower=1> n1;
int<lower=1> n2;
int<lower=1> L;
int<lower=1> n_subs;
int<lower=2> D;
int X1[n1, L];
int X2[n2, L];
int adj[L, n_subs];
matrix[D, D] S;
}
generated quantities {
matrix[n1, n2] cov = wd_cov(X1, X2, adj, L, S);
}
```