More efficient anisotropic calculation for Gaussian Process Kernel?

Hello! In Chapter 4 of Rasmussen and Williams (pg. 89) it mentions that
“Anisotropic versions of these isotropic covariance functions can be created by setting
r^2(x, x') = (x − x')^T M (x − x') for some positive semidefinite M.”

Assuming M is DxD and the x's are D-dimensional vectors (N of them in total), one could compute

for (i in 1:N-1) 
{
  Omega[i,i] = 0.0;
  for (j in (i+1):N) 
  {
    Omega[i,j] = quad_form_sym(M, (x[i]-x[j])');
    Omega[j,i] = Omega[i,j];
  }
}
Omega[N,N] = 0.0;

assuming M is symmetric positive definite.

My question is, how can this be vectorized? Is there a way of making a full matrix out of the x data and doing something more efficient than the double loop? Any suggestions are most welcome!

2 Likes

Okay, update: suppose I know that M has Cholesky factor L, so L L' = M, and therefore (x − x')^T M (x − x') is just the squared norm of L'(x − x'). Would the following be faster:

matrix[D,D] L_transpose = L'; // I assume this costs flops so we should do it once and not in the loop?
for (i in 1:N-1) 
{
  Omega[i,i] = 0.0;
  for (j in (i+1):N) 
  {
    Omega[i,j] = dot_self(L_transpose * (x[i]-x[j])');
    Omega[j,i] = Omega[i,j];
  }
}
Omega[N,N] = 0.0;

That is, does Stan take advantage of the lower triangular structure in the matrix-vector multiply?

1 Like

Hi,
sorry for not getting to you earlier. I think this math stuff would be better handled by @stevebronder or @syclik, as it goes a bit over my head. But I can answer a subquestion:

AFAIK Stan doesn’t do any optimization like this “magically”. There are some specialized functions to handle triangular matrices like mdivide_left_tri_low but I don’t think there is anything for multiplication.

However, implementing this yourself should be easy. Stan is compiled to C++ and pays very little penalty for loops (unlike R). Obviously this may not help much, especially if this is not a bottleneck in your model.
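
For concreteness, a rough, untested sketch of the kind of hand-written loop I mean (the function name is just a placeholder; it assumes L is lower triangular and v is a vector[D]):

functions {
  // y = L' * v, skipping the structural zeros of the lower-triangular L
  vector tri_low_transpose_multiply(matrix L, vector v) {
    int D = rows(L);
    vector[D] y;
    for (k in 1:D) {
      // row k of L' is column k of L, which is nonzero only in rows k..D
      y[k] = dot_product(L[k:D, k], v[k:D]);
    }
    return y;
  }
}

Then dot_self(tri_low_transpose_multiply(L, (x[i] - x[j])')) would stand in for dot_self(L_transpose * (x[i] - x[j])') in your loop above, assuming x[i] is a row_vector as in your code. Whether that actually beats the built-in multiply in practice, I don't know.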

Hope that helps at least a little and hope others will be able to provide a bit better advice.

2 Likes

So @martinmodrak you are suggesting something like:

for (i in 1:(N - 1)) {
  Omega[i, i] = 0.0;
  for (j in (i + 1):N) {
    Omega[i, j] = 0.0;
    for (k in 1:D) {
      real val = 0.0;  // k-th entry of L' * (x[i] - x[j])
      for (l in k:D) {
        val += L[l, k] * (x[i, l] - x[j, l]);  // L'[k, l] = L[l, k]
      }
      Omega[i, j] += square(val);
    }
    Omega[j, i] = Omega[i, j];
  }
}
Omega[N, N] = 0.0;

(if I have my math right), where each x[i] is D-dimensional.

What impact (if any) does this have on the AD stack? Is the AD “smart enough” to know to aggregate the same values of L[k,l] for differentiation?

Thanks!

Update: doing some timings (locally) this recent loop method is very slow.

Of the things I tried, in order from slowest to fastest (or least slow, in this case): the dot_self version, then quad_form_sym, then quad_form. So the quad_form version is the fastest, but it is still very, very slow.

In Python, I would precompute all of the differences x[i]-x[j] into an array, call it X, with shape NxNxD. The above computation (in NumPy) would then be:

Omega = (np.dot(X, M) * X).sum(axis=2)

So, question: is there a way of doing matrix multiplication with a tensor (3-D array) in Stan?

I would like to compute X in the transformed parameters block as dimension NxNxD and do:

some_type Omega_pre_sum[N,N,D] = (X * M) .* X;
for (i in 1:N) {
  for (j in 1:N) {
    Omega[i,j] = sum(Omega_pre_sum[i,j,:]);
  }
}

is this possible?

3 Likes

I admit this is getting beyond my expertise. Asking @stevebronder or @rok_cesnovar if they have time to discuss this.

Bumping this up to hopefully get you more attention. Maybe @bgoodri has an idea?

I also think that partial indexing via sum(Omega_pre_sum[i,j,:]) or sum(Omega_pre_sum[i,j]) should work out of the box.

Thanks for bumping this! Just an update:

Best so far (in terms of time), assuming M = L*L' is DxD:

for (i in 1:(N - 1)) {
  Omega[i,i] = 0.0;
  for (j in (i+1):N) {
    Omega[i,j] = sum((X[i,j]*M) .* X[i,j]);
    Omega[j,i] = Omega[i,j];
  }
}
Omega[N,N] = 0.0;

where X is row_vector[D] X[N,N]

I will say, with D = 64 and N = 1000 this is still extremely slow. After profiling it, the forward pass takes about 1/3 of the time and the reverse (autodiff) pass takes about 2/3.
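
(Side note in case it is useful to others: Stan 2.26+ profile blocks report forward-pass and reverse-pass times separately, so a breakdown like this can be obtained just by wrapping the loop; a minimal sketch, with an arbitrary label:)

transformed parameters {
  matrix[N, N] Omega;
  profile("omega_distances") {
    for (i in 1:(N - 1)) {
      Omega[i, i] = 0.0;
      for (j in (i + 1):N) {
        Omega[i, j] = sum((X[i, j] * M) .* X[i, j]);
        Omega[j, i] = Omega[i, j];
      }
    }
    Omega[N, N] = 0.0;
  }
}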

I would love to “vectorize” this more and take advantage of matrix derivatives if there is a way to do so.

Any ideas are MOST welcome!

Dan

1 Like

Does using a dot_product speed up the calculation?

...
Omega[i, j] = dot_product(X[i, j] * M, X[i, j]);
...

You can also try removing the symmetrizing assignment inside the loop and instead calling symmetrize_from_lower_tri(Omega) once outside the loop, though I’m not sure you’ll get much from it.
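
Something like this (rough, untested sketch; fill only the lower triangle inside the loop and symmetrize once at the end):

for (i in 1:(N - 1)) {
  Omega[i, i] = 0.0;
  for (j in (i + 1):N) {
    Omega[j, i] = dot_product(X[i, j] * M, X[i, j]);  // lower triangle only
  }
}
Omega[N, N] = 0.0;
Omega = symmetrize_from_lower_tri(Omega);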

1 Like

That helped a bit. Apologies, but looking back over the different versions I have tried, the fastest (by a considerable margin) was

for (i in 1:(N - 1)) {
  Omega[i,i] = 0.0;
  for (j in (i+1):N) {
    Omega[i,j] = quad_form(M, X[i,j]);
    Omega[j,i] = Omega[i,j];
  }
}
Omega[N,N] = 0.0;

where X is now vector[D] X[N,N]

I guess I could construct a very large block matrix out of the X[i,j] somehow?
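
Or, another thought I have not timed yet: since M = L L', I could map each point through L' once and then use the standard pairwise squared-distance identity, which turns the whole thing into a single matrix-matrix product. A rough, untested sketch (assuming here that x is declared as vector[D] x[N]):

matrix[N, D] Z;    // row i will hold (L' * x[i])'
vector[N] sq_norm;
matrix[N, N] Omega;
for (i in 1:N) {
  Z[i] = (L' * x[i])';
  sq_norm[i] = dot_self(Z[i]);
}
// (x[i] - x[j])' * M * (x[i] - x[j]) = ||z_i||^2 + ||z_j||^2 - 2 * z_i' * z_j
Omega = rep_matrix(sq_norm, N) + rep_matrix(sq_norm', N) - 2 * tcrossprod(Z);

(One caveat: with this formulation the diagonal only comes out approximately zero because of round-off, so I would probably set it to zero explicitly afterwards.)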

1 Like