ARD implementation Gaussian Process

Hello!
Does anyone know how to implement ARD (automatic relevance determination) for prediction, e.g. for two input vectors x1 and x2?
I checked an old implementation, but it only covered a single input vector x…
Thanks

Take a look at section 10.3 in the Stan User’s Guide. ARD is covered about a third of the way into that section.

Yeah, I did!
I’m afraid that implementation is only for a single input vector x…

In the ARD section I’m seeing

vector[D] x[N];

which declares an array of N vectors, each of length D. This means you have N observations, each with D inputs. I also see

vector<lower=0>[D] rho;

which is a vector of D length-scale parameters, one for each input.

matrix L_cov_exp_quad_ARD(vector[] x,
                          real alpha,
                          vector rho,
                          real delta) {
  int N = size(x);
  matrix[N, N] K;
  real sq_alpha = square(alpha);
  for (i in 1:(N - 1)) {
    K[i, i] = sq_alpha + delta;
    for (j in (i + 1):N) {
      K[i, j] = sq_alpha
                * exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));
      K[j, i] = K[i, j];
    }
  }
  K[N, N] = sq_alpha + delta;
  return cholesky_decompose(K);
}

Here, if I have two input vectors instead of one, for example x and y, do I have to put another for loop inside for the parameter rho? I am not sure how to implement it…
Any help appreciated!!

Notice:

K[i, j] = sq_alpha * exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));

(x[i] - x[j]) ./ rho is the difference between inputs x[i] and x[j], scaled by rho. Notice the ./ operator, which is elementwise division. This means each coordinate of the distance is scaled by its own length-scale, with D inputs and D length-scales in total.
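As a quick sanity check (in NumPy, with made-up numbers), the elementwise-division form and the written-out per-dimension sum are the same quantity:

```python
import numpy as np

# Illustrative values: D = 3 input dimensions, one pair of observations.
rho = np.array([1.0, 2.0, 0.5])   # one length-scale per input dimension
xi = np.array([0.3, 1.0, -0.2])
xj = np.array([0.1, 0.4, 0.5])

# Stan's dot_self((x[i] - x[j]) ./ rho): divide each coordinate difference
# by its own length-scale, then sum the squares.
dot_self = np.sum(((xi - xj) / rho) ** 2)

# Written-out equivalent: sum_d (xi_d - xj_d)^2 / rho_d^2
written_out = sum((xi[d] - xj[d]) ** 2 / rho[d] ** 2 for d in range(3))

assert np.isclose(dot_self, written_out)
```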

If you want to write out the covariance function for D dimensions without using matrices, it is:

k_{SE}(x_i, x_j;\ \alpha, \rho_1, \ldots, \rho_D) = \alpha^2 \exp\left(-0.5\sum_{d=1}^{D} \frac{(x_i^d - x_j^d)^2}{\rho_d^2}\right)

so for a 2D GP, I think the correct Stan implementation is this:

functions {
  matrix L_cov_exp_quad_ARD(vector x1, 
                            vector x2,
                            real alpha,
                            real rho1,
                            real rho2,
                            real delta) {
    int N = num_elements(x1);
    matrix[N, N] K;
    real sq_alpha = square(alpha);
    for (i in 1:(N-1)) {
      
      // Diagonal
      K[i, i] = sq_alpha + delta;
      
      for (j in (i + 1):N) {
        
        // Upper triangle
        K[i, j] = sq_alpha * 
          exp(-0.5 * square(x1[i] - x1[j]) / square(rho1)
              - 0.5 * square(x2[i] - x2[j]) / square(rho2));
        
        // Lower triangle
        K[j, i] = K[i, j];
      }
    }
    // Bottom-right corner
    K[N, N] = sq_alpha + delta;
    
    return cholesky_decompose(K);
  }
}
data {
  int<lower=1> N;
  vector[N] x1;
  vector[N] x2;
  vector[N] y;
}
transformed data {
  real delta = 1e-9;
}
parameters {
  real<lower=0> rho1;
  real<lower=0> rho2;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
model {
  vector[N] f;
  {
    matrix[N, N] L_K = L_cov_exp_quad_ARD(x1, x2, alpha, rho1, rho2, delta);
    f = L_K * eta;
  }

  rho1 ~ inv_gamma(5, 5);
  rho2 ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  eta ~ std_normal();

  y ~ normal(f, sigma);
}
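To double-check the kernel math (not the Stan code itself), here is the same 2D ARD covariance sketched in NumPy; the function and variable names are illustrative:

```python
import numpy as np

# NumPy sketch of the 2D ARD covariance above (names are illustrative).
def L_cov_exp_quad_ard(x1, x2, alpha, rho1, rho2, delta):
    N = len(x1)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            sq_dist = ((x1[i] - x1[j]) ** 2 / rho1 ** 2
                       + (x2[i] - x2[j]) ** 2 / rho2 ** 2)
            K[i, j] = alpha ** 2 * np.exp(-0.5 * sq_dist)
    K += delta * np.eye(N)              # jitter keeps the Cholesky stable
    return np.linalg.cholesky(K)        # lower-triangular factor, like Stan

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=10), rng.normal(size=10)
L = L_cov_exp_quad_ard(x1, x2, alpha=1.0, rho1=1.0, rho2=2.0, delta=1e-9)
K = L @ L.T                             # reconstruct the covariance
assert np.allclose(K, K.T)              # valid covariance: symmetric
assert np.allclose(np.diag(K), 1.0 + 1e-9)  # diagonal is alpha^2 + delta
```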

Thanks for the reply.
So the length-scale here is declared as a real (as a type), right?
And if I have a `vector<lower=0>[D] rho`, I should do the same as you did but with the ./ operator?

Correct. I simplified the 2D model by giving each length-scale its own scalar real parameter. This also requires declaring separate data vectors for each x. However, I think the model that uses matrix/vector operations may be more efficient, but I’m not sure.

Written out with separate data vectors for each input x, I don’t think you can use the ./ operator. You would need to do something like:

K[i, j] = sq_alpha * 
          exp(-0.5 * square(x1[i] - x1[j]) / square(rho[1])
              - 0.5 * square(x2[i] - x2[j]) / square(rho[2]));

There must be a more efficient way, or a built-in function in Stan… I mean, if D = 10 this would be really slow.

Exactly. Which is why the manual implements a D-dimensional GP covariance function like this:

K[i, j] = sq_alpha * exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));

with inputs and length-scales declared as:

vector[D] x[N];
...
vector<lower=0>[D] rho;
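One way to see why the vectorized form scales well: dividing each input dimension by its length-scale up front turns ARD into an ordinary isotropic kernel on the rescaled inputs, so the whole covariance matrix can be built without a per-dimension loop. A NumPy sketch with illustrative names:

```python
import numpy as np

# ARD kernel via rescaling:
# k(x_i, x_j) = alpha^2 exp(-0.5 * ||x_i ./ rho - x_j ./ rho||^2)
def ard_cov(X, alpha, rho, delta):
    Z = X / rho                                   # scale column d by rho[d]
    sq = np.sum(Z ** 2, axis=1)
    # Pairwise squared distances via the ||a||^2 + ||b||^2 - 2 a.b identity.
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    K = alpha ** 2 * np.exp(-0.5 * np.maximum(sq_dists, 0.0))
    return K + delta * np.eye(len(X))

# Check against the naive loop over pairs and dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 10))                      # N = 8 observations, D = 10
rho = rng.uniform(0.5, 2.0, size=10)
K = ard_cov(X, alpha=1.0, rho=rho, delta=1e-9)

K_naive = np.array([[np.exp(-0.5 * np.sum((X[i] - X[j]) ** 2 / rho ** 2))
                     for j in range(8)] for i in range(8)]) + 1e-9 * np.eye(8)
assert np.allclose(K, K_naive)
```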

Edit: Actually, I just remembered Stan does have some built-in covariance functions. I’m not sure whether using these instead of the manually coded version would be faster, or why the User’s Guide doesn’t use them.

There will be functions for this in the next release (meaning you won’t have to implement user-defined functions for the separate length-scale case).


Much appreciated @ScottAlder, hope to see you around more!


Because in cov_exp_quad the length-scale is interpreted only as a real number :(

Built-in covariance functions are much faster due to more efficient gradient computations. Thanks to hard work by @anon79882417, we’ll soon also have support for separate length scales, plus Matern and periodic covariance functions as built-ins. We should then add new examples to the user guide.


Were these covariance functions ever merged into Stan? I don’t see any reference to them in the current functions documentation.
