Custom distance metric with Gaussian process

Hi everyone,

I am wondering whether it is possible to use a custom distance metric with a Gaussian process in Stan. I have a classification problem, and the idea is that items which are close together should be assigned to the same class. However, the way of calculating the distance between two items is non-trivial and, as far as I can see, none of the implemented kernels do what I want. So far, I have tried to pass a distance matrix to Stan and then use those distances instead of the (x_{i} - y_{j})^2 term in the exponentiated quadratic function.

This is my code so far:

data {
  int<lower=1> N1;              // N train
  int<lower=1> N2;              // N test
  int<lower=1> N;               // train + test
  int<lower=0, upper=1> x1[N1]; // observations
  matrix[N, N] K;               // distances
  real<lower=0> gamma1;         // prior
  real<lower=0> gamma2;         // prior
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real a;
  vector[N] eta;
}
transformed parameters {
  vector[N] f;
  matrix[N, N] L_K;
  matrix[N, N] K2;

  for (i in 1:N)
    for (j in 1:N)
      K2[i, j] = square(alpha) * exp(-0.5 * rho * K[i, j]);

  for (n in 1:N)
    K2[n, n] = K2[n, n] + 1e-10;

  L_K = cholesky_decompose(K2);
  f = L_K * eta;
}
model {
  rho ~ inv_gamma(gamma1, gamma2);
  alpha ~ normal(0, 1);
  a ~ normal(0, 1);
  eta ~ normal(0, 1);

  x1 ~ bernoulli_logit(a + f[1:N1]);
}
generated quantities {
  int z2[N];
  for (n in 1:N)
    z2[n] = bernoulli_logit_rng(a + f[n]);
}

The model at least compiles and samples, but the out-of-sample predictions are not very good.

Does this set up make sense? Is there a better way to do this?


Will the posterior parameters be in any way physically meaningful in your model? I.e. will they correspond to some theoretical generative assumption? If not, there are many very good clustering algorithms that work directly off of an arbitrary distance matrix (I like affinity propagation, for example).

Thank you for your answer emiruz.

Will the posterior parameters be in any way physically meaningful in your model? I.e. will they correspond to some theoretical generative assumption?

I do not know exactly what the generative model should be, other than ‘similar items should belong to the same class’. However, methods like KNN do not work very well on this type of task because they only consider the similarity of item x to a small set of neighbors \{y_i, ..., y_j\}, whereas class assignment should take into account the similarity of x to all other items. My idea was that a GP could capture this intuition. But maybe not?

I will look into the affinity propagation method you mention.

I guess I’m saying that if what you want is a clustering algorithm, you don’t care about the values of the model parameters (rho, alpha, a), and you just want item classifications, then you’re better off using a state-of-the-art clustering algorithm, many of which work directly off an arbitrary distance matrix. AP is great, but there are many others. Have a look at exemplar-based clustering in general.

I think this is par for the course.

Yep. Here’s code for an ARD exponential quadratic from the user’s guide:

functions {
  matrix L_cov_exp_quad_ARD(vector[] x,
                            real alpha,
                            vector rho,
                            real delta) {
    int N = size(x);
    matrix[N, N] K;
    real sq_alpha = square(alpha);
    for (i in 1:(N - 1)) {
      K[i, i] = sq_alpha + delta;
      for (j in (i + 1):N) {
        K[i, j] = sq_alpha
                  * exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));
        K[j, i] = K[i, j];
      }
    }
    K[N, N] = sq_alpha + delta;
    return cholesky_decompose(K);
  }
}

This is using the squared L2 norm (the squared Euclidean distance). You’d want to replace the dot_self((x[i] - x[j]) ./ rho) term with whatever distance metric you want to use.
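As a quick numerical check of what that term actually computes (a toy sketch, not from the thread): for a scalar length-scale rho, dot_self((x[i] - x[j]) ./ rho) is just the squared Euclidean distance divided by rho squared, which is what any precomputed distance matrix D would have to stand in for:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))   # 5 toy items, 3 features each
rho = 1.7                     # scalar length-scale

# Squared Euclidean distance matrix, precomputed outside the model.
D = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)

# dot_self((x[i] - x[j]) ./ rho) with scalar rho equals D[i, j] / rho**2.
i, j = 1, 4
lhs = np.dot((x[i] - x[j]) / rho, (x[i] - x[j]) / rho)
rhs = D[i, j] / rho**2
assert np.isclose(lhs, rhs)
```

So substituting a precomputed D[i, j] for the dot_self term means the division should be by rho squared, not rho, if the two parameterizations are meant to match.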

This won’t work; distance is defined implicitly in C++ for each covariance function. The built-in ones are implemented as in GPML, Chapter 4.

I might verify (prove or code) that the kernel you’ve implemented is real symmetric (K = K^T for K \in \mathbb{R}^{n \times n}) and positive semi-definite (no eigenvalue is less than 0).
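That check is straightforward to do numerically before sampling. A minimal sketch (the helper name is hypothetical, and the toy kernel mirrors the K2[i,j] = alpha^2 * exp(-0.5 * rho * D[i,j]) form from the model above):

```python
import numpy as np

def is_valid_covariance(K2, tol=1e-8):
    """Check that K2 is symmetric and positive semi-definite."""
    symmetric = np.allclose(K2, K2.T)
    # eigvalsh is the right eigenvalue routine for symmetric matrices;
    # PSD means no eigenvalue falls (numerically) below zero.
    psd = np.all(np.linalg.eigvalsh(K2) >= -tol)
    return symmetric and psd

# Toy distance matrix from points on a line: D[i, j] = (i - j)^2.
idx = np.arange(4)
D = (idx[:, None] - idx[None, :]) ** 2.0
alpha, rho = 1.0, 0.5
K2 = alpha**2 * np.exp(-0.5 * rho * D)
assert is_valid_covariance(K2)
```

A check like this on the actual precomputed D (for a representative subset of items) would catch an invalid metric/kernel combination before Stan's cholesky_decompose fails mid-sampling.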

Thanks for your answer! I am using a variation on Levenshtein’s distance, so I’m not sure this is implementable in Stan, or that it would sample in reasonable time.
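For reference, the plain Levenshtein distance (not the poster's variant, which isn't specified in the thread) is cheap to precompute outside Stan with the standard dynamic program:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```

Precomputing the full N-by-N matrix this way and passing it as data sidesteps the question of implementing the metric in Stan; only the kernel transformation of the distances needs to happen inside the model.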

This won’t work, distance is defined implicitly in C++ for each covariance function.

I don’t quite understand why I can’t pass the x[i] - x[j] terms precomputed as data to the model. This is the bit I compute outside.

The built in covariance functions compute the distance implicitly. It doesn’t really make sense to take a pre-computed distance, and then input it into a function that then takes the distance again, right?

You can do whatever you’d like, you just have to be careful to implement the covariance function correctly in the functions block.

Right. I guess a different way of asking the question is: if in your code above I replace

sq_alpha * exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));

with

K[i, j] = sq_alpha * exp(-0.5 * D[i, j] / rho);

where D is a matrix of precomputed distances, would the result be a reasonable thing? Is this nonsense?
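One way to probe that empirically (a numerical sketch with toy data, not the poster's actual distances): build the kernel from a precomputed D, add the same jitter as the Stan model, and check that the Cholesky factorization succeeds, which is exactly what cholesky_decompose requires:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 2))
# Precomputed squared Euclidean distances (stand-in for any metric).
D = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)

sq_alpha, rho = 1.0, 2.0
K = sq_alpha * np.exp(-0.5 * D / rho**2)  # mirrors K[i, j] in the question
K[np.diag_indices_from(K)] += 1e-10       # jitter, as in the Stan model

# np.linalg.cholesky raises LinAlgError if K is not positive definite.
L = np.linalg.cholesky(K)
assert np.allclose(L @ L.T, K)
```

With squared Euclidean distances this always succeeds (it is the ordinary exponentiated quadratic kernel); with an arbitrary metric the factorization can fail, which is Aki's point below about validity.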

First, it’s important to realize that not all combinations of a metric and a covariance function produce a valid positive definite covariance matrix. I’m not familiar with the Levenshtein distance, so I don’t know if it is valid with cov_exp_quad.

Thank you Aki.

The result is a positive definite covariance matrix, and the model does sample. My worry was that there would be something fundamentally wrong about passing precomputed distances, but I guess not?

You can use a precomputed distance matrix (it’s just not supported by cov_exp_quad).