Kernel parameter estimation in multivariate Gaussian Process Regression

Disclaimer: I’m new to Stan and have a low level of math training. I followed the examples here to experiment with an application that would be useful for my research, and to learn by doing.

I would like to create a general model to estimate multiple \theta_k parameters in a Gaussian process regression context with an anisotropic squared exponential kernel of the form

K_{\theta}(x, x') = \alpha^2 \exp\{-\sum^p_{k=1}(x_k - x'_k)^2/\theta_k\}

I basically want to know the ideal shape of the kernel in multi-dimensional space.

To do so, I modified the ARD example in the Stan user guide and got the following implementation:

functions {
  
  matrix cov(
    vector[] x,
    real alpha,
    vector theta
  ) {
    
    // prepare variables
    int N = size(x);
    matrix[N, N] K;
    matrix[N, N] K_cholesky;
    real sq_alpha = square(alpha);

    // calculate covariance matrix
    for (i in 1:(N-1)) {
      K[i, i] = sq_alpha;
      for (j in (i + 1):N) {
        K[i, j] = sq_alpha * exp((-1) * dot_self((x[i] - x[j]) ./ theta));
        K[j, i] = K[i, j];
      }
    }
    K[N, N] = sq_alpha;
    
    // apply cholesky decomposition on covariance matrix
    K_cholesky = cholesky_decompose(K);
    
    return K_cholesky;
    
  }
  
}

data {
  int<lower=1> N;
  int<lower=1> D;
  vector[D] x[N];
  vector[N] y;
}

parameters {
  real<lower=0> alpha;
  vector<lower=0>[D] theta;
  real<lower=0> sigma;
  vector[N] eta;
}

model {
  
  // latent GP function values via the non-centered (Cholesky) parameterization
  vector[N] f;
  {
    matrix[N, N] L_K = cov(x, alpha, theta);
    f = L_K * eta;
  }

  theta ~ uniform(0, 3);
  sigma ~ std_normal();
  eta ~ std_normal();

  y ~ normal(f, sigma);
  
}

When I run the following test code with RStan, I get results that generally make sense to me.

ind <- expand.grid(
  x1 = 1:3,
  x2 = 1:3,
  x3 = 1:3
)
dep <- c(rep(-100, 9), rep(0, 9), rep(100, 9))

fit <- rstan::stan(
  file = "code/bayesian_inference/gpr.stan", 
  data = list(
    D = 3,
    x = ind,
    N = length(dep),
    y = dep
  ),
  chains = 1,
  cores = 1
)

\theta_1 and \theta_2 are consistently bigger than \theta_3, which is in my opinion expected with this test data, because the differences in y along the x3-axis are much bigger than those along x1 and x2.
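
For example, the estimates can be inspected from the posterior summaries with something like:

print(fit, pars = c("theta", "alpha", "sigma"))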

I now have a series of very basic questions and I hope you can give me some advice:

  1. Is this the right approach to estimate the parameters I’m interested in?
  2. I get a very high number of divergent transitions, and I wonder if the reason for this is a bad model?
  3. I adopted \alpha from the ARD example and kept it to avoid an error with the Cholesky decomposition, where the matrix otherwise tends to not be positive definite. I would actually prefer to drop \alpha to keep the model simpler. Is there a way to achieve this?

When using a uniform prior, make the constraints of the parameters match the prior. So do:

vector<lower=0, upper=3>[D] theta;

But we also encourage people not to put hard constraints on parameters that don’t necessarily need them. Maybe just go with a normal(0, 1.5) prior? That’ll put about 95% of the prior mass below 3. If a parameter wants to go outside of 3, let it! That may not be what you expected, but it might help you learn something about your model.

I expect that’s where the divergences come from.
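
To be concrete, the change would look something like this (only the lines that change are shown; 1.5 is just one reasonable scale):

parameters {
  vector<lower=0>[D] theta;  // keep the lower bound, drop the upper bound
  // ... other parameters unchanged
}

model {
  theta ~ normal(0, 1.5);  // acts as a half-normal prior because of the <lower=0> constraint
  // ... rest of the model unchanged
}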

A lot of the time with GPs, people add a very small number to the diagonal to keep the Cholesky decomposition from complaining that the matrix isn’t positive definite. Like 1e-10, or something. I don’t think it’s \alpha that is doing that here.
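
In your cov function that would look something like this (1e-10 is an arbitrary small value; anything tiny relative to \alpha^2 works):

    // calculate covariance matrix, adding a tiny jitter to the diagonal
    for (i in 1:(N-1)) {
      K[i, i] = sq_alpha + 1e-10;
      for (j in (i + 1):N) {
        K[i, j] = sq_alpha * exp((-1) * dot_self((x[i] - x[j]) ./ theta));
        K[j, i] = K[i, j];
      }
    }
    K[N, N] = sq_alpha + 1e-10;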

I guess the simplest test to understand is one where you’ve generated data from the same model you’re fitting.

In this case you’re passing in data generated without any randomness. That can seem simple, but it can lead to convergence problems (the noise parameter goes to zero, etc.). Generally pass in noisy data when testing models, and generate that data to match the model you’re going to fit as closely as is reasonable.
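
For your test case that could just mean adding some Gaussian noise to dep in R before passing it to Stan (the sd of 5 here is an arbitrary choice):

# add observation noise to the otherwise deterministic test response
set.seed(123)
dep <- c(rep(-100, 9), rep(0, 9), rep(100, 9))
dep <- dep + rnorm(length(dep), mean = 0, sd = 5)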

Thanks for this insight, @bbbales2! Very useful. Making the prior for \theta less restrictive did indeed reduce the number of divergent transitions. I added some noise to my test data, and now I get no or almost no divergent transitions. Excellent! My real data does not follow a regular grid pattern, so hopefully I won’t have problems there either.

I wanted to follow your advice and generate data from the model, but I’m having difficulties with the syntax here. Shouldn’t it be possible to simply add something like the following code to the Stan script?

generated quantities {
  vector[N] y_sim;
  for(i in 1:N) {
    y_sim[i] = normal_rng(f[i], sigma);
  }
}

I can’t compile this though, because the vector f does not seem to exist in the generated quantities scope. Is it possible to make it available there, or is this logically or syntactically (in Stan) not possible?

Good to hear!

Define f in the transformed parameters block. Something like:

transformed parameters {
  vector[N] f;
  {
    matrix[N, N] L_K = cov(x, alpha, theta);
    f = L_K * eta;
  }
}

model {
  theta ~ uniform(0, 3);
  sigma ~ std_normal();
  eta ~ std_normal();

  y ~ normal(f, sigma);
  
}

Things defined in the model block are not exposed to generated quantities (but transformed parameters are).

Ok - thank you for this incredibly fast reply! I also had to change the generated quantities section to

generated quantities {
  real y_sim[N];
  y_sim = normal_rng(f, sigma);
}

The result is very interesting for me! I get predictions for y_sim as well as for f (I guess because it is now a transformed parameter). They are not exactly equal, but both are very close to the input y. I take this as an indication that the model makes very accurate predictions in this toy example.
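
For example, one way to look at this comparison in R is to put the posterior means next to the input (a quick sketch, using the fit object from above):

# posterior means of f and y_sim next to the observed y
post <- rstan::extract(fit)
round(cbind(y = dep, f_mean = colMeans(post$f), y_sim_mean = colMeans(post$y_sim)), 1)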

Altogether, that should mean that my model generally behaves as I wanted it to. I will do some further experimenting and eventually move on to my real data.
