How does one do predictive inference with the sum of Gaussian processes?

The manual offers a nice example for a single GP:

data {
  int<lower=1> N1;
  array[N1] real x1;
  vector[N1] y1;
  int<lower=1> N2;
  array[N2] real x2;
}
transformed data {
  real delta = 1e-9;
  int<lower=1> N = N1 + N2;
  array[N] real x;
  for (n1 in 1:N1) {
    x[n1] = x1[n1];
  }
  for (n2 in 1:N2) {
    x[N1 + n2] = x2[n2];
  }
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);

    // diagonal elements
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }

    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
}
model {
  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  eta ~ std_normal();

  y1 ~ normal(f[1:N1], sigma);
}
generated quantities {
  vector[N2] y2;
  for (n2 in 1:N2) {
    y2[n2] = normal_rng(f[N1 + n2], sigma);
  }
}

Source: Gaussian Processes

Specifically, this is the key line that specifies the kernel:
matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);

Now, if I have multiple kernels, can I just add them together in this line and keep the rest of the code intact? That is:
matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho) + gp_dot_prod_cov(x, sigma) + gp_exponential_cov(x, sigma2, length_scale);

I am asking because this seems too good to be true, but I am not entirely sure. Thank you.

I am going to answer my own question, having discovered the solution from reading another thread and the manual:

//https://mc-stan.org/docs/stan-users-guide/gaussian-processes.html#predictive-inference-with-a-gaussian-process
//https://likelyllc.com/2022/08/03/additive-gaussian-process-time-series-regression-in-stan/

functions {
  // Analytic predictive draws for a GP with the summed
  // exp-quad + dot-product kernel: condition the latent GP on the
  // observed y at inputs x1 and sample f at the new inputs x2.
  vector gp_pred_rng(array[] vector x2,
                     vector y,
                     array[] vector x1,
                     real m,       // marginal standard deviation (magnitude)
                     real l,       // length scale
                     real sig,     // dot-product kernel scale
                     real sigma,   // observation noise scale
                     real delta) { // jitter for numerical stability
    int N1 = rows(y);
    int N2 = size(x2);
    vector[N2] f2;
    {
      // K = k(x1, x1) + sigma^2 * I, with k the summed kernel
      matrix[N1, N1] K = add_diag(gp_exp_quad_cov(x1, m, l) + gp_dot_prod_cov(x1, sig), square(sigma));
      matrix[N1, N1] L_K = cholesky_decompose(K);
      // K^{-1} * y via two triangular solves
      vector[N1] L_K_div_y = mdivide_left_tri_low(L_K, y);
      vector[N1] R_K_div_y = mdivide_right_tri_low(L_K_div_y', L_K)';
      // cross-covariance k(x1, x2), built from the same kernel sum
      matrix[N1, N2] k_x1_x2 = gp_exp_quad_cov(x1, x2, m, l) + gp_dot_prod_cov(x1, x2, sig);
      // predictive mean and covariance
      vector[N2] f2_mu = (k_x1_x2' * R_K_div_y);
      matrix[N1, N2] v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
      matrix[N2, N2] cov_f2 = gp_exp_quad_cov(x2, m, l) + gp_dot_prod_cov(x2, sig) - v_pred' * v_pred;
      matrix[N2, N2] diag_delta = diag_matrix(rep_vector(delta, N2));
      f2 = multi_normal_rng(f2_mu, cov_f2 + diag_delta);
    }
    return f2;
  }
}
data {
  int<lower=1> N;
  int<lower=1> D;
  array[N] vector[D] x;
  vector[N] y;

  int<lower=1> N_pred;
  array[N_pred] vector[D] x_pred;
}
transformed data {
  real delta = 1e-9;
  vector[N] mu = rep_vector(0, N);
}
parameters {
  real<lower=0> l;
  real<lower=0> m;
  real<lower=0> sig0;

  real<lower=0> sigma;
}
transformed parameters {
  matrix[N, N] L_K;
  {
    matrix[N, N] K = gp_exp_quad_cov(x, m, l) + gp_dot_prod_cov(x, sig0);
    
    K = add_diag(K, square(sigma));
    L_K = cholesky_decompose(K);
  }
}
model {
  m ~ normal(0, 1);
  l ~ normal(0, 1);
  sig0 ~ normal(0, 1);

  sigma ~ normal(0, 1);
  
  y ~ multi_normal_cholesky(mu, L_K);
}
generated quantities {
  vector[N_pred] f_pred;
  vector[N_pred] y_pred;

  f_pred = gp_pred_rng(x_pred, y, x, m, l, sig0, sigma, delta);
  for (n2 in 1:N_pred) {
    y_pred[n2] = normal_rng(f_pred[n2], sigma);
  }
}

NOTE: A bug in the code was discovered on 2024-12-11 and has been corrected.


The GP covariance functions in Stan are basically shorthand for looping over all pairwise combinations of the elements in x, given the GP hyperparameters. So depending on how you want to combine different kernels, it could be extremely straightforward, or not so much. If I understand correctly, you do want to straight up add the kernels for each pair (x_i, x_j), which would be straightforward in a loop, as the sketch below shows.
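For concreteness, here is a minimal sketch (assuming 1-D real inputs; exp_quad_loop is a hypothetical helper name) of the explicit double loop that gp_exp_quad_cov is shorthand for:

functions {
  // Hypothetical loop-based equivalent of gp_exp_quad_cov for 1-D inputs,
  // shown only to illustrate the pairwise construction of the kernel matrix.
  matrix exp_quad_loop(array[] real x, real alpha, real rho) {
    int N = size(x);
    matrix[N, N] K;
    for (i in 1:N) {
      for (j in 1:N) {
        K[i, j] = square(alpha) * exp(-0.5 * square(x[i] - x[j]) / square(rho));
      }
    }
    return K;
  }
}

Summing kernels just means summing inside that loop body, entry by entry.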

If you unpack this statement into its components under the sum operation, you are creating three N \times N matrices and adding their elements in the same positions – this works elementwise both mathematically and computationally, so it’s not too good to be true, it just is. For multiplication this would not hold under matrix multiplication (mathematically, at least), but it would under elementwise multiplication, as sketched below.
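To make that concrete, here is a sketch reusing the variable names from the question (the hyperparameter names are whatever you declare in your own program):

// Sum kernel: elementwise addition of the three covariance matrices.
matrix[N, N] K_sum = gp_exp_quad_cov(x, alpha, rho)
                     + gp_dot_prod_cov(x, sigma)
                     + gp_exponential_cov(x, sigma2, length_scale);

// Product kernel: elementwise (Hadamard) multiplication via Stan's .*
// operator; plain * would be matrix multiplication, which is not a
// valid way to combine kernels.
matrix[N, N] K_prod = gp_exp_quad_cov(x, alpha, rho)
                      .* gp_exponential_cov(x, sigma2, length_scale);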


The example on the cover of the 3rd edition of @andrewgelman et al.'s Bayesian Data Analysis is for @avehtari’s model of the “birthday problem”, which is a GP that adds a bunch of covariance matrices together (see page 505).

Thank you @Bob_Carpenter and @caesoma for responding. I realize my initial question was not clear.

Where I needed help was NOT with the line in the transformed parameters block that specifies multiple kernels:
matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho) + gp_dot_prod_cov(x, sigma) + gp_exponential_cov(x, sigma2, length_scale);
I understand this line is correct for multiple kernels.

Where I needed help was with the predictive inference in the generated quantities block. Specifically, I was not sure whether this loop is still OK now that f is built from multiple kernels:

  for (n2 in 1:N2) {
    y2[n2] = normal_rng(f[N1 + n2], sigma);
  }

I tested it with some simulated data and it appeared to be wrong, but to be honest I am not 100% sure about this.

Nevertheless, I went ahead and used the analytical form of the joint predictive inference and wrote out the code in my own reply above. This seems to be correct.
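For reference, the analytic predictive distribution that gp_pred_rng draws from is the standard Gaussian conditioning formula, with k the summed kernel, K_{11} = k(x_1, x_1), K_{21} = k(x_2, x_1), and so on:

$$
f_2 \mid y \sim \mathcal{N}\!\left( K_{21} (K_{11} + \sigma^2 I)^{-1} y,\;\; K_{22} - K_{21} (K_{11} + \sigma^2 I)^{-1} K_{12} \right)
$$

Since a sum of valid kernels is itself a valid kernel, nothing in this derivation changes when k is a sum.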

You want to do predictive inference with the same setup as you fit with. Shouldn’t that be a multivariate normal with covariance given by the GP?

As @Bob_Carpenter mentions, if your fitted model is correct, the extrapolations will simply use the same exact construct, just for a different interval.

That block will still apply with whatever kernels f is constructed from. As a suggestion, you can try generating within the [1:N1] interval for a more straightforward check (see the sketch below) – it’s plausible that the kernel sum model is just not as good a fit, in which case neither would the simulations be.
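A minimal sketch of that in-sample check, reusing the names from the first program (y_rep is a hypothetical name for the replicated data):

generated quantities {
  // Posterior predictive replication over the observed inputs x[1:N1];
  // compare y_rep against y1 to check the fitted kernel sum.
  vector[N1] y_rep;
  for (n1 in 1:N1) {
    y_rep[n1] = normal_rng(f[n1], sigma);
  }
}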
