Gaussian Process out-of-sample predictive distribution

Thanks for making this so clear and easy to read. I now see why you’re asking this question.

Your understanding is mostly fine; my intent wasn’t to be condescending. I appreciate your questions, and I realize that our documentation could be clearer.

Here are the main things I think you’re missing:

  1. This \textbf{y} would just be the time series of responses.
  2. You have to implement the code for the predictive equations.
  3. Start simple.

First, I’m providing a simple example on simulated data below. Hopefully, once this is clear, it will be easier to see how to incorporate the method into a larger model. It’s a simplified version of example 21.2 in BDA3.

Here’s the function I’m simulating (R code below; I’ve mean-centered and standardized in some places, but you can re-scale after fitting the model):

y_t(x) = f_1(x) + f_2(x) + \epsilon_t

For:

f_1(x) = 2 x (a simple linear function)
f_2(x) = sin(\frac{x}{2\pi}) (a sine wave)

Which looks like this:
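In case it helps, here’s a quick NumPy sketch of the same simulation (the attached R script is the reference; the grid and noise scale here are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the toy series: linear trend + sine wave + Gaussian noise.
x = np.linspace(0.0, 100.0, 200)
f1 = 2.0 * x                     # f_1(x) = 2x, the linear trend
f2 = np.sin(x / (2.0 * np.pi))   # f_2(x) = sin(x / (2*pi)), the sine wave
y = f1 + f2 + rng.normal(0.0, 0.5, size=x.size)

# Mean-center and standardize before fitting; re-scale afterwards.
y_std = (y - y.mean()) / y.std()
```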

And I’m going to try to recover f_1 and f_2. So I use the following model:

f_1(x) \sim GP(0, k_1), \quad k_1(x, x') = \sigma_1^2 \langle x, x' \rangle
f_2(x) \sim GP(0, k_2), \quad k_2(x, x') = \sigma_2^2 \exp\left(-\frac{d(x, x')^2}{2 l_2^2}\right)

Where \langle x, x' \rangle is an inner product (so k_1 is the dot-product, or linear, kernel) and d(x, x') is the l-2 norm/Euclidean distance (so k_2 is the squared-exponential kernel).
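These two kernels are easy to write down directly; here’s a NumPy sketch (the function names and hyperparameter values are mine, purely for illustration):

```python
import numpy as np

def k_dot_prod(x1, x2, sigma):
    # Linear (dot-product) kernel on scalar inputs: sigma^2 * <x, x'>
    return sigma**2 * np.outer(x1, x2)

def k_exp_quad(x1, x2, sigma, ell):
    # Squared-exponential kernel: sigma^2 * exp(-d(x, x')^2 / (2 * ell^2))
    d2 = (np.asarray(x1)[:, None] - np.asarray(x2)[None, :])**2
    return sigma**2 * np.exp(-d2 / (2.0 * ell**2))

x = np.linspace(0.0, 10.0, 5)
K_lin = k_dot_prod(x, x, sigma=1.0)
K_se = k_exp_quad(x, x, sigma=1.0, ell=2.0)
K_total = K_lin + K_se   # kernels combine additively
```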

One caveat when combining kernels: you have to specify which trend you’d like to make predictions for. If you want to predict just the linear trend’s contribution to the time series, you need to implement the posterior predictive accordingly. So far we’ve constructed the following covariance matrix:

k_{total} = k_1 + k_2

If we want to make predictions for the linear function via k_{1,(\tilde{x}, x)}, we can use the equations you’ve found in GPML, but for the cross-covariance matrix (the left-most term in the posterior predictive mean) we use the covariance structure we’d like to make predictions for. So, for the linear kernel:

\tilde{f}_1 = k_{1,(\tilde{x}, x)} (k_{total} + I_N \sigma_n^2)^{-1}\textbf{y}

Where k_{1,(\tilde{x}, x)} is the cross-covariance matrix computed from the linear kernel, between the data we’ve trained the model on and the points we’d like predictions at (I just used seq(0, 100, length.out = 1000) in R). The Stan code for the posterior predictive mean of a GP, which you need to implement in the functions block, is:

  // Posterior predictive mean for the linear (dot-product) component,
  // conditioning on the full additive covariance.
  vector gp_pred_dot_prod_rng(vector[] x2,
                              vector y, vector[] x1,
                              real m, real l,
                              real sig,
                              real sigma) {
    int N1 = rows(y);
    int N2 = size(x2);
    vector[N2] f;
    {
      // Training covariance: k_total = k_exp_quad + k_dot_prod, plus noise
      matrix[N1, N1] K = add_diag(gp_exp_quad_cov(x1, m, l) +
                                  gp_dot_prod_cov(x1, sig), square(sigma));
      matrix[N1, N1] L_K = cholesky_decompose(K);
      // Two triangular solves compute (k_total + I sigma^2)^{-1} y
      vector[N1] L_K_div_y = mdivide_left_tri_low(L_K, y);
      vector[N1] K_div_y = mdivide_right_tri_low(L_K_div_y', L_K)';
      // Cross-covariance from the *linear* kernel only: k_{1,(x~, x)}
      matrix[N1, N2] k_x1_x2 = gp_dot_prod_cov(x1, x2, sig);
      f = (k_x1_x2' * K_div_y);
    }
    return f;
  }
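For comparison, here’s the same computation sketched in NumPy (the helper names, kernels, and hyperparameters are mine, not from the attached code):

```python
import numpy as np

def dot_prod(x1, x2, sig):
    # Linear kernel: sig^2 * <x, x'>
    return sig**2 * np.outer(x1, x2)

def exp_quad(x1, x2, m, l):
    # Squared-exponential kernel with marginal SD m and length-scale l
    d2 = (x1[:, None] - x2[None, :])**2
    return m**2 * np.exp(-d2 / (2.0 * l**2))

def gp_pred_linear_mean(x2, y, x1, m, l, sig, sigma):
    # Full training covariance: k_total = k_se + k_lin, plus noise variance
    K = exp_quad(x1, x1, m, l) + dot_prod(x1, x1, sig) \
        + sigma**2 * np.eye(x1.size)
    # Cholesky-based solve for (k_total + I sigma^2)^{-1} y,
    # mirroring the two triangular solves in the Stan function
    L = np.linalg.cholesky(K)
    K_div_y = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # Cross-covariance from the *linear* kernel only
    k_x1_x2 = dot_prod(x1, x2, sig)
    return k_x1_x2.T @ K_div_y
```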

And plotting the estimated \tilde{f}_1 against the simulated f_1:

And then, analogously, for the squared-exponential:

\tilde{f}_2 = k_{2,(\tilde{x}, x)} (k_{total} + I_N \sigma_n^2)^{-1}\textbf{y}

And finally, the additive model:

\tilde{f}_{total} = k_{total,(\tilde{x}, x)} (k_{total} + I_N \sigma_n^2)^{-1}\textbf{y}
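Note that because the predictive mean is linear in the cross-covariance, the component predictions add up exactly to the additive model’s prediction. A small numerical check (kernels and hyperparameters here are illustrative, not the fitted values):

```python
import numpy as np

# With alpha = (k_total + I sigma_n^2)^{-1} y, linearity gives
# k_1* alpha + k_2* alpha = (k_1* + k_2*) alpha = k_total* alpha.

rng = np.random.default_rng(0)
x = np.linspace(0.1, 10.0, 30)      # training inputs
x_new = np.linspace(0.1, 10.0, 15)  # prediction inputs
y = 2.0 * x + np.sin(x / (2.0 * np.pi)) + rng.normal(0.0, 0.1, x.size)

K1 = np.outer(x, x)                                # linear kernel
K2 = np.exp(-(x[:, None] - x[None, :])**2 / 8.0)   # squared exponential
alpha = np.linalg.solve(K1 + K2 + 0.1 * np.eye(x.size), y)

f1_tilde = np.outer(x_new, x) @ alpha
f2_tilde = np.exp(-(x_new[:, None] - x[None, :])**2 / 8.0) @ alpha
f_total_tilde = f1_tilde + f2_tilde
```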

It’s not perfect, but I hope it gets the point across. There are lots of design choices the modeler needs to make. For example, I should have used the periodic covariance function (gp_periodic_cov) instead of the squared exponential, and fixed the period if that information were known. I could have modeled the noise more explicitly for each function. And I should have put more effort into thinking about priors; I just used normal(0, 1) to be quick.

I’ve attached all the code (R and Stan). I understand that I’ve used my own notation and language, and this doesn’t quite line up with the notation in the StanCon case study you’ve been using, so I’m happy to answer any questions, and I’ll return to how to do this with the StanCon example at a later time.

gp_discourse_sim.R (1.9 KB)
linear_exp_quad_gp.stan (3.1 KB)
