Gaussian Process Predictive Posterior Function


#1

As it stands, users currently need to write out functions for predictive posterior distributions for gaussian processes. This isn’t such a big deal for single kernel models, but when the covariance structure becomes more complex, we need write a pred_rng function for each subspace we want to predict. To use a well known example, if we have the sum of kernels ( i.e. bday problem):

y_t(t) = f_1(t) + f_2(t) + f_3(t) + f_4(t) + f_6(t) + f_7(t) + \epsilon

For the predictive mean for the desired covariance subspace (without loss of generality, say, f_1), we have the following mean function:

(1) E(f_1) = K_1(x, x') (K(x, x) + \sigma^2 I)^-1 y

where f_1 \sim GP(\textbf{0}, K_1(x, x))

At least two items need be address:

  1. Currently, in the Stan Documentation, we are only considering the case where we have one kernel (from @rtrangucci’s hierarchical GP case study):
vector gp_pred_rng(real[] x_pred,
vector y_is,
real[] x_is,
real alpha,
real length_scale,
real sigma) {
vector[size(x_pred)] f_pred;
int N_pred;
int N;
N_pred = size(x_pred);
N = rows(y_is);
{
matrix[N, N] L_Sigma;
vector[N] K_div_y_is;
matrix[N, N_pred] k_x_is_x_pred;
matrix[N, N_pred] v_pred;
vector[N_pred] f_pred_mu;
matrix[N_pred, N_pred] cov_f_pred;
matrix[N_pred, N_pred] nug_pred;
matrix[N, N] Sigma;
Sigma = cov_exp_quad(x_is, alpha, length_scale);
for (n in 1:N)
Sigma[n, n] = Sigma[n,n] + square(sigma);
L_Sigma = cholesky_decompose(Sigma);
K_div_y_is = mdivide_left_tri_low(L_Sigma, y_is);
K_div_y_is = mdivide_right_tri_low(K_div_y_is',L_Sigma)';
k_x_is_x_pred = cov_exp_quad(x_is, x_pred, alpha, length_scale);
f_pred_mu = (k_x_is_x_pred' * K_div_y_is);
v_pred = mdivide_left_tri_low(L_Sigma, k_x_is_x_pred);
cov_f_pred = cov_exp_quad(x_pred, alpha, length_scale) - v_pred' * v_pred;
nug_pred = diag_matrix(rep_vector(1e-12,N_pred));
f_pred = multi_normal_rng(f_pred_mu, cov_f_pred + nug_pred);
}
return f_pred;
}

whereas, when we have a more complex covariance structure, continuing the example from above, we need to leverage the entire covariance structure, and then project onto a subspace with the desired structure. Consider BDA3, section 21.2, where we want the predictive posterior for the “weakly quasi-periodic” pattern, we need implement a pred_rng function that looks something like this:

  vector gp_pred_periodic_exp_quad_rng(real[] x2,
                     vector y, real[] x1,
                     matrix k,
                     real magnitude_1, real length_scale_1,
                     real magnitude_2, real length_scale_2, real period,
                     real sigma, real jitter) {                              
    // x2:               test data
    // x1:               training data
    // magnitude_1:      magnitude of squared exponential kernel
    // length_scale_1    length scale of squared exponential kernel
    // magnitude_2:        magnitude of periodic kernel
    // length_scale_2:   length_scale of periodic kernel
    // period:           perio of periodic kernel
    // sigma:         model noise
    // jitter:        for numerical stability for inverse, etc.

    int N1 = rows(y);
    int N2 = size(x2);
    vector[N2] f;
    {
      matrix[N1, N1] K = k; // + diag_matrix(rep_vector(square(sigma), N1));
      matrix[N1, N1] L_K = cholesky_decompose(K);

      vector[N1] L_K_div_y = mdivide_left_tri_low(L_K, y);
      vector[N1] K_div_y = mdivide_right_tri_low(L_K_div_y', L_K)';
      matrix[N1, N2] k_x1_x2 = gp_periodic_cov(x1, x2, magnitude_2, length_scale_2, period) .*
                               cov_exp_quad(x1, x2,  magnitude_1, length_scale_1);
      vector[N2] f_mu = (k_x1_x2' * K_div_y);

      matrix[N1, N2] v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
      matrix[N2, N2] cov_f =   gp_periodic_cov(x2, magnitude_2, length_scale_2, period) .*
                               cov_exp_quad(x2,  magnitude_1, length_scale_1) - v_pred' * v_pred   
                              + diag_matrix(rep_vector(jitter, N2)); 
     f = multi_normal_rng(f_mu, cov_f_diag);
    }
    return f;
  }

which is just eq (1) above, where K_1(x,x') is the predictive covariance for training and test sets. One can check out R&W for the equations for the predictive variance.

I haven’t worked out all the details (I’m fairly certain the functional function signature below is inaccurate), but to make gaussian processes in Stan more manageable, I’m proposing a function signature where we can read in the entire kernel, that either pass in a pointer to a function that computes the necessary test data kernel, or similarly, takes in the pre-computed cross correlation between test and training sets, and the total kernel, and then returns the desired predictive distribution (and, if need be, the test-test correlation). The function signature for the latter would look like this:

gp_pred(matrix test_kernel, matrix test_train_kernel, matrix train_kernel)

or, if we go with the former, something like this:

gp_pred(func pred_kernel, matrix x_train, matrix x_test, vector parameters[], matrix train_kernel)

Another caveat that need be discussed is the stability of the predictive variance in high dimensions. After spending many hours of my time wondering why my predictive variance was not positive definite, I mustered up some courage to ask @avehtari why this was the case. It turns out, that in high dimensions, the predictive variance is often unstable. As a default in GPstuff, the default is to only use the diagonal of the predictive variance matrix, which tends to over estimate uncertainty. You’ll notice in the BDAY problem, that only the predictive mean was used, likely due to the high dimensional instability of this structure. So if we write a function like this, what should the defaults be? Just a predictive mean? or only use the diagonal of the predictive variance?

Thoughts on this?


#2

Are we just talking about conditional distributions here for a multivariate normal?

So like,

y \sim N(\mu, \Sigma)

Where y = \begin{pmatrix} y_1\\ y_2 \end{pmatrix}, \mu = \begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}

and \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{12}^T & \Sigma_{22} \end{pmatrix}

And the conditional probability of y_1 given y_2 is (I’m just copying from https://www.stat.ubc.ca/~will/cx/private/normal_properties.pdf so I don’t get stuff wrong – I’d assume it’s more intuitive to write it y_2 given y_1):

p(y_1 | y_2) = \text{normal_pdf}(\mu_1 - \Sigma_{12} \Sigma_{22}^{-1}(y_2 - \mu_2), \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{12}^T)

That seems like it’d be a handy function to have. I’d prefer \Sigma_{11}, \Sigma_{12}, \Sigma_{22} to train/split in terms of verbage.

You’re saying that the covariance of p(y_1 | y_2) ends up not being positive definite? And it’s not something that can easily be fixed with jitter or whatever? I could see where the solves would get way easier if you only tried to compute the diagonal, at least. How quickly does this become a problem?


#3

We should keep that in mind when we try to fit it for @andrewgelman who would really really like to see someone code it in Stan. But how will that work if it’s not stable enough for full Bayes?

(I don’t know if the data’s available—it’s an example Aki coded for the third edition of Bayesian Data Analysis.)


#4

That’s not really what I said. Predictive mean is used as it is already illustrative. If we would add uncertainties for the plots, it would be enough to compute marginal variances, ie only the diagonal. Both marginal variances and full covariance for the full GP model for birthday data are stable to compute. What is more challenging is the variances and full covariance for parts of the additive function, ie when function is f = f_1 + f_2 + f_3… and we plot f_1, f_2, and f_3 separately, the means are easily computed, but when we compute the (co)variance of f_1 the uncertainty of f_2 and f_3 are included and that uncertainty typically means there is much uncertainty on the level of the function, which implies large constant term in the covariance matrix making it less stable. Even if we could stabilize the covariance term for f_1, we need additional tricks to compute as if f_2 and f_3 are known. This is something, e.g., Jarno Vanahtalo is using in his fish ecology modelling.

Summary: 1) For many plots of just f we care only about variances and thus why by default GPstuff computes only those. 2) For additive and multivariate models, when focusing in one component or dimension, the computation of uncertainties require more than computing the mean and covariance matrix can be less stable.

@drezap is the someone coding it in Stan. He has made some good progress. It’s possible that in order to keep the memory use of autodiff in control for the birthday problem we really need to present covariance functions as functions and call covariance matrix generation with that function, that is, we provide on function argument as for ODE solvers and algebraic solver.

Data is available and the source is also mentioned in BDA3.