As it stands, users currently need to write out functions for the predictive posterior distributions of Gaussian processes by hand. This isn’t such a big deal for single-kernel models, but when the covariance structure becomes more complex, we need to write a pred_rng function for each subspace we want to predict. To use a well-known example, suppose we have a sum of kernels (i.e. the birthday problem):
y(t) = f_1(t) + f_2(t) + \dots + f_7(t) + \epsilon
For the predictive mean of the desired covariance subspace (without loss of generality, say f_1), we have the following mean function:
(1) E(f_1) = K_1(x, x') (K(x, x) + \sigma^2 I)^{-1} y
where f_1 \sim GP(\textbf{0}, K_1(x, x)) and K = K_1 + K_2 + \dots + K_7 is the full covariance over the training inputs.
At least two items need to be addressed:
- Currently, the Stan documentation only considers the case where we have one kernel (from @rtrangucci’s hierarchical GP case study):
```stan
vector gp_pred_rng(real[] x_pred,
                   vector y_is,
                   real[] x_is,
                   real alpha,
                   real length_scale,
                   real sigma) {
  vector[size(x_pred)] f_pred;
  int N_pred;
  int N;
  N_pred = size(x_pred);
  N = rows(y_is);
  {
    matrix[N, N] L_Sigma;
    vector[N] K_div_y_is;
    matrix[N, N_pred] k_x_is_x_pred;
    matrix[N, N_pred] v_pred;
    vector[N_pred] f_pred_mu;
    matrix[N_pred, N_pred] cov_f_pred;
    matrix[N_pred, N_pred] nug_pred;
    matrix[N, N] Sigma;
    Sigma = cov_exp_quad(x_is, alpha, length_scale);
    for (n in 1:N)
      Sigma[n, n] = Sigma[n, n] + square(sigma);
    L_Sigma = cholesky_decompose(Sigma);
    K_div_y_is = mdivide_left_tri_low(L_Sigma, y_is);
    K_div_y_is = mdivide_right_tri_low(K_div_y_is', L_Sigma)';
    k_x_is_x_pred = cov_exp_quad(x_is, x_pred, alpha, length_scale);
    f_pred_mu = (k_x_is_x_pred' * K_div_y_is);
    v_pred = mdivide_left_tri_low(L_Sigma, k_x_is_x_pred);
    cov_f_pred = cov_exp_quad(x_pred, alpha, length_scale) - v_pred' * v_pred;
    nug_pred = diag_matrix(rep_vector(1e-12, N_pred));
    f_pred = multi_normal_rng(f_pred_mu, cov_f_pred + nug_pred);
  }
  return f_pred;
}
```
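For context, this gets called from generated quantities, roughly like so (a sketch only; the variable names x_pred, N_pred, y, x and the parameter names are assumptions, not quoted from the case study):

```stan
generated quantities {
  // sketch: x_pred and N_pred come from data, y and x are the training data,
  // and alpha, length_scale, sigma are model parameters
  vector[N_pred] f_pred = gp_pred_rng(x_pred, y, x, alpha, length_scale, sigma);
}
```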
whereas, when we have a more complex covariance structure (continuing the example above), we need to leverage the entire covariance structure and then project onto the subspace with the desired structure. Consider BDA3, section 21.2: if we want the predictive posterior for the “weakly quasi-periodic” pattern, we need to implement a pred_rng function that looks something like this:
```stan
vector gp_pred_periodic_exp_quad_rng(real[] x2,
                                     vector y, real[] x1,
                                     matrix k,
                                     real magnitude_1, real length_scale_1,
                                     real magnitude_2, real length_scale_2, real period,
                                     real sigma, real jitter) {
  // x2: test data
  // x1: training data
  // k: full training covariance K(x1, x1), summed over all components
  // magnitude_1: magnitude of squared exponential kernel
  // length_scale_1: length scale of squared exponential kernel
  // magnitude_2: magnitude of periodic kernel
  // length_scale_2: length scale of periodic kernel
  // period: period of periodic kernel
  // sigma: model noise
  // jitter: for numerical stability of the inverse, etc.
  int N1 = rows(y);
  int N2 = size(x2);
  vector[N2] f;
  {
    // add observation noise to the full training covariance
    matrix[N1, N1] K = k + diag_matrix(rep_vector(square(sigma), N1));
    matrix[N1, N1] L_K = cholesky_decompose(K);
    vector[N1] L_K_div_y = mdivide_left_tri_low(L_K, y);
    vector[N1] K_div_y = mdivide_right_tri_low(L_K_div_y', L_K)';
    // cross-covariance between training and test inputs,
    // for the quasi-periodic component only
    matrix[N1, N2] k_x1_x2 = gp_periodic_cov(x1, x2, magnitude_2, length_scale_2, period) .*
                             cov_exp_quad(x1, x2, magnitude_1, length_scale_1);
    vector[N2] f_mu = (k_x1_x2' * K_div_y);
    matrix[N1, N2] v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
    matrix[N2, N2] cov_f = gp_periodic_cov(x2, magnitude_2, length_scale_2, period) .*
                           cov_exp_quad(x2, magnitude_1, length_scale_1) - v_pred' * v_pred
                           + diag_matrix(rep_vector(jitter, N2));
    f = multi_normal_rng(f_mu, cov_f);
  }
  return f;
}
```
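The k argument here is the full training covariance summed over all components. A usage sketch (the k_total name and where it is computed are assumptions on my part) might look like:

```stan
generated quantities {
  // sketch only: k_total is the full (noise-free) training covariance
  // K(x1, x1) summed over all component kernels, computed elsewhere in the
  // program; the call projects onto the quasi-periodic subspace
  vector[N2] f_quasi_periodic
      = gp_pred_periodic_exp_quad_rng(x2, y, x1, k_total,
                                      magnitude_1, length_scale_1,
                                      magnitude_2, length_scale_2, period,
                                      sigma, jitter);
}
```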
which is just eq. (1) above, where K_1(x, x') is now the cross-covariance between the training and test sets. One can check R&W for the equations for the predictive variance.
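For reference, the corresponding predictive covariance (the standard GP predictive equation from R&W, restricted to the subspace) is

Cov(f_1) = K_1(x', x') - K_1(x', x) (K(x, x) + \sigma^2 I)^{-1} K_1(x, x')

which is what cov_f computes above (plus jitter on the diagonal).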
I haven’t worked out all the details (I’m fairly certain the function signatures below are inaccurate), but to make Gaussian processes in Stan more manageable, I’m proposing a function that can read in the entire kernel and either (a) takes a pointer to a function that computes the necessary test-data kernel, or (b) takes the pre-computed cross-covariance between the test and training sets along with the total kernel, and then returns the desired predictive distribution (and, if need be, the test-test covariance). The function signature for the latter would look like this:
```stan
gp_pred(matrix test_kernel, matrix test_train_kernel, matrix train_kernel)
```
or, if we go with the former, something like this:
```stan
gp_pred(func pred_kernel, matrix x_train, matrix x_test, vector parameters[], matrix train_kernel)
```
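To make the matrix-based option concrete, here is a rough user-defined sketch of what it could compute. This is an illustration under my own assumptions, not a proposed final API: the function name, the extra y and jitter arguments, and the _rng suffix are all placeholders.

```stan
functions {
  // Hypothetical sketch only: test_kernel = K_1(x*, x*),
  // test_train_kernel = K_1(x*, x), train_kernel = K(x, x) + sigma^2 I.
  // The y and jitter arguments are additions for illustration.
  vector gp_pred_general_rng(matrix test_kernel, matrix test_train_kernel,
                             matrix train_kernel, vector y, real jitter) {
    int N = rows(train_kernel);
    int N_pred = rows(test_kernel);
    matrix[N, N] L_K = cholesky_decompose(train_kernel);
    // K^{-1} y via two triangular solves
    vector[N] K_div_y = mdivide_right_tri_low(mdivide_left_tri_low(L_K, y)', L_K)';
    vector[N_pred] f_mu = test_train_kernel * K_div_y;  // eq. (1)
    matrix[N, N_pred] v = mdivide_left_tri_low(L_K, test_train_kernel');
    matrix[N_pred, N_pred] cov_f = test_kernel - v' * v
                                   + diag_matrix(rep_vector(jitter, N_pred));
    return multi_normal_rng(f_mu, cov_f);
  }
}
```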
Another caveat that needs to be discussed is the stability of the predictive variance in high dimensions. After spending many hours wondering why my predictive covariance matrix was not positive definite, I mustered up some courage to ask @avehtari why this was the case. It turns out that in high dimensions the predictive variance is often unstable. In GPstuff, the default is to only use the diagonal of the predictive covariance matrix, which tends to overestimate uncertainty. You’ll notice in the birthday problem that only the predictive mean was used, likely due to the high-dimensional instability of this structure. So if we write a function like this, what should the defaults be? Just the predictive mean? Or only the diagonal of the predictive variance?
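For concreteness, the diagonal-only option would amount to something like this (a sketch; the function name is made up):

```stan
functions {
  // Hypothetical diagonal-only variant (cf. the GPstuff default): draw each
  // test point independently from its marginal, ignoring off-diagonal terms
  // of the predictive covariance. Tends to overestimate joint uncertainty.
  vector gp_pred_diag_rng(vector f_mu, matrix cov_f) {
    int N = rows(f_mu);
    vector[N] f;
    for (n in 1:N)
      f[n] = normal_rng(f_mu[n], sqrt(cov_f[n, n]));
    return f;
  }
}
```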
Thoughts on this?