Derivative of Gaussian Process

Is there a way in Stan to compute the derivative of a Gaussian process for some kernel? I know the derivative of a GP has a nice form since it’s again a GP, but does anybody have suggestions on the best way to implement this?

Do you mean that you just want a posterior on the derivative? If so, you just compute a running lag-1 difference in the generated quantities section. If you mean that you want your model to operate on the scale of the derivative of some observed data, just do a cumulative sum of the GP before the likelihood expression.
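For example, something like the following in the generated quantities block (a rough sketch only; here f is assumed to be the latent GP evaluated on an equally spaced grid of size N with spacing dx):

generated quantities {
  // finite-difference (lag-1) approximation to the derivative of the
  // latent GP f (assumed to live on an equally spaced grid with spacing dx)
  vector[N - 1] df = (f[2:N] - f[1:(N - 1)]) / dx;
}

For the other direction, something like cumulative_sum(g) * dx (with g the GP on the derivative scale) gives an approximate antiderivative to use before the likelihood.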

Not currently, nor in the near future. You need to derive the derivative yourself. There are Stan code examples available from Gabriel Riutort Mayol (paper: [1911.03454] Gaussian process with derivative information for the analysis of the sunlight adverse effects on color of rock art paintings; the code is not yet on GitHub as he's on vacation) and from Eero Siivola (the paper will be posted soon, but the Stan code is here: https://github.com/esiivola/admegp/blob/master/src/stan/utils/gp_functions.stan)

Lag-1 difference or cumsum are not enough if there are derivative observations.


Huh, consider me corrected then! I guess I was only thinking of the case where you have “regular” observations.

Would there be a way to write the spectral density of the GP with derivatives, from your practical Hilbert space paper (https://arxiv.org/pdf/2004.11408.pdf), to derive a low-rank approximation of a joint GP and derivative GP? I was able to code up the joint GP, but it's really slow to run HMC with derivative observations and a fine grid of test inputs. I was trying to find a reference somewhere but couldn't find anything. I'm not even sure if the joint covariance function is stationary anymore. I'm working with a 1-dimensional test input. If you have any tips… Thank you!

Yes. You can think of this from the basis function point of view. The basis function presentation is just a linear model with nonlinear basis functions, with priors on the coefficients determined by the spectral densities. The exact derivative of this approximation is obtained by taking the derivative of each basis function (which is easy, as the basis functions are sines and thus the derivatives are cosines) and using the same coefficients for the original basis functions and the derivative basis functions.
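In formulas, using the notation of the practical Hilbert space paper linked above (with boundary $L$, $\lambda_j = (\pi j / (2L))^2$, and $S_\theta$ the spectral density of the kernel):

$$
\phi_j(x) = \frac{1}{\sqrt{L}} \sin\!\left(\frac{\pi j (x + L)}{2L}\right), \qquad
\phi_j'(x) = \frac{\pi j}{2L} \frac{1}{\sqrt{L}} \cos\!\left(\frac{\pi j (x + L)}{2L}\right),
$$
$$
f(x) \approx \sum_{j=1}^{M} \sqrt{S_\theta\!\left(\sqrt{\lambda_j}\right)} \, \beta_j \, \phi_j(x), \qquad
f'(x) \approx \sum_{j=1}^{M} \sqrt{S_\theta\!\left(\sqrt{\lambda_j}\right)} \, \beta_j \, \phi_j'(x),
$$

with the same $\beta_j \sim \mathrm{normal}(0, 1)$ coefficients in both sums.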


Hi @avehtari, thank you for your answer! I gave it a try. I have issues estimating the noise variance, but the length scale rho and the marginal standard deviation alpha seem OK. Here's the data-generating process for a joint GP and derivative GP with an exponentiated quadratic covariance function (1-dimensional):

functions {
  // Joint covariance matrix for a GP and its derivative under the
  // exponentiated quadratic kernel. The function inputs x and the
  // derivative inputs y are passed as array, vector, and row_vector
  // copies so the elementwise matrix expressions below are easy to write.
  matrix joint_gp_dgp_exp_cov(real[] x, vector x_vec, row_vector x_rvec,
                              real[] y, vector y_vec, row_vector y_rvec,
                              real alpha, real rho) {
    int Nx = size(x_vec);
    int Ny = size(y_vec);
    int Ntot = Nx + Ny;
    real inv_sq_rho = inv(square(rho));
    matrix[Ntot, Ntot] K;
    // K_00 = cov(f(x), f(x)), size Nx * Nx
    K[1:Nx, 1:Nx] = gp_exp_quad_cov(x, alpha, rho);
    // K_01 = cov(f(x), f'(y)), size Nx * Ny
    K[1:Nx, (Nx+1):Ntot] = inv_sq_rho
      * (rep_matrix(x_vec, Ny) - rep_matrix(y_rvec, Nx))
      .* gp_exp_quad_cov(x, y, alpha, rho);
    // K_10 = cov(f'(y), f(x)), size Ny * Nx
    K[(Nx+1):Ntot, 1:Nx] = K[1:Nx, (Nx+1):Ntot]';
    // K_11 = cov(f'(y), f'(y)), size Ny * Ny
    K[(Nx+1):Ntot, (Nx+1):Ntot] = inv_sq_rho
      * gp_exp_quad_cov(y, alpha, rho)
      .* (rep_matrix(1.0, Ny, Ny)
          - inv_sq_rho * (rep_matrix(y_vec .^ 2, Ny)
                          + rep_matrix(y_rvec .^ 2, Ny)
                          - 2 * y_vec * y_rvec));
    return K;
  }
}

data {
  int<lower=1> N;
  real x[N];
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}

transformed data {
  vector[N] x_vec;
  row_vector[N] x_rvec;
  int Nx = size(x_vec);
  int Ny = size(x_vec);
  int Ntot = Nx + Ny;
  for (n in 1:N) {
    x_vec[n] = x[n];
    x_rvec[n] = x[n];
  }
  matrix[Ntot, Ntot] cov = add_diag(joint_gp_dgp_exp_cov(x, x_vec, x_rvec, 
    x, x_vec, x_rvec, alpha, rho), 1e-10);
  matrix[Ntot, Ntot] L_cov = cholesky_decompose(cov);
}

parameters {}
model {}

generated quantities {
  vector[Ntot] f = multi_normal_cholesky_rng(rep_vector(0, Ntot), L_cov);
  real y[Ntot] = normal_rng(f, sigma);
}
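For reference, the blocks filled in by joint_gp_dgp_exp_cov are the standard derivative identities of the exponentiated quadratic kernel:

$$
k(x, x') = \alpha^2 \exp\!\left(-\frac{(x - x')^2}{2\rho^2}\right), \qquad
\mathrm{cov}\!\left(f(x), f'(x')\right) = \frac{x - x'}{\rho^2} \, k(x, x'),
$$
$$
\mathrm{cov}\!\left(f'(x), f'(x')\right) = \frac{1}{\rho^2} \left(1 - \frac{(x - x')^2}{\rho^2}\right) k(x, x').
$$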

Below is the code for the estimation with basis functions and the spectral density:

functions {
  // square root of the spectral density of the exponentiated quadratic
  // kernel, evaluated at the first M frequencies
  vector diagSPD_EQ(real alpha, real rho, real L, int M) {
    return sqrt((alpha^2) * sqrt(2*pi()) * rho * exp(-0.5*(rho*pi()/2/L)^2 * linspaced_vector(M, 1, M)^2));
  }

  // basis functions evaluated at x
  matrix PHI_EQ(int N, int M, real L, vector x) {
    return sin(diag_post_multiply(rep_matrix(pi()/(2*L) * (x+L), M), linspaced_vector(M, 1, M)))/sqrt(L);
  }

  // derivatives of the basis functions with respect to x
  matrix D_PHI_EQ(int N, int M, real L, vector x) {
    return diag_post_multiply(pi()/(2*L)*cos(diag_post_multiply(rep_matrix(pi()/(2*L) * (x+L), M), linspaced_vector(M, 1, M))), linspaced_vector(M, 1, M))/sqrt(L);
  }

}

data {
  int<lower=1> N;
  int<lower=2> Ntot;
  vector[N] x;
  vector[N] y1;
  vector[N] y2;
  real<lower=0> c_f;   // factor c to determine the boundary value L
  int<lower=1> M_f;    // number of basis functions
}

transformed data {
  vector[2*N] y = append_row(y1, y2);
  real xmean = mean(x);
  real y1mean = mean(y1);
  real y2mean = mean(y2);
  real xsd = sd(x);
  real y1sd = sd(y1);
  real y2sd = sd(y2);
  vector[N] xn = (x - xmean)/xsd;
  vector[N] y1n = (y1 - y1mean)/y1sd;
  vector[N] y2n = (y2 - y2mean)/y2sd;
  // Basis functions for f
  real L_f = c_f*max(xn);
  matrix[N,M_f] PHI_f = PHI_EQ(N, M_f, L_f, xn);
  matrix[N,M_f] D_PHI_f = D_PHI_EQ(N, M_f, L_f, xn);
}

parameters {
  // real intercept;
  vector[M_f] beta_f;          // the basis functions coefficients
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}

model {
  // spectral densities for f 
  vector[M_f] diagSPD_f = diagSPD_EQ(alpha, rho, L_f, M_f);
  // priors
  beta_f ~ std_normal();
  rho ~ inv_gamma(2, 0.5); 
  alpha ~ normal(0, 5);
  sigma ~ normal(0,0.5);
  // likelihood: function observations
  y1n ~ normal(PHI_f * (diagSPD_f .* beta_f), sigma/y1sd);
  // likelihood: derivative observations, reusing the same coefficients beta_f
  y2n ~ normal(D_PHI_f * (diagSPD_f .* beta_f), sigma/y2sd);
}



Here are the marginal posterior distributions of alpha (true value 2.00), rho (true value 1.40), and sigma (true value 0.10). I use 500 samples for y1 and 500 samples for y2, with c_f = 2.0 and M_f = 15; x is an equally spaced grid of 500 points between -1 and 1.


Update: the true sigma is inside the 95% credible interval of the posterior when I increase the number of basis functions to 200 and the boundary factor to 5.0 (might be overkill).

Hi @avehtari,

I was wondering if the code ever got released. This would be very handy to have for a few projects I’m working on.

Which code? We have plenty of code released, so maybe tell a bit more what you are looking for?

I meant the code for this paper.

You need to ask Gabriel about that one. But we are no longer recommending the specific approach used in that paper if you have just 1D functions (possibly as part of a hierarchical model). I could possibly give you better advice if you tell me more about what you want to do.

Sorry to interject, but the case you just described (a 1D function with derivative observations, a hierarchical model, and some monotonicity and/or sign constraints) is exactly the situation I would like to learn more about as well. I have a non-Gaussian likelihood with a joint GP and derivative GP prior on a function. The likelihood requires evaluating both the function and its derivative, which motivates the use of this joint prior. These GP functions should be built hierarchically, as I only observe about 5-50 points per group, potentially sharing information through the hyperparameters and/or the mean of the GP. Right now, the posterior is inferred without constraints on the GP, but I would like to impose constraints, as that is consistent with theory. If the approach in the paper on rock art paintings is not recommended, I would like to learn what would be recommended. Thanks!


It would probably be better to have a separate thread for monotonic functions and keep this thread only for “Derivative of Gaussian Process”, but I'll reply here to also warn about the possible problems of the virtual derivative approach for monotonicity.

We have a BNP@NeurIPS 2018 workshop paper, A non-parametric probabilistic model for monotonic functions (which I forgot to mention earlier), discussing the problems with enforcing monotonicity via the derivative approach. Although the approach proposed in that workshop paper seemed promising at the time, there was not much practical advantage compared to I-splines; when transforming GPs we lose the computational advantages, and the interpretation of the prior on function space after the transformation is also more complicated. We then prioritized other things and haven't done more on that side. The virtual derivative approach can still be useful in higher-dimensional problems where an additive model of simple transformations of 1D GPs or I-splines is not enough.

I'm very interested to hear your experiences with non-linear monotonic functions and happy to discuss more if needed. I recommend starting by comparing prior simulations using I-splines vs. the approach in the above-mentioned workshop paper. I really don't know beforehand which would be better for your specific application, but they should have very similar computational cost.


This goes even further off-topic from GPs, but sometimes a prior for monotonic functions can also be derived using ODEs, especially if you have useful knowledge about the phenomenon being modelled.
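One simple construction of this kind (just a sketch of the idea) is to put a prior on the log of the derivative and define the function as the solution of the resulting ODE,

$$
f'(x) = \exp(h(x)), \qquad f(x) = f(x_0) + \int_{x_0}^{x} \exp(h(t)) \, dt,
$$

where h can have, for example, a GP or spline prior; the positivity of the exponential then guarantees monotonicity.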

I will try the approach in the paper. Thank you for the pointers. I need to evaluate both a function and its derivative in the likelihood, for the same observations. It looks like a transformed GP prior as described in your 2018 BNP@NeurIPS workshop paper is the way to go, as I can also get the derivative process by taking the derivatives of the eigenfunctions and using the same coefficients alpha for the GP and its derivative. The I-splines also look like they can be differentiated into M-splines, which is good. It would be interesting to know what the theoretical advantages of each (I-splines vs. transformed GPs) are, beyond the purely computational aspects.
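For reference, my understanding of the spline construction: each I-spline is the integral of a non-negative M-spline, so a monotone function and its derivative share the same non-negative coefficients,

$$
f(x) = a_0 + \sum_j c_j I_j(x), \quad c_j \ge 0
\quad\Longrightarrow\quad
f'(x) = \sum_j c_j M_j(x) \ge 0,
$$

which mirrors reusing the same beta coefficients for the basis functions and their derivatives in the GP approximation above.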

If the code for the paper is available somewhere, I would happily use it [and I can start a new thread if needed].

My experience with virtual locations is that I've observed behavior similar to what you describe in the paper. But there is a paper by Wang and Berger (2016) that implements an algorithm to estimate an “optimal number” of virtual locations in the special case of a Gaussian likelihood and point estimates for the GP hyperparameters.

Sorry to dredge up an old topic, but I'm really interested in the workshop paper you shared. Do you have any code for it that you could share?

Andersen has some code; you can email him (or me, if you are not able to find his email address).