Derivative of Gaussian Process

Is there a way in Stan to compute the derivative of a Gaussian process for some kernel? I know the derivative of a GP has a nice form since it’s again a GP, but does anybody have suggestions on the best way to implement this?

Do you mean that you just want a posterior on the derivative? If so, you just compute a running lag-1 difference in the generated quantities section. If you mean that you want your model to operate on the scale of the derivative of some observed data, just do a cumulative sum of the GP before the likelihood expression.

Not currently or in near future. You need to derive the derivative yourself. There are Stan code examples available by Gabriel Riutort Mayol (paper [1911.03454] Gaussian process with derivative information for the analysis of the sunlight adverse effects on color of rock art paintings, but the code was not yet in github he’s on vacation) and Eero Siivola (the paper will be posted soon, but the Stan code is here admegp/gp_functions.stan at master · esiivola/admegp · GitHub)

Lag-1 difference or cumsum are not enough if there are derivative observations.


Huh, consider me corrected then! I guess I was only thinking of the case where you have “regular” observations.

Would there be a way to write the spectral density of the GP with derivatives, from your practical Hilbert space paper ( to derive a low-rank approximation of a joint GP and derivative GP? I was able to code up the joint GP but it’s really slow to run HMC with derivative observations and a fine grid of test inputs. I was trying to find a reference somewhere but couldn’t find anything. I’m not even sure if the joint covariance function is stationary anymore. I’m working with a 1-dimensional test input. If you have some tips… Thank you!

Yes. You can think this in from the basis function point of view. The basis function presentation is just a linear model with nonlinear basis functions with priors on the coefficients determined by the spectral densities. The exact derivative of this approximation is obtained by taking the derivative of each basis function (which are easy for as the basis functions are sin and cos, and thus the derivatives are cos and sin) and using the same coefficients for the original basis functions and the derivative basis functions.

1 Like

Hi @avehtari , thank you for your answer! I gave it a try. I have issues estimating the noise variance, but the length scale rho and marginal deviation alpha seem OK. Here’s the data generating process for a joint GP and derivative GP with exponentiated quadratic covariance function (1 dimensional):

functions {
    matrix joint_gp_dgp_exp_cov(real[] x, vector x_vec, row_vector x_rvec,
      real[] y, vector y_vec, row_vector y_rvec,
      real alpha, real rho) {
      int Nx = size(x_vec);
      int Ny = size(y_vec);
      int Ntot = Nx + Ny;
      real inv_sq_rho = inv(square(rho));
      matrix[Ntot, Ntot] K;
      // should be (N+Nobs)*(N+Nobs)=Ns*Ns
      K[1:Nx, 1:Nx] = gp_exp_quad_cov(x, alpha, rho);
      // should be N*N
      K[1:Nx, (Nx+1):(Ntot)] = inv_sq_rho * (rep_matrix(x_vec, Ny) - rep_matrix(y_rvec, Nx)) .* gp_exp_quad_cov(x, y, alpha, rho) ;
      // K_01, should be N*Nobs
      K[(Nx+1):(Ntot), 1:Nx] = K[1:Nx, (Nx+1):(Ntot)]';
      // K_10, should be Nobs*N
      K[(Nx+1):Ntot, (Nx+1):Ntot] = inv_sq_rho * gp_exp_quad_cov(y, alpha, rho) .* (rep_matrix(1.0, Ny, Ny) - inv_sq_rho * (rep_matrix(y_vec .^ 2 , Ny) + rep_matrix(y_rvec .^ 2, Ny) - 2 * y_vec * y_rvec));
      //  K_11, size Nobs*Nobs
    return K;

data {
  int<lower=1> N;
  real x[N];
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;

transformed data {
  vector[N] x_vec;
  row_vector[N] x_rvec;
  int Nx = size(x_vec);
  int Ny = size(x_vec);
  int Ntot = Nx + Ny;
  for (n in 1:N) {
    x_vec[n] = x[n];
    x_rvec[n] = x[n];
  matrix[Ntot, Ntot] cov = add_diag(joint_gp_dgp_exp_cov(x, x_vec, x_rvec, 
    x, x_vec, x_rvec, alpha, rho), 1e-10);
  matrix[Ntot, Ntot] L_cov = cholesky_decompose(cov);

parameters {}
model {}

generated quantities {
  vector[Ntot] f = multi_normal_cholesky_rng(rep_vector(0, Ntot), L_cov);
  real y[Ntot] = normal_rng(f, sigma);

and below is the code for the estimation with basis functions and spectral density:

functions {
  vector diagSPD_EQ(real alpha, real rho, real L, int M) {
    return sqrt((alpha^2) * sqrt(2*pi()) * rho * exp(-0.5*(rho*pi()/2/L)^2 * linspaced_vector(M, 1, M)^2));

  matrix PHI_EQ(int N, int M, real L, vector x) {
    return sin(diag_post_multiply(rep_matrix(pi()/(2*L) * (x+L), M), linspaced_vector(M, 1, M)))/sqrt(L);

  matrix D_PHI_EQ(int N, int M, real L, vector x) {
    return diag_post_multiply(pi()/(2*L)*cos(diag_post_multiply(rep_matrix(pi()/(2*L) * (x+L), M), linspaced_vector(M, 1, M))), linspaced_vector(M, 1, M))/sqrt(L);


data {
  int<lower=1> N;
  int<lower=2> Ntot;
  vector[N] x;
  vector[N] y1;
  vector[N] y2;
  real<lower=0> c_f;   // factor c to determine the boundary value L
  int<lower=1> M_f;    // number of basis functions

transformed data {
  vector[2*N] y = append_row(y1, y2);
  real xmean = mean(x);
  real y1mean = mean(y1);
  real y2mean = mean(y2);
  real xsd = sd(x);
  real y1sd = sd(y1);
  real y2sd = sd(y2);
  vector[N] xn = (x - xmean)/xsd;
  vector[N] y1n = (y1 - y1mean)/y1sd;
  vector[N] y2n = (y2 - y2mean)/y2sd;
  // Basis functions for f
  real L_f = c_f*max(xn);
  matrix[N,M_f] PHI_f = PHI_EQ(N, M_f, L_f, xn);
  matrix[N,M_f] D_PHI_f = D_PHI_EQ(N, M_f, L_f, xn);

parameters {
  // real intercept;
  vector[M_f] beta_f;          // the basis functions coefficients
  real<lower=0> rho;
  real<lower=0> alpha;

model {
  // spectral densities for f 
  vector[M_f] diagSPD_f = diagSPD_EQ(alpha, rho, L_f, M_f);
  // priors
  beta_f ~ std_normal();
  rho ~ inv_gamma(2, 0.5); 
  alpha ~ normal(0, 5);
  sigma ~ normal(0,0.5);
  y1n ~ normal(PHI_f * (diagSPD_f .* beta_f), sigma/y1sd);
  y2n ~ normal(D_PHI_f * (diagSPD_f .* beta_f), sigma/y2sd);

and the marginal posterior distributions of alpha (true value = 2.00), rho (true value 1.40), and sigma (true value = 0.10). I use 500 samples for y1 and 500 samples for y2, and c_f = 2.0, M_f = 15, x is a equally spaced grid of 500 points between -1 and 1.

1 Like

Update: the true sigma is in the 95% credible interval of the posterior when I increase the number of basis functions to 200 and boundary factor to 5.0 (might be overkill).

Hi @avehtari,

I was wondering if the code ever got released. This would be very handy to have for a few projects I’m working on.

Which code? We have plenty of code released, so maybe tell a bit more what you are looking for?

I meant the code for this paper.

You need to ask Gabriel about that one. But we are not recommending the specific approach used in the paper anymore if you have just 1D functions (possibly as part of hierarchical model). I could possibly give you better advice if you can tell more what you want to do.

I am sorry to interfere, but the case you just described (1d function, with derivative observations + hierarchical model + some monotonicity and/or sign constraints) is exactly the situation I would like to learn more as well. I have a non-Gaussian likelihood with joint GP and derivative GP prior on a function. The likelihood requires the evaluation of both a function and its derivative evaluation, which motivates the use of this joint GP prior. These GP functions should be hierarchically built, as I only observe about 5-50 points per group, potentially leveraging hierarchy through the hyperparameters and/or the mean of the GP. Right now, the posterior is inferred without constraints on the GP, but I would like to impose constraints as it is consistent with theory. If the approach in the paper on rock art paintings is not recommended, I would like to learn what would be recommended. Thanks!

1 Like

It would probably be better to have a separate thread for monotonic functions and keep this thread only for “Derivative of Gaussian Process”, but reply here to warn here also about the possible problems of virtual derivative approach for monotonicity.

We have a BNP@NeurIPS 2018 workshop paper A non-parametric probabilistic model for monotonic functions (which I forgot to mention earlier) discussing the problems with monotonicity constraint using the derivative approach. Although the approach proposed in that workshop paper seemed at that time promising, there was not much practical advantage compared to I-splines and when transforming GPs, we lose the computational advantages, and the interpretation of the prior on function space after the transformation is also more complicated. We then prioritized other things and haven’t done more on that side. The virtual derivative approach can still be useful in higher dimensional problems where additive model of simple transformations of 1D GP or I-splines is not enough.

I’m very interested to hear your experiences with non-linear monotonic functions and happy to discuss more if needed. I recommend to start with comparing prior simulations using I-splines vs the approach in that above mentioned workshop paper. I really don’t know beforehand which would be better for your specific application, but they would have very similar computational cost.


This goes now even more off-topic to GPs, but sometimes a prior for monotonic functions could be derived using ODEs, too, especially if you have useful knowledge about the phenomenon modelled.

Will try the approach in the paper. Thank you for the pointers. I need to evaluate both a function and its derivative in the likelihood, for the same observations. It looks like a transformed GP prior as described in your 2018 BNP@NeurIPS workshop paper is the way to go, as I can also get the derivative process by taking the derivatives of the eigenfunctions and using the same coefficients alpha for the GP and the derivative. The i-splines also look like they can be differentiated into m-splines, which is good. It would be interesting to know what are the theoretical advantages of either (i-splines vs transformed GPs) beyond the pure computational aspects.

If the code of the paper is available somewhere, I would happily use it [and I can start a new thread if needed].

My experience with virtual locations is that I’ve observed similar behavior than the one you describe in the paper. But there is this paper by Wang and Berger (2016) that implements an algorithm to estimate an “optimal number” of virtual locations in the special case of a Gaussian likelihood and point estimators for the GP hyperparameters.

Sorry to dredge up an old topic, but really interested by the workshop paper you shared. Do you have any code for it that you could share?

Andersen has some code, you can email him (or me if you are not able to find his email address)