Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming

We have a new paper: Gabriel Riutort-Mayol, Paul-Christian Bürkner, Michael R. Andersen, Arno Solin, and Aki Vehtari (2020). Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. https://arxiv.org/abs/2004.11408, with code in Stan and brms.

The approximation itself is not new, as it is based on the Hilbert space basis function approach by Solin and Särkkä: https://doi.org/10.1007/s11222-019-09886-w

The present paper adds

  • analysis of the performance and diagnostics in relation to the number of basis functions, domain of the prediction space, and smoothness of the latent function
  • practical implementation details for Stan

In addition to simulations, we re-create two models from BDA3 (originally implemented with GPstuff):

  • birthday model with 7305 observations in one dimension
  • leukemia survival model with 1043 observations in four dimensions with full interactions

In addition to the Stan code at https://github.com/gabriuma/basis_functions_approach_to_GP, this approximation has been available in brms for some time (https://rdrr.io/cran/brms/man/gp.html), but we hadn’t been advertising it much before we got the paper ready.

When GPs are implemented using covariance matrices, the computational cost is O(n^3) for every log density evaluation. This reduced-rank approximation has a pre-computation cost of O(m^2n) for m basis functions, but after that the computational cost of each log density evaluation is O(mn + m). The number of basis functions needed depends on the wiggliness of the target and the number of dimensions. The scaling in the number of dimensions is exponential, which restricts the practical use to 1-4 dimensions with full interactions. With additive models the scaling is linear in the number of additive components. Additional benefits of the basis function approximation in Stan are the reduced size of the autodiff tree and the possibility to use the compound GLM functions for additional speedup.
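To make the linear-in-parameters structure behind these costs concrete, here is a minimal R sketch (the helper names sqrt_spd_se and hsgp_f are my own, and the squared exponential spectral density is written in the Solin and Särkkä convention): once the n x m basis matrix is precomputed, every evaluation of the latent function is just a matrix-vector product.

# Minimal sketch: the HSGP latent function is a linear model in m weights
sqrt_spd_se <- function(omega, sigma, l) {
  # square root of the spectral density of the squared exponential kernel
  sqrt(sigma^2 * sqrt(2 * pi) * l * exp(-0.5 * l^2 * omega^2))
}

hsgp_f <- function(Phi, omega, beta, sigma, l) {
  # f = Phi %*% (sqrt(S(omega_j)) * beta_j), with beta_j ~ N(0, 1);
  # cost is O(mn) per evaluation once Phi is precomputed
  Phi %*% (sqrt_spd_se(omega, sigma, l) * beta)
}

n <- 1000; m <- 20; L <- 5
x <- seq(-4, 4, length.out = n)
omega <- (1:m) * pi / (2 * L)                  # spectral frequencies
Phi <- sqrt(1 / L) * sin(outer(x + L, omega))  # n x m, precomputed once
f <- hsgp_f(Phi, omega, beta = rnorm(m), sigma = 1, l = 1)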

EDIT: added a paragraph on computational cost.


Hi Aki, I have a quick question about this paper. I am trying to implement the approximation for an exponential covariance function (\nu = 1/2), which isn’t explicitly presented in the paper.


But it should be pretty straightforward to substitute into the formula for the Matérn-class spectral density. When I did this, however, I noticed that the 4\pi^2 factor seems to be missing from equations (1), (2), and (3). Looking at some of the reference materials, I also noticed that Rasmussen and Williams use a convention of the Fourier transform with the 2\pi in the exponent of the complex exponential, while Solin and Särkkä use a convention with a \frac{1}{2\pi} factor outside the integral.

I did a little testing, and the Riutort et al. method works with equations (1), (2), and (3) as presented, but seemingly not if one re-derives them using the Rasmussen and Williams spectral density equation and keeps the 4\pi^2 factor.

Can you provide any clarity here?


Sorry, these are not things that stick in my memory, and I don’t have time now to re-derive them. Our equations were checked by Solin (also a co-author), so I trust them. You could email Solin for clarification (he has been thinking about these more than I have, so he might remember the reasons for the different-looking equations).

No worries. I’m fairly sure it just has to do with the convention of the Fourier transform used. Looking at the \nu=\infty (squared exponential) case using the Rasmussen and Williams convention:
S(\omega) = \int_{\mathbb{R}} e^{-r^2/(2l^2)}\, e^{-2\pi i\omega r}\, dr = \sqrt{2\pi}\, l\, e^{-2\pi^2 l^2\omega^2},
which matches the formula given in the textbook (pg. 83).
Using the Solin and Särkkä convention:
S(\omega) = \int_{\mathbb{R}} e^{-r^2/(2l^2)}\, e^{-i\omega r}\, dr = \sqrt{2\pi}\, l\, e^{-l^2\omega^2/2},
which matches the formula in the Riutort et al. paper.

So it seems that for the general case, for which I wasn’t able to verify the integration, one substitutes \omega \to \omega/(2\pi) in the Rasmussen and Williams equation to get equations (1)-(3).
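For the exponential covariance I started with, substituting \nu = 1/2 into the one-dimensional Matérn-class spectral density (as I read it in the Solin and Särkkä convention; worth double-checking against the general formula in the paper) gives

S(\omega) = \sigma^2\,\frac{2\sqrt{\pi}\,\Gamma(\nu + 1/2)\,(2\nu)^{\nu}}{\Gamma(\nu)\,\ell^{2\nu}}\left(\frac{2\nu}{\ell^2} + \omega^2\right)^{-(\nu + 1/2)} \quad\Rightarrow\quad S(\omega) = \frac{2\sigma^2\ell}{1 + \ell^2\omega^2},

using \Gamma(1) = 1 and \Gamma(1/2) = \sqrt{\pi}.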


Hi! I’ve got a question regarding out-of-sample prediction using HSGP, not sure whether this is the right place to ask …

I can see that if my GP has a nugget, then under HSGP, I can derive the conditional distribution for out-of-sample locations. By Woodbury’s identity, I’d need to invert an M by M matrix instead of n by n (here n is the sample size, M is the number of basis functions).
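To be explicit about the identity I mean (in my notation, \Phi is the n \times M basis matrix, \Lambda the M \times M diagonal matrix of spectral densities, and \sigma^2 the nugget variance):

(\Phi \Lambda \Phi^T + \sigma^2 I_n)^{-1} = \sigma^{-2} I_n - \sigma^{-4}\,\Phi\left(\Lambda^{-1} + \sigma^{-2}\Phi^T\Phi\right)^{-1}\Phi^T,

so only the M \times M matrix \Lambda^{-1} + \sigma^{-2}\Phi^T\Phi needs to be inverted.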

However, I want to use a GP for spatially varying coefficients and I’m not planning on using a nugget. In this case, the covariance matrix of the conditional distribution for out-of-sample locations would be singular under HSGP. I’d like to ask how this is usually taken care of?

Thanks!!

Can you write out your derivation for the covariance matrix? My guess is that there shouldn’t be an issue. You may also want to double-check whether the matrix is truly singular or just numerically so (i.e., whether adding a small jitter to the diagonal solves the issue).

Yeah, for sure, and I don’t think this is a numerical issue.

Suppose I have a GLM: y_i \sim Poisson(\theta(d_i)), \theta(d) \sim GP(0, \alpha^2 K(l)), where \alpha is the GP scale, l is the lengthscale, and K(l) is a certain correlation kernel. Say I use HSGP with M basis functions. My data have a sample size of n_1, and I want to make predictions for n_2 locations. Let \theta_1 \in R^{n_1} denote the parameters for the observed locations, and \theta_2 \in R^{n_2} those for the prediction locations. Under HSGP, their joint distribution is approximately N(0, \Phi \Lambda \Phi^T), where \Phi has dimension (n_1+n_2) \times M and \Lambda is diagonal with dimension M \times M.

And \Phi can be separated into two parts, \Phi_1 \in R^{n_1 \times M} and \Phi_2 \in R^{n_2 \times M}, one for \theta_1 and one for \theta_2. So if I want to make predictions for \theta_2 based on posterior samples of \theta_1, I’d need to use the conditional distribution, for which I’d need to invert the matrix \Phi_1 \Lambda \Phi_1^T. If M < n_1, which is usually the case, this matrix is singular.

If I have a nugget there, the matrix I need to invert would instead be \Phi_1 \Lambda \Phi_1^T + \sigma^2 I, which wouldn’t be singular, and everything can proceed no problem. However, for examples like GLMs, or other situations where it’s not natural to have a nugget, I’m not sure what the standard practice is.
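To spell it out (standard Gaussian conditioning applied to the approximate joint above):

\theta_2 \mid \theta_1 \sim N\left(\Phi_2 \Lambda \Phi_1^T \left(\Phi_1 \Lambda \Phi_1^T\right)^{-1} \theta_1,\; \Phi_2 \Lambda \Phi_2^T - \Phi_2 \Lambda \Phi_1^T \left(\Phi_1 \Lambda \Phi_1^T\right)^{-1} \Phi_1 \Lambda \Phi_2^T\right),

and with a nugget the matrix being inverted becomes \Phi_1 \Lambda \Phi_1^T + \sigma^2 I instead.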

Let me know if anything’s unclear.

Thanks for providing this. I’ll think it over and try to post a reply soon (unless someone else gets to it before I do).


So for a non-Gaussian outcome I think you would need to use the latent variable formulation (e.g. Gaussian Processes, where f would be computed using the HSGP instead of the Cholesky decomposition of the full K), which doesn’t use those matrix inversions.

However, for a noise-free Gaussian outcome we would want to use the analytical form of the joint distribution, and I think it works just fine. I’m unsure why \Phi_1 \Lambda \Phi_1^T would necessarily be singular: \Phi_1 \Lambda \Phi_1^T has dimension (n_1 \times M) \times (M \times M) \times (M \times n_1) = n_1 \times n_1 and should be full rank. I don’t think the matrix inversion lemma applies when we don’t have a nugget.
Below is some code implementing the HSGP and making predictions under known parameters. Note that I use the following notation for the block covariances:

\boldsymbol{\Sigma}_{joint} = \begin{bmatrix} \boldsymbol{\Sigma}_{obs,obs} & \boldsymbol{\Sigma}_{obs,pred} \\ \boldsymbol{\Sigma}_{pred,obs} & \boldsymbol{\Sigma}_{pred,pred} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Sigma} & \boldsymbol{\Sigma}_{0}^T \\ \boldsymbol{\Sigma}_{0} & \boldsymbol{\Omega} \end{bmatrix}
library(fields)     # rdist() for distance matrices
library(ggplot2)    # plotting (ggplot is used below)
library(patchwork)  # combining the plots

c_light <- c("#DCBCBC") #colors
c_light_trans <- adjustcolor( c_light, alpha.f = 0.7)
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_trans <- adjustcolor(c_mid, alpha.f = 0.7)
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_trans <- adjustcolor(c_dark, alpha.f = 0.7)
c_dark_highlight <- c("#7C0000")
c_light_teal <- c("#6B8E8E")
c_mid_teal <- c("#487575")
c_dark_teal <- c("#1D4F4F")

set.seed(420)
n_obs <- 50
n_pred <- 500
jitter <- 1E-6

x_latent <- seq(-5,5,length.out = 1000)
obs_idx <- sort(sample(1:1000, n_obs))
x_obs <- x_latent[obs_idx]

sigma <- 1   # GP marginal SD
a <- 2       # lengthscale of the exponential (Matern 1/2) kernel

# Simulate the latent GP on the fine grid from the exact covariance
dists_latent <- rdist(x_latent, x_latent)
Sigma_latent <- sigma^2*exp(-dists_latent/a)
Z_latent <- t(chol(Sigma_latent)) %*% rnorm(1000)
Z_obs <- Z_latent[obs_idx]

# Covariance blocks for exact GP prediction
x_pred <- seq(-5,5,length.out = n_pred)
dists_pred <- rdist(x_pred, x_pred)
dists_obs <- rdist(x_obs, x_obs)
dists_0 <- rdist(x_pred, x_obs)
Sigma_0 <- sigma^2*exp(-dists_0/a)
Sigma <- sigma^2*exp(-dists_obs/a)
Omega <- sigma^2*exp(-dists_pred/a)


# Exact GP conditional (kriging) mean and covariance
mu_pred <- Sigma_0 %*% solve(Sigma) %*% Z_obs
Sigma_pred <- Omega - Sigma_0 %*% solve(Sigma) %*% t(Sigma_0)
sigma_pred <- sqrt(diag(Sigma_pred) + jitter)

uppers <- mu_pred+1.96*sigma_pred
lowers <- mu_pred-1.96*sigma_pred


f <- ggplot() + geom_ribbon(data = data.frame(x=x_pred, uppers, lowers), aes(x=x, ymin = lowers, ymax=uppers), fill = c_light_trans) + geom_line(data = data.frame(x=x_latent,y = Z_latent), aes(x=x,y=y), color = "gray") + geom_line(data = data.frame(x=x_pred, y=mu_pred), aes(x=x,y=y), color = c_dark) + geom_point(data = data.frame(x=x_obs, y = Z_obs), aes(x=x,y=y)) + xlab("t") + ylab("Z(t)") + labs(title = "Exact")

# Spectral density of the exponential (Matern 1/2) covariance:
# S(omega) = 2*a*sigma^2 / (1 + a^2*omega^2)
exp_cov_sdens <- function(omega, a, sigma){
  return(2*a*sigma^2/(1+a^2*omega^2))
}

# Basis functions evaluated at a single point x (m x 1)
phi_vec <- function(m, x, L){
  temp <- matrix(NA_real_, nrow = m, ncol = 1)
  temp[,1] <- L^(-1/2)*sin((1:m)*pi/(2*L)*(x+L))
  return(temp)
}

# Basis functions evaluated at a vector of points x (n x m)
phi_mat <- function(m, x, L){
  n <- length(x)
  temp <- matrix(NA_real_, nrow = n, ncol = m)
  for (i in 1:n){
    temp[i,] <- L^(-1/2)*sin((1:m)*pi/(2*L)*(x[i]+L))
  }
  return(temp)
}

# Diagonal matrix of spectral densities evaluated at sqrt(eigenvalues)
Delta_mat_exp_cov <- function(m, L, a, sigma){
  lambda <- ((1:m)*pi/(2*L))^2
  d <- exp_cov_sdens(sqrt(lambda), a, sigma)
  return(diag(d, m))
}

m <- 512    # number of basis functions
c <- 1.1    # boundary factor
L <- c*5    # half-width of the approximation domain

Phi_obs <- phi_mat(m, x_obs, L)
Lambda <- Delta_mat_exp_cov(m = m, L = L, a = a, sigma = 1)
Phi_pred <- phi_mat(m, x_pred, L)

# HSGP approximations of the covariance blocks
Sigma_0 <- Phi_pred %*% Lambda %*% t(Phi_obs)
Sigma <- Phi_obs %*% Lambda %*% t(Phi_obs) + diag(jitter, n_obs)
Omega <- Phi_pred %*% Lambda %*% t(Phi_pred)

# Conditional (predictive) mean and covariance under the approximation
mu_pred <- Sigma_0 %*% solve(Sigma) %*% Z_obs
Sigma_pred <- Omega - Sigma_0 %*% solve(Sigma) %*% t(Sigma_0)
sigma_pred <- sqrt(diag(Sigma_pred) + jitter)

uppers <- mu_pred + 1.96*sigma_pred
lowers <- mu_pred - 1.96*sigma_pred

g <- ggplot() + geom_ribbon(data = data.frame(x=x_pred, uppers, lowers), aes(x=x, ymin = lowers, ymax=uppers), fill = c_light_trans) + geom_line(data = data.frame(x=x_latent,y = Z_latent), aes(x=x,y=y), color = "gray") + geom_line(data = data.frame(x=x_pred, y=mu_pred), aes(x=x,y=y), color = c_dark) + geom_point(data = data.frame(x=x_obs, y = Z_obs), aes(x=x,y=y)) + xlab("t") + ylab("Z(t)") + labs(title = "Approximate")
f + g + plot_layout(nrow = 2)

I noticed that the lack of a nugget effect adds a lot more instability to the linear algebra.

Some quick thoughts:

  1. I’ve been using the latent variable formulation even for Gaussian outcomes, because that can speed up Stan computations.
  2. The matrix \Phi_1 \Lambda \Phi_1^T is singular whenever M < n_1, because \Lambda is M \times M and hence this matrix is at most rank M (see the quick numerical check at the end of this post).
  3. I think your program can run because of the jitter you added to the diagonal. Usually the jitter is added for numerical stability, but in this case, even when the matrix is truly singular, adding small numbers to the diagonal brings all eigenvalues above zero and hence makes the matrix invertible. But, just as you mentioned, I suspect this brings numerical instability into the results, so I’m a bit reluctant to use it. This is indeed one of the alternatives I’m considering. Adding a small constant to the diagonal might not be a bad idea, because if \Phi \Lambda \Phi^T is a good approximation of the GP covariance, the eigenvalues that I’m missing in the decomposition would all be tiny. So adding a small number to the diagonal should still give me a reasonable approximation to the true covariance.
  4. Another alternative I’ve been considering is to manually add a nugget, even though it wouldn’t be well identifiable. By using a low-rank approximation to the GP kernel, I’m necessarily underestimating the covariance (in the positive-definite-matrix sense). So if I add a nugget with a prior, and even try to “learn” its value, I can interpret it as trying to capture the part of the covariance matrix I’m missing.

I’m currently doing simulations to see whether one of these options gives better results.
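A quick numerical check of point 2, reusing the phi_mat and Delta_mat_exp_cov helpers from your code above (so this sketch assumes those are already defined):

# Phi_1 %*% Lambda %*% t(Phi_1) has rank at most M, so with M < n_1
# the n_1 x n_1 matrix is singular (trailing eigenvalues are ~0)
n1 <- 50
M  <- 20
L  <- 1.1 * 5
x1 <- seq(-5, 5, length.out = n1)

Phi1   <- phi_mat(M, x1, L)                              # n1 x M basis matrix
Lambda <- Delta_mat_exp_cov(m = M, L = L, a = 2, sigma = 1)
K_approx <- Phi1 %*% Lambda %*% t(Phi1)                  # n1 x n1, rank <= M

ev <- eigen(K_approx, symmetric = TRUE, only.values = TRUE)$values
sum(ev > 1e-10)   # effectively M nonzero eigenvalues, not n1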

Ah, I see it now. Sort of like the true GP has (theoretically) infinite flexibility to perfectly interpolate any amount of data, but the truncated approximation removes some of this, which means that a finite number of basis functions can only interpolate so many points.

I think the reason my demonstration works is actually that I accidentally set n_1 < M (50 observations versus 512 basis functions). Oops. This is a good discussion though, because I am also using HSGP for non-Gaussian outcomes.

Out of curiosity, if you implement the prediction approach in the latent variable GP section I linked earlier do you also run into instabilities?

For your interpretation … I’m thinking about it more from a feature representation perspective. We know that, e.g., the Gaussian kernel is equivalent to using an infinite number of features. With a low-rank approximation, we are trying to capture the most important M features. I’m not sure about “can only interpolate so many points”, but I guess the approximation is only accurate up to whatever the M basis functions are able to capture.

Yeah, I think the cases where HSGP works best are when M is far smaller than n_1 and n_2. If n_1 is so small that it’s smaller than M, I’d just use the exact GP instead. And if n_1 is large but the number of basis functions is even larger … I guess HSGP would still be worthwhile, since it’d still be faster than inverting an n_1 \times n_1 matrix. Interesting … so I guess one way to resolve this issue is to manually choose a larger number of basis functions than potentially needed, so that M > n_1? I’ll need to think more about whether this would create any issues …

So far my simulations are still in the Gaussian world where I have a nugget, so this isn’t an issue. My next step is to move to a logistic regression setup and see how things work there. I’ll keep you updated if I notice anything.

Oh … if n_1 is very large and M > n_1, things wouldn’t work very well, because to make predictions I’d need to invert an M \times M matrix, which would be very computationally intensive … yeah, so this would still be a problem.

This matrix would be inverted in generated quantities, which doesn’t use autodiff, so the performance hit is not as dramatic as one would assume.

About the nugget: what’s missing is a proof that the additivity of the covariance also works in the case of HSGP.

Could you elaborate on your second comment about the additivity of the covariance? My understanding is that it just naturally carries forward from the usual GP marginal distribution argument.

We can add kernels, examples here:

https://www.cs.toronto.edu/~duvenaud/cookbook/

Let’s say we have a kernel K and a nugget with standard deviation \sigma. Then we can create a new kernel K_n as follows: K_n = K + \sigma^2 I, with I the identity matrix of the same dimension as K.

y \sim GP(\mu, K_n)
is the integrated-out version of
y \sim N(x, \sigma), \quad x \sim GP(\mu, K)
with x parameter(s) in Stan.
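A tiny simulation check of this equivalence (my own sketch, with an arbitrary exponential kernel):

# y = x + noise with x ~ GP(0, K) has covariance K + sigma^2 * I
set.seed(1)
n <- 5
x_grid <- seq(0, 1, length.out = n)
K <- exp(-as.matrix(dist(x_grid)) / 0.5)   # exponential kernel, lengthscale 0.5
sigma <- 0.3
L_chol <- t(chol(K))

n_sims <- 1e5
y <- replicate(n_sims, as.vector(L_chol %*% rnorm(n)) + rnorm(n, sd = sigma))
max(abs(cov(t(y)) - (K + sigma^2 * diag(n))))   # small (Monte Carlo error only)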

Yeah I agree. I guess what I didn’t understand is, what’s the problem with this?

If it’s natural to use a nugget in my model, I’d just use it, which would solve the prediction problem I mentioned. I was more wondering what the best practice is when I don’t have a nugget in the model.

It’s similar to driving a car without shock absorbers: each bump and irregularity becomes more pronounced. I suspect that some of the noise may influence the curvature, leading to a ‘wigglier’ fit. This effect would depend heavily on the choice of kernel. For instance, a Matérn 1/2 kernel would be more sensitive to these fluctuations than an RBF kernel.

For count data, I would consider using a Negative Binomial or Generalized Poisson distribution for comparison.

There is no matrix inversion with HSGP, unless by this you mean something other than the Hilbert space approximate GP. With the HSGP approach you get basis functions, and out-of-sample prediction is like with any linear model (linear with respect to the parameters).
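For example, here is a minimal sketch of such a prediction, reusing the phi_mat and Delta_mat_exp_cov helpers from the code earlier in this thread (beta_draws stands in for a matrix of posterior draws of the M basis-function weights, e.g. extracted from a Stan fit):

# Out-of-sample prediction with HSGP: no matrix inversion, just a linear model
M <- 64
L <- 1.1 * 5
x_new <- seq(-5, 5, length.out = 200)

Phi_new  <- phi_mat(M, x_new, L)                              # 200 x M basis matrix
sqrt_spd <- sqrt(diag(Delta_mat_exp_cov(M, L, a = 2, sigma = 1)))

beta_draws <- matrix(rnorm(1000 * M), nrow = 1000, ncol = M)  # placeholder for posterior draws
f_draws <- beta_draws %*% (sqrt_spd * t(Phi_new))             # 1000 x 200 draws of f(x_new)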

Thanks for your reply. I did see in the demo code that out-of-sample prediction was done just as for a linear model. However, I’m not sure whether it properly accounts for all the uncertainty?

Also, I noticed that in the 2020 paper matrix inversion was used for prediction (see equation 37), although I have to say that I’m having trouble deriving exactly these equations …