# Hermite polynomials and their inverse Matrix (continued from Approximate GPs with Spectral Stuff)

#1

Continuing the discussion from [Approximate GPs with Spectral Stuff]
(Approximate GPs with Spectral Stuff)

Checking the code of this wonderful post: Its correctly implemented. I’m struggling
with the inverting the approximation matrix produced by approx_L.R.
If I calculated the eigenvalues with eigen(L%*%t(L)). They are off
the ones in the paper. The approximation however is accurate.
I need to use the inverse to calculate the Analytical Form of Joint Predictive Inference.
The Predictive Inference with a Gaussian Process doesn’t lead to good results
in my case.

Where is my mistake?

The paper is referenced here:

Here’s my verification so far.

library(mpoly)
cov_exp_quad = function(x, sigma, l) {
outer(x, x, function(xi, xj) { sigma^2 * exp(-(xi - xj)^2 / (2 * l^2)) })
}
approx_L.R = function(M, scale, x, sigma, l) {
epsilon = sqrt(1 / (2 * l^2))
alpha = 1 / scale
beta = (1 + (2 * epsilon / alpha)^2)^0.25
delta = sqrt(alpha^2 * (beta^2 - 1) / 2)

N = length(x)
Ht = matrix(0, nrow = N, ncol = M)
xp = alpha * beta * x
f = sqrt(epsilon^2 / (alpha^2 + delta^2 + epsilon^2))
Ht[, 1] = sqrt(sqrt(alpha^2 / (alpha^2 + delta^2 + epsilon^2))) * sqrt(beta) * exp(-delta^2 * x * x)
Ht[, 2] = f * sqrt(2) * xp * Ht[, 1]
for(n in 3:M) {
Ht[, n] = f * sqrt(2 / (n - 1)) * xp * Ht[, n - 1] - f^2 * sqrt((n - 2) / (n - 1)) * Ht[, n - 2]
}

sigma * Ht
}

l <- 1.0 /sqrt(2)
scale <- 1.0

epsilon = 1 / (sqrt(2) * l);
alpha = 1 / scale;
beta = (1.0 + (2.0 * epsilon / alpha)^2)^0.25;
beta <- (1.0 + (2.0*epsilon/alpha)^2)^0.25

delta =  alpha* sqrt((beta^2 - 1.0) / 2.0);

M <- 10
la<-sqrt(alpha^2/(alpha^2+delta^2+epsilon^2))*(epsilon^2/(alpha^2+delta^2+epsilon^2))^(0:(M-1))

X <- matrix(NA, length(x), M)
for(n in 1:M) {
X[,n] <- sqrt(beta/2^(n-1)/gamma(n))*exp(-delta^2*x^2)*as.function(hermite(n-1,kind="h") )(alpha*beta*x)
}

L2<-X %*% diag(la) %*% t(X)
L2.s<- X %*% diag(1/la) %*% t(X)

L<-approx_L.R(M,scale,x,1.0,l)
L2.inv <-  (cbind(X,matrix(0,100,90))) %*% diag(c(1/la,rep(0,90))) %*% t( cbind(X,matrix(0,100,90)))
L2 %*% L2.inv

#2

Sup Andre! Going to Helsinki?

My math is pretty questionable, but I’ll give it a go. My guess is this has something to do with the difference between the integral eigenvalue problem (which the eigenvalues eqs. in that paper are for)

\int K(x, z) \phi_n(z) p(z) dz = \lambda_n \phi_n(x)

And the matrix eigenvalue problem that eigen is solving

\hat{K} v_n = \lambda_n v_n

for \hat{K} being an approximation to K at a bunch of points. There’s that weird weight function (p(z) they call it in that paper) that pops up that is one difference. I dunno much about it.

If it’s not clear for the RBF (and dunno why it would be) why these two problems would be different, then think about if you had K(x, z) = 1 for (x^2 + z^2) < 1 and K(x, z) = 0 otherwise. If you built a square matrix from this by choosing points all within that circle, then your \hat{K} is a matrix of ones, which definitely seems like it’s gonna have different looking eigenvectors/eigenvalues than the continuous K thing. That was some High Quality Math :P.

If you tried to think about what the eigenvalues of K(x, z) = 1 would look like (1 everywhere), then it’s clear why that weighting function p(z) needs to be there (to make sure stuff is actually integrable).

Could that be the difference?

#3

I’d loved to go, missing the sponsor. :-(

I don’t understand the paper in some details. I tried myself with voodoo haha,
about scaling: alpha / sqrt(pi) * exp(-alpha^2*x^2) / la
nothing makes sense.

I not only need the eigenvalues for inverting, but also for diagonal element updates of the matrix,

since A = Q * (E + diag(lambda)) * t(Q),
should be A + diag(lambda)

#4

Bummer, dunno if I’ll be there either.

Huh, dang, yeah since the approximate square root isn’t triangular or anything like that, I’m not sure it helps any with these things. And they definitely aren’t similarity transforms or anything nice like that either.

That book @flaxter linked to (https://www.worldscientific.com/worldscibooks/10.1142/9335) had tons of info in it about RKHSs and stuff. You might thumb through the index and see if it gives you any ideas.

What is it you’re going for?

#5

sure the approximate square root L isn’t triangular, but the approximate Matrix A = L * L^T
is. Thus the eigenvalues M + 1 … N should be positive and close to 0. If 0 the matrix is singular.
It would be better to construct the matrix like this, that diag(A) = I. Eigenvalues of M+1 … N = eps >0.

I adapted following paper to be able to use distributions like logit, poisson, etc.
I discovered that these GLM stuff usually overfits. A noise parameter is need.
I think its called nugget parameter.

However, the eigenvalue/vector decomposition is slow and not very stable. This is
not the case, if we could construct a reasonable eigenvectors/-values decomposed
matrix beforehand.

#6

If L is the L from a Cholesky decomposition, yeah, that’s lower triangular, but I dunno if you can easily get to them from non-square matrices from the expansion.

Oooh, I see. Do you need all the eigenvalues/eigenvectors? Or just the first few?

There’s a section on using SVDs in that Fasshauer book, but I dunno if they’re coming at it from the same angle as you. SVDs certainly aren’t cheap.

How many hyperparameters you have on these GPs? If by some accidental luck you only have like one or maybe two, then maybe you could try building lookup tables to interpolate on the quantities you want?

I did this once for some 1D RBF GPs. The input to the lookup table was the length scale and the output were the Cholesky factors (so you don’t have to run Choleskys[ies?] at runtime – just interpolations). I just stole some eqs. for cubic interpolation off Wikipedia and added some custom autodiff.

Just going by this: Bad LKJ Initialization in model from “Fast hierarchical Gaussian processes”, it looks like you have one hyperparameter? bw1? var1 could come outside the eigenvalue decomposition? I dunno. You can do a better job than me of knowing if this makes sense or not with what you’re doing.

I talked to Dan and Aki about this and they basically said this sorta trick isn’t too uncommon, it’s just really limited because interpolation is hard beyond a couple dimensions.

Anyway here are some plots of the lookup table GP vs. brute forcing the full GP:

And here is the stan model: https://github.com/bbbales2/gp/blob/master/models/cubic_interpolated_gp.stan

I did the interpolation with some custom C++ stuff. Matrix stuff can really clog up the autodiff stack: https://github.com/bbbales2/gp/blob/master/models/cubic_interpolated_gp.hpp

And here is the R test code: https://github.com/bbbales2/gp/blob/master/test_interpolate.R

#7

And there’s no way that doing cubic interpolation on orthonormal matrices makes more orthonormal matrices :/, but maybe there’s a way to rephrase things a bit

#8

Thanks Ben,

I don’t want to stress that topic too much. I think most is said already. Thanks!!
I’ll test the interpolation, looks promising.

The large eigenvalues/eigenvectors are associated with low frequencies, so we would only
need the largest 5 - 10 maybe. The high frequency noise could be absorbed by a random intercept
eg.
I have the feeling we are reinventing splines, thin plate splines (R-library mgcv) is close to that
what we are doing here.

For those who read this post in the future. I figured out a patch:

Ben’s approx_L can be used by determining the length-scale scale
In another Stan model a eigenvalue/eigenvector decomposition
in transformated data section with this fixed scale parameter.

Cutting small eigenvalues and do the sampling according to:

The approx. inverse can be computed by:

Overall one might prefer splines or wait. Stan’s parallel support is coming soon.