Approximate GPs with Spectral Stuff

Send it on, I’m curious to see. I’ll get a chance tomorrow or the day after to read through the paper. It looks interesting.

Me and my buddies have been reading about RKHSs all last week. I just got a new GP with a bounded domain sampling on the Westbrook problem (the goal is to see if anything changes). I’ll post the results when I get it working with the full dataset.

1 Like

Note that these are not something you can directly translate to Stan as James was using slightly different inference.

James Hensman wrote

I’m including two algorithms: the first is “variationally exact”, in the sense that it samples from the optimal distribution given the choice of Fourier basis. The second is simpler: it tells you how to generate a sample from the “approximate prior”. This is probably the easiest thing to get going in Stan, but I might be wrong.

I assume a 1D function. I think you will be able to apply to the additive case easily.

I assume a Matern 1/2 kernel for simplicity. I think you will be able to extend to the 3/2 and 5/2 cases easily.

numbered equations are from this doc:
https://arxiv.org/pdf/1611.06740.pdf

Let me know if this is clear (note that I’m away until Monday).

James.

==========
Algorithm 1.
( As seen in VFF for GP paper)

inputs:

  • Matern kernel parameters : sigma, lengthscale
  • Data set : X, Y
  • factorizing likelihood p(y_i | f_i)
  • v : a vector representing the GP latent variables
  • the number of Fourier basis to use : M
  • the bounds of the box on which the Fourier features are comnputed: a, b

returns:

  • L: the unnormalized log density that we want to sample
  1. Assert a < x_i < b \forall i
  2. Assert len(v) == 2 * M + (2 * order of Matern kernel)
  3. Construct alpha, beta (parts of Kuu), as equations 59, 60.
  4. Construct R from alpha and beta (eq. 72)
  5. u = Rv
  6. Kuu = diag(alpha) + beta * beta^T
  7. construct Kfu:
    • omega = [2 * pi * m / (b-a) ] m=1…M
    • Kfu_sin [m, n] = sin(omega_m * x_n)
    • Kfu_cos [m, n] = cos(omega_m * x_n)
    • Kfu = cat( Kfu_sin, Kfu_cos)
  8. mu = Kfu * Kuu^{-1} * u
  9. var = diag (Kff - Kfu Kuu^{-1} Kfu^T)
  10. L = log_mvn(v | 0, I)
  11. for i-=1…N:
    q(f_i) = N(f_i | mu_i, var_i)
    L += E_q(f_i) [ log p(y_i | f_i) ] (perhaps with quadrature)
    ==========
    End of Alg. 1
    ==========

==========
Algorithm 2.
Draw an approximate GP sample at some points X
valid if a << x_i << b and M is big enough relative to the inverse-lengthscale

inputs:

  • Matern kernel parameters : sigma, lengthscale
  • Data input points : X
  • the number of Fourier basis to use : M
  • the bounds of the box on which the Fourier features are computed: a, b

returns:

  • f: an approximate sample from the GP prior
  1. for i=1…(2*M+1):
    v_i ~ Normal(0, 1)
  2. Construct alpha, beta (parts of Kuu), (eqs 59, 60)
  3. Construct R from alpha and beta (eq. 72)
  4. u = Rv
  5. Kuu = diag(alpha) + beta * beta^T
  6. construct Kfu:
    • omega = [2 * pi * m / (b-a) ] m=1…M
    • Kfu_sin [m, n] = sin(omega_m * x_n)
    • Kfu_cos [m, n] = cos(omega_m * x_n)
    • Kfu = cat( Kfu_sin, Kfu_cos)
  7. f = Kfu * Kuu^{-1} * u
    ==========
    End of Alg. 2
    ==========

Notes:

You will need to take are to ignore the basis function at sin(0). Obviously it doesn’t do anything, and the corresponding element of Kuu can make things nasty.

I have ignored the fact that Kuu is a low-rank-plus diag matrix in some places: obviously care needs to be taken to ensure efficiency.

Once you have solved (efficiently!) Kuu^{-1} * u, the multiplication of Kfu on the left is precisely the non-uniform-fast-Fourier transform (NUFFT). This can be done very efficiently with minimal loss of accuracy, but I don’t have the expertise to implement it. My implementation is O(NM).

For anyone who digs up this thread in the fuuuuuutuuuuure.

I’ve been reading about RKHS a bunch since the original post. Two resources (other than the ones already listed) that have been really nice are:

  1. Yuedong, Wang’s “Smoothing Splines” book. Yuedong was a Wahba student, so it’s from that lineage. The book solves problems mostly from the perspective of optimization, but it’s got a ton of neat examples (and an R library, assist, that’s easy to use), and the way he goes about the RKHS (talking a lot about projecting things here or there) felt different enough from the other stuff.

  2. Steven Lalley’s “Introduction to Gaussian Processes” (a random pdf on line, as far as I can tell). The feel to this introduction is much more SDE-ish, so it’s clearer how probabilities work in to everything, but he’s got enough of the RKHS stuff that it’s not too hard to link it back to the Wahba/Yuedong stuff.

I repeated the Westbrook shooting % exercise with a model from Yuedong’s book to compare to the approximate thing. I also took the RFF code Flaxter posted and adapted it to this task.

The models are here:
Random fourier features: https://github.com/bbbales2/basketball/blob/master/models/rff_bernoulli.stan
Yuedong: https://github.com/bbbales2/basketball/blob/master/models/bernoulli_bernoulli_yuedong.stan

The code to run them is here:
Random fourier features: https://github.com/bbbales2/basketball/blob/master/westbrook_rff.R
Yuedong: https://github.com/bbbales2/basketball/blob/master/westbrook_yuedong.R

Just to be clear, the Yuedong model here is a full GP. No approximations. The Yuedong model chains took between 5-7 hours to run. The Random fourier features model with k = 80 took a tiny bit under an hour. The approx GP method takes a few minutes. So they’re all in the realm of feasibility for this data set, though clearly they perform a bit differently.

Worth noting, the Yuedong and R fourier features models had no divergent transitions (4 chains, 2k iter). The approx GP method consistently had 40-50~, even after I stopped trying to estimate the covariance scale parameter (the hyperparameter other than the length scale and iid noise – it causes there to be 100~ + divergences).

Interpret the results at your own risk :D!

2 Likes

Dear Ben,

hesitating to answer a great post, but it’s worth more to give a response than none. Currently, adopting your
program to a GLM problem, as far as I can tell already, the run time of resent approximation is much longer
according to the first post. Maybe it has to do with the cos, sin. These functions are vectorized now, BTW.

I think Aki’s post is very promising. I experienced, that the Matern Covariances behave more
nicely than the RBF kernel. Studying the Kalman recently, what you think of putting a process noise,
before using the log-likelihood?

I’m asking myself, what benefit these approximations have against splines? It’s very easy to model splines
with R’s library splines. Just use eg. ns(x, df=10) for natural splines or bs() for b-splines and one gets the corresponding
design matrix.

@Andre_Pfeuffer Yeah, I think the sines n’ cosines could go faster. I didn’t put too much effort into it :P. My gut feeling is it’s fast. It should be at least as fast as the approx. GP thing and be decidedly better on different data (for the same reason different basis work better in different places).

I experienced, that the Matern Covariances behave more nicely than the RBF kernel

That’s interesting. I need to get around to doing some experiments with these.

Studying the Kalman recently, what you think of putting a process noise, before using the log-likelihood?

What would this look like in Stan code? I think we’re trying to put process noise in the covariance multi-normal. I’d be a little hesitant to put it somewhere else.

I’m asking myself, what benefit these approximations have against splines? It’s very easy to model splines with R’s library splines

Yuedong’s stuff is splines. The answer to what’s better? I don’t know. Esp. considering these spline models in their optimization form take about zero seconds to run haha. I think if you cared more about your model than I do about Russel Westbrook’s shooting percentage, then you’d probly end up wanting to sample instead of optimize and the differences in splines vs. RBF/Matern GPs might be more clear in the context of the application.

I suspect splines aren’t gonna work as well as RBFs and the like for higher dimension inputs (even if the models do behave nicely in 1D). And ofc. RBF probly only works so far too. I’ve been reading these RKHSs from the perspective of replacing RBFs with polynomials, but I have a book on my desk which is about replacing polynomials with RBFs, so I guess it goes both ways haha.

For efficiency take attention to the different way of parameterization:

d = fabs(x[i] - x[j])
sigma = variance
https://en.wikipedia.org/wiki/Mat%C3%A9rn_covariance_function
Also check correctness before use.

  real Matern12(real d, real rho, real sigma) {
    real d_times_rho = d * rho;
    return sigma * exp(- d_times_rho);
  }
  real Matern32(real d, real rho, real sigma) {
    real d_times_rho = 1.73205080756887729352744634 *  d * rho;
    return sigma * exp(- d_times_rho) * (1 + d_times_rho);
  }
  real sqrt5() {
    return 2.2360679774997896964;
  }
  real Matern52(real d, real rho, real sigma) {
    real d_times_rho = d * rho;
    return sigma * exp(- sqrt5() * d_times_rho) * (1 + sqrt5() * d_times_rho + 5.0 / 3.0 * square(d_times_rho));
  }

(It was too late last night, my mistake.) I meant sensor noise

 vector[N] noise;
 real<lower=0> sigma_noise;
 noise ~ normal(0, 1);  // or student_t
 sigma_noise ~ cauchy(0, 1);
  y ~ bernoulli_logit(fhat + noise * sigma_noise);

Yuedong’s stuff is splines. The answer to what’s better? I don’t know.

Thought again: If using splines the “knots” are fixed a-prior, but if you sample in approximation GP’s
each iteration you’d normally sample new knots (anchor points), in MCMC style, eg. in Gibbs sampling.
Ideally the number of “points” M in first first post should vary, to have a trade-off between the frequencies.
As far as I remember correctly, Michael have a paper out, that is not “working well” in HMC.

1 Like

I’m trying to read a bit more on this. Where does this come from in the Yuedong book? Is this different from the exact GP model that you tested on the smaller dataset?

If you have a copy of the book, page 22, section 2.6. It’s really Wahba stuff I think.

The exact GP on the smaller dataset was referring to an GP with RBF covariance. Mathematically they are definitely different models (the Yuedong stuff is on [0, 1], GP is on R), but the practical differences might be more subtle (edit: change ‘more subtle’ to ‘different’).

got it. So the Wahba stuff is a covariance kernel over [0,1], and the GP stuff is defined on R, then transformed to [0,1] using inv_logit.

Why is the Wahba stuff exact? It seems if it only goes up to the fourth Bernoulli polynomial, it would also be an approximation.

then transformed to [0,1] using inv_logit.

Nono, I mean the domain is [0, 1] for the Wahba stuff and R for regular RBF GPs. They both can have anything for their outputs.

Maybe exact is the wrong word for a non-parametric model fit on some data that comes from a process I don’t understand :D, but I’m not having to do numerical approximations to implement it and use it on that data. I think that’s what I really meant.

For lack of better citation, it’s exact in the sense here: Fitting GP with noisy observations of time . If we try maybe we can summon a @anon75146577 . I had trouble connecting all the math-bits together. Everyone approaches these things differently, but from a random variable perspective (to compliment the optimization approach of Wahba) I really liked Steven Lalley’s “Introduction to Gaussian Processes”. Check eq. 3.3 and the stuff around it for a wild time.

Ah, thanks for clarifying. My intuition is that when m = n, all of the approximations would become exact, but with different posterior geometries, which could be useful in reformulating models to avoid divergence transitions, but I’d need to work through all of the math to confirm.

I wish I had saw this stuff last week. I was just in Santa Barbara for a program at KITP - it would have been great to meet up.

I wish I had saw this stuff last week. I was just in Santa Barbara for a program at KITP - it would have been great to meet up.

Oh dang! Missed connections. Next time you come through, send me a message. We get up to all sorts of modeling shenanigans here.

It depends on what you mean by “Wahba stuff” and what you’re trying to compute.

You will not get the right answer* if you use the basis she expands the solution to the spline optimisation problem in (which she derives using the representer theorem for the RKHS). You will get the exact answer if you either use the appropriate covariance function or if you use the Markov property in clever ways.

  • right answer: posterior corresponding to the GP prior that Wahba uses to get her MAP estimate.

This is truth. The efficiency of spline models comes from the fact that the underlying probability model is Markov, which means that you can express the solution using a local basis (ie you can evaluate the spline without having to “touch” every data point).

As the dimension grows, so too does the size of the neighbourhood (ie how many nearby points you need to use to evaluate the spline) that you need to ensure that your posterior is continuous.

this can best be seen by considering models on lattices, that is take [0,1]^d and divide it into N^d blocks and consider the types of functions you get by letting N->infinity. The following things are true:

  • When d=1 (ie a line), if you only use the first-order neighbours (the one to the left and the one to the right) you will get a posterior that is continuous (but not differentiable. Think Brownian motion)
  • When d=2, only looking at first neighbours (north, south, east, west) will not give you a continuous function. You need to look at your neighbours and your neighbours’ neighbours.
  • In general, the posterior you get as N-> infinity will have k-d/2 derivatives if you consider k-th order neighbourhoods (if you do it the right way, that is).
  • So as d grows, to keep Wahba splines working right, you need very high-order nighbourhoods which wipes out the computational benefits.

If you want to maths this differently, k is the power of the Laplacian in penalised likelihood.

There are, of course, things you can do differently, like taking tensor products of d one-dimensional splines. This will be computationally cheap, but the resulting GP is very very smooth.

I would be very keen to see an implementation of the Variational Fourier Features algorithm in Stan, particularly if it allowed for multiple dimensions in the independent predictive variables.

@bbbales2

As I understand it, a Levy process can always be decomposed into a gaussian process and a point process with jumps. Only Gaussian processes are continuous. See this section:

Yes…

Although a perhaps more useful way of constructing (a subclass) them is through subordinated GPs (i.e. Have an independent random time-change). See type-G Levy processes here or google

@anon75146577 That’s a nice paper, thanks for the link. If you wanted to actually do inference on them… I was thinking about using this jump-diffusion fact as a way to formulate things for Stan. The main things of interest are

  1. Diffusion rate sigma,
  2. Rate of jumps r
  3. Distribution of jump sizes (something you need a model for)

Imagine D is a vector of data, r is the rate (assume constant, but r could be a function of time)

for i in 2:N {
   target += log_sum_exp(normal_lpdf(D[i] - D[i-1] | 0,sigma) + log1p(-r), ....)
}

Where … is the distribution of jump sizes + log(r)

Just replace sigma with sigma[i] and put a horseshoe on the jump size if you want a process with jumps (this is type-G)

Jump times are a point process which is a high-dimensional discrete parameter, so you’ve got to sum it out somehow if you want that sort of process

Replying from the fuuuuuutuuuuure … :)

Anyway, I tried to make sense of https://github.com/bbbales2/basketball/blob/master/models/rff_bernoulli.stan in light of the papers “Random Features for Large-Scale Kernel Machines” and “On the Error of Random Fourier Features,” and I think there may be a mistake.

I noticed that the factor scale is given as sqrt(2.0 / N), where N is the length of the input data. Judging from the paper “On the Error of Random Fourier Features”, scale should be 1/sqrt(K), where K is the number of draws of omega. Notice that in equation (1) of that paper, the variable D is twice the number of draws K, since the subscript for omega there only goes to D/2, so sqrt(2/D) corresponds to 1/sqrt(K).

Also, I’m still unsure how fhat in your Stan file is an approximate sample from a Gaussian process. It looks like fhat[i] is the dot product of z(x[i]*bw) and w = [beta1[1] beta2[1] ... beta1[K] beta2[K]], where z(…) is the vector from Algorithm 1 of “Random Features for Large-Scale Kernel Machines” and beta1 and beta2 are from your Stan file. Judging from the 4th or 5th paragraph of the paper “Random Features for Large-Scale Kernel Machines”, that dot product is supposed to be an approximation to a draw from a Gaussian process (I think?), but the paper seems to gloss over how that dot product and the draw relate. Maybe I’m just unfamiliar enough with Gaussian processes to not get the connection.