 Approximate GPs with Spectral Stuff

I dunno if this is useful to anyone, but something I’ve found is useful with GPs in Stan is doing a spectral approximation sorta thing. The amount of stuff I see that looks like this in GP-land makes me think it’s probably pretty common somewhere, but I haven’t seen it around here and I think it’d be particularly useful in sampling land (cause of how much time sampling can take).

So if we had a Bernoulli outcome dependent on some sorta underlying latent variable that is a function of our independent variable x, we might have this model:

data {
int N;
real x[N];
int y[N];
}
parameters {
vector[N] z;
real<lower=0.0> sigma;
real<lower=0.0> l;
}
model {
matrix[N, N] Sigma = cov_exp_quad(x, sigma, l);
matrix[N, N] L;
for(n in 1:N) { Sigma[n, n] = Sigma[n, n] + 1e-10; }
L = cholesky_decompose(Sigma);
sigma ~ normal(0, 1);
alpha ~ gamma(2, 2);
z ~ normal(0, 1);
u = L * z;
y ~ bernoulli_logit(u);
}

There’s the issue that if the outcome is bernoulli, it takes a fair number of outcomes to know what’s going on, and this problem will start to get slow if N gets more than a few hundred. I was working on some other problems where I was interested in the latent variables, and things were getting slow like this. I don’t think it’s just the decomposition either. Having hundreds of parameters is hard for Stan to sample.

Anyhow, the trick for u = L * z; to work is that L needs to be a square root of the covariance matrix. Another way to get a matrix square root (I don’t think matrix square roots are unique) is to use an eigenvalue decomposition (https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Drawing_values_from_the_distribution). For RBF kernels, the eigenvalue decompositions are known (http://www.math.iit.edu/~fass/StableGaussianRBFs.pdf).

That doesn’t quite help us cause the eigendecomposition is an infinite sum (eq. 4.37 in http://www.gaussianprocess.org/gpml/chapters/RW4.pdf), but if we just choose a cutoff for that sum, we get a convenient approximation. If we do that, we can build an approximate NxM matrix L (where N is the number of data points and M is the number of eigenfunctions to include in the sum) that approximates our original RBF covariance matrix K.

What’s cool about all this is that

• We can construct our approximation to the square root of our covariance matrix directly (no decompositions necessary). It consists of Hermite polynomials, so we can do this recursively. Building the matrix is O(NM) (as compared to O(N^2) for building the full covariance matrix and O(N^3) for doing the decomposition).

• We control how many latent variables we want to estimate (so if things are really low frequency, we don’t include many)

I think this is basically the same as a Karhunen-Loeve transform for the RBF. The math that goes into this comes up ton in all this stuff that touches RBFs.

Anyway, so as a real quick practical example of this, I took Russell Westbrook’s made-missed shots from the 2015 season (or at least most of it – not sure if the dataset is complete). It’s an r/nba think to talk about how in 4th quarters of games Westbrook turns into Worstbrook (as opposed to Bestbrook), so the idea is maybe we could estimate Westbrook’s shooting percentage not as a single number, but as a function of the game time (and see the difference in Bestbrook and Worstbrook). There are 1438 observations in the dataset. I’m estimating only the length scale for the RBF. With this dataset, estimating length scale + the covariance scale leads to divergences (even in the exact model – checked on a smaller dataset). I used ten eigenfunctions in the approximation. The four errorbars show the estimates you’d see just estimating his shooting percentage by grouping on quarters (that’s a 95% interval for those). I don’t think this plot makes a terribly compelling argument for the existence of Worstbrook vs. Bestbrook, but I guess this is about the algorithm so I’ll avoid talking about the weaknesses with this model :D (2s and 3s lumped together, for starters).

The code to run that is here: https://github.com/bbbales2/basketball/blob/master/westbrook.R
The data is in that repo as well.

Anyhow, there’s a fair number of details that go into this that I haven’t listed now. Lemme run through them:

• There are two fixed parameters that go into the approximation, one is the number of eigenfunctions and the other is a ‘scale’ parameter. They are explained here (http://www.math.iit.edu/~fass/StableGaussianRBFs.pdf). The easiest way to understand what is going on is to read section 3 in that and then experiment. I have a script for this here: https://github.com/bbbales2/basketball/blob/master/eigenfunctions.R . What happens in a bad approximation is that the approximate covariance matrix goes to zero for entries corresponding to points that are far away from zero. (read next point)

• The approximation will work much better if the mean of the independent variable x is zero! The approximate kernels are not translation invariant or stationary or however you think about it! Here’s a picture of the first ten eigenfunctions (scaled by the square root of their eigenvalues). The approximation will fail if there is data out near the boundaries (< ~-1.2 or > ~1.2) cause all the eigenfunctions are near zero there: • Here is an approximate covariance matrix with the domain extended to see where the approximation fails (same expansion as above):

• You can monitor the accuracy of this approximation online. If L is the approximate square root of the covariance matrix, max(abs(cov_exp_quad(x, sigma, l) - L * L')) is the error. You can do this in a generated quantities block. In reality though, you want to choose a subset of your points x to do this. Building the NxN covariance matrix in the generated quantities block can quickly dominate the time spent in inference. Here is log10 of the max error in the approximation for the plot above (plot on the left is the Westbrook inferred length scale for the time axis – for the inference I mapped 0->48 minutes to -0.5 -> 0.5, and the length scale is relative to this): • The expansion behaves like a Fourier transform in that the first few modes are lower frequency and the later ones are higher and higher frequency. So approximating lower frequency things is going to be easier.

• The run with M = 10 (cores = 4, chains = 4, iter = 2000) takes about a minute for this data on a server computer with 2.3Ghz cores. With M = 20 it takes a little over 2 minutes (and the max error in the approx is down around 10^-14 consistently). The performance usually scales worse than linearly in the number of eigenfunctions included.

• With the eigenfunction version of the problem, you’re working with something that is more like basis functions than Gaussian processes. It’s easy to interpolate and make predictions at points away from where your data is. Each point just requires another length M dot product.

• It’s easy to get derivatives and integrals (just think of things as basis functions that happen to be a product of exponentials and hermite polynomials and have at it – derivatives can be written down analytically, probly can get definite integrals too).

• I honestly don’t know how using this compares to just using a nice set’o basis functions. Probably computationally about the same thing. I think the RBF kernel is easy to think about though, which is nice.

• Multidimensional input works out fine. You can just build two Ls for different independent variables and then do the elementwise product to get the multidimensional one (this might not be clear, but it is covered in http://www.math.iit.edu/~fass/StableGaussianRBFs.pdf). If you’re interested in sums of RBF GPs, you’ll need to estimate the latent variables of both and sum them (I don’t think there’s a shortcut for this).

Since it’s so easy to get predictive things out of this approximation, you have a ton of flexibility about using this in weird models. Here I fit a spatially varying diffusion coefficient in a simple PDE IVP (Stan model: https://github.com/bbbales2/gp/blob/master/models/diffusion_gp.stan – if anyone wants the R to run this let me know).

Equation is dc/dt = d(D(x) dc/dx)/dx
With initial conditions: c(0, x) = 0
And Dirichlet boundaries: c(t, -0.5) = c(t, 0.5) = 1)

I used method of lines discretization at like ~10 points on the x axis. Only the data at the final time is used. Grey areas are 95% confidence intervals, red line is truth used to generate the data, dark line is the mean of the posterior: Or you can estimate a concentration varying diffusion coefficient (dc/dt = d(D(c) dc/dx)/dx, and this uses 20 points on the x-axis if I recall correctly https://github.com/bbbales2/gp/blob/master/models/diffusion_state_gp.stan): Those were both done with simulated data and I chose two conveniently pretty results :D, so caveat emptor if there’s anyone that cares about diffusion constants floating around here.

Anyhow, that’s a big wall of text hopefully it’s useful to someone. I dunno what the actual name for this stuff is. This seems like maybe it’s the fancier version: http://quinonero.net/Publications/lazaro-gredilla10a.pdf

Edits: Added some more plots to explain how error in the approximation comes about and fixed a few issues I noticed.
Edits 2: Fixed a sampling statement. Tnx Arya

12 Likes

This looks super useful and should form the basis for a case-study, no?

Yeah it wouldn’t be too hard. I wanted to get this out there to see what other people thought about it.

The basketball problem is almost a little contrived. The diffusions are more realistic (one of my buddies here is doing experiments like this).

I’m not sure where something like this would be more useful than doing some other sorta basis function expansion. ax + bx^2 + cx^3 + d etc. In the basketball case, the second seems as good? Just slap a bunch of bases in there and you’re (hopefully) gonna get the same answer + have some more control on priors.

Part of the reason I was looking at GPs is that they seemed less-parametric than picking basis function sets and doing regressions on them (the term non-parametric seems weird to me in a field where all we care about are parameters in one way or another). So the advantage of this is that it comes with the length scale interpretation? Which seems dubious haha.

It’s certainly efficient with data though, which is nice.

Hi, very interesting.

How does this compare to sparse gaussian processes where one needs to find M (<<N) anchor points to approximate full gaussian process? This way I could skip the optimizing step to find the anchor points?

Haha, I should actually get unlazy and do some comparisons.

My inner lazy would like to say that choosing anchor points is hard and not do it, but it seems like choosing anchor points would be really easy in the basketball example. Maybe this is cause 1D is easy to interpolate? And it’d be harder in higher dimensions? But Gaussian processes to me seem a lot like interpolation, and I dunno if GPs in general Just Work ™ in higher dimensions anyway, so maybe that’s an unfair way to judge them.

The language at the start of Chaper 8 in Rasmussen (http://www.gaussianprocess.org/gpml/chapters/RW8.pdf) makes it sound like the Nystrom method uses M points to figure out an approximate spectrum for N points, which seems like a pretty reasonable plan, but to do that you’re running with an eigendecomposition live (M^3).

I think the key to getting these approximate methods to work is that even though they use low order approximations (M), you need to somehow still be using data from all N points. 100 shots from Russell Westbrook isn’t gonna give you a good picture of his “real” shooting percentage. There’s something in 8.3.4 of the Rasmussen book (Projected Process Approximation) where he mentions this. I assume the approximations are somehow handling this, but I haven’t coded em’ up and tried them so I guess I don’t quite know for sure how it works.

Another advantage to these rank M approximations is that they’re still actually GPs :D. They’d be translation invariant and such.

Dear Ben,

super interesting.
Is L * L’ the only way to get the variance? I think the variance may be useful, eg.
when fitting a beta-(binomial/bernoulli) distribution instead of a pure bernoulli distribution.

Is it that these Eigenfunctions don’t have a localisation? How about using discrete wavelet transformation
as basis? Cascading filterbanks and let the optimization stop at a certain level.

Andre

I think in terms of other variances and such, the most interesting stuff I’ve found is filed under the Karhunen-Loeve transformation:

I dunno how far different expansions would take you though. A wiener process is super high frequency, so you’d not expect a spectral representation to be the way to go. The eigenvalues decay only as 1/k^2. So the square root decays as 1/k, which doesn’t look good as far as convergence goes. I didn’t mention it, but to build the L you gotta scale the eigenfunctions by sqrt(eigenvalues), if that makes sense.

First time I saw the expansion stuff like that was in this book: http://press.princeton.edu/titles/9229.html (Dongbin Xiu, Numerical Methods for Stochastic Computations). Dunno why the website is down for me now. It might have some stuff in it.

I’m not familiar with wavelet transformations and such :D, but I think for this approach to make sense you gotta have a covariance matrix in mind that you really like.

There approaches are all over the stochastic process and machine learning literatures (although often you see lots of math and then questionable jumps to the utility of that math in practice). The mathematics is typically presented in terms of reproducing kernel Hilbert spaces which may help in a literature search. One example result that you may want to look into are “Fourier features” in particular “random Fourier features”.

There is a sparse gaussian process implementation from Max Joseph, which bases upon Finely:

http://mbjoseph.github.io/2016/08/14/gpp.html

The difference with choosing anchor points / knots is that if your signal contains locally useful information,
the anchor points can be placed/estimate to exploit these “higher” frequencies. An optimization
regarding the max. “information”. Whereas if we use eigenvalues, eg. decomposition of the Fourier spectrum,
we loose that localisation.

1 Like

The random Fourier feature approach that @betanalpha mentioned above seems to work nicely in Stan–here’s a little example I put together which should also probably be turned into a case study at some point: https://bitbucket.org/flaxter/random-fourier-features-in-stan

Regarding the discussion of inducing point type approaches, it’s worth mentioning also that in practice, good ol’ RBF networks can work pretty well, see a recent publication where we used them in Stan: https://arxiv.org/abs/1705.04293

…which is not to say that the explicit expansion you’re mentioning isn’t neat! I was very interested to discover it for a recent paper I’m revising at the moment (https://arxiv.org/abs/1610.08623) where we needed an explicit Mercer expansion. Fasshauer has written a whole book with McCourt about these types of things, and it’s worth checking it out: http://www.worldscientific.com/worldscibooks/10.1142/9335

@Andre_Pfeuffer and @flaxter thanks for the links to the Stan models. Makes it way easier to try this stuff out. Good stuff.

@Andre_Pfeuffer Oooh, I see what you mean with the localization. I never used wavelets before but I think I see the distinction. Makes intuitive sense with the knots too.

@flaxter (from the “About this Book”) “In an attempt to introduce application scientists and graduate students to the exciting topic of positive definite kernels…” Somehow I think you already gotta have your head in the kernel punchbowl for this to be exciting haha. That Poisson paper looks pretty cool too. I printed off a copy for reading later.

Yo @flaxter I read the Poisson paper. I’ve got a few questions if yah don’t mind. Not really related to the goal of the paper but on the RKHS stuff :D.

So when you have a domain and a positive definite kernel on that domain, you have an RKHS.

For usual RBF stuff, the domain is the real line and the kernel is exp(-(xi - xj)^2 / (2l^2)).

The paper talks about Sobolev spaces. So if we have a Sobolev space p = 2, then we’re talking about functions on a compact domain where we have control of the L2 norms of the first k derivatives of the functions. This compact domain + the norm (and the inner product that comes with the p = 2) defines the kernel, right? And if so, does this kernel have to be represented as an infinite sum? Is there no way to write it out in closed form like for the RBF?

Also, at some point in the paper you talk about change of measure stuff. When I was working with the RBF eigenfunctions it seemed weird to me that they were only orthonormal under a weight function. Is there any intuition or probabilistic interpretations for this weight function? Usually I think of the independent variables in the GPs as just fixed, but if there’s a weight function floating around it sure seems like maybe there’s an probability interpretation there (about the distribution of the independent variables). And if there’s a probability it sure seems like something that I’d put in my Stan models.

Something I guess I missed with my example is that the RKHS is domain + kernel (not just kernel). Clearly the domain of the basketball example is bounded, so it seems like I should be modeling it differently. Maybe that would affect how the uncertainties get relatively large at the boundaries (which seems like a fluke with the model rather than how a real basketball player would play).

1. There’s a form for that kernel in terms of Bernoulli polynomials (see Wahba 1990 chapter 2, where I got those from, and by the way, this is a really great book, if you want to look at one book about RKHS this is the one!), but I don’t think there’s a closed form expression for Bernoulli polynomials. Maybe there’s a good approximation? In any case, I didn’t find these to be terribly useful kernels because I couldn’t figure out a sensible way to change the lengthscale. I’d use something else for a periodic kernel.

2. For a particular Mercer expansion, there’s a particular measure (weight function), but there’s nothing special about this particular measure as RKHSs are unique. See Section 6.2 of http://www.stats.ox.ac.uk/~sejdinov/teaching/atml14/Theory_2014.pdf

3. Most people treat the index set (independent variables) of the GPs as fixed, but you could imagine putting a distribution on them. I’ve seen this done for example in GP-LVMs, but I’m sure it shows up in other places. I don’t think this is directly related to the measure above.

4. I guess the answer to your question about domain + kernel vs domain is that if you have prior information on the domain being bounded then it’d be good to use that information somehow, and if you can put it into the kernel, great. I don’t know of someone making this connection using the underlying RKHS measure, but it’s not a bad idea! For my paper thinking about the domain was unavoidable because you need to define the domain for a Poisson process and integrate the intensity over this domain to calculate the likelihood. But I will certainly admit that I usually ignore the domain issue when it comes to using GPs more generically.

Good stuff thanks for the answers.

One more question, and this one might be a bit toady, but I guess it has to do with interpreting whatever function comes out of this as a Gaussian process or not. Is there a situation where the underlying randomness is not a bunch of normal distributions? Or is there a situation where we aren’t building a covariance matrix to plug into multi_normal?

I had an SDEs class and everything in there seemed to be about transforming randomness to look like Brownian motions, and the Brownian motion itself was motivated by how physically reasonable it seemed (and presumably how nice the Math was courtesy that).

It seems like the Gaussian process interpretation has me locked in to normal priors on the latent variables. If I were to interpret the fits I’m doing as just a bunch of basis functions with linear coefficients though, I’d be able to play all sorts of games with the priors (like we do in regular linear regressions).

That make sense? I could just be missing something obvious.

No. That is literally the definition of a Gaussian process!

It’s hard to have any other measure on a Hilbert space (let alone a space with even less restrictive topology). One way of thinking about it is that as the dimensionality increases it’s nearly impossible for a central limit theorem to not kick in.

cool, thanks

No longer a GP, but there are other stochastic processes with non-normal increments such a Cauchy process, or Lévy processes more generally. I’ve seen them used in ecology, for example studying patterns of optimal foraging behavior. Let me know if you figure out how to infer these in Stan! It’s on my todo list, but haven’t had a chance to look into it yet.

@betanalpha Haha, fair enough, makes sense.

@aaronjg Oh yeah, that’s what they’re called. I checked out a book on Levy processes from the library once. That was some heavyweight stuff! I’m not sure I made it past the introduction. I saw in a guy’s thesis once where he fit a Levy process and showed how it outperformed something else. I cling to that plot as hope that there’s something there worth knowing haha.

Yeah if you get around to figuring out how to fit a Levy process, lemme know. Sounds groovy.

Those who are interested in spectral approximations for GPs should check this https://arxiv.org/abs/1611.06740
It has a lot of useful explanations, and although they favor variational approach, I talked with James Hensman, and it would be possible to pick the useful parts for Stan, too. If someone is interested, I can send some simplified pseudocode James wrote for me (but which is not included in the manuscript).

Aki

2 Likes