Question about Robustness of Gaussian Process Regression under Model Misspecification

Hello,

I am interested in fitting the following Gaussian process regression model: y_i = f(x_i) + e_i, where x_i is a p-dimensional vector of covariates and e_i is iid N(0, sigma^2). Note that f(.) is an unknown function of the covariates, and I want to use Gaussian process regression to nonparametrically estimate f, then test whether f is significantly related to y (i.e., is the mean of f different from 0? Is its variance different from 0?). Basically I want to use GP regression to conduct a nonparametric global omnibus test of whether my covariates X are significantly related to Y.
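For concreteness, here is a minimal Stan sketch of the model I have in mind. The squared-exponential kernel and all priors are placeholders (I have not committed to a kernel yet), so treat this as a sketch rather than a final model:

```stan
data {
  int<lower=1> N;                 // number of observations
  int<lower=1> P;                 // covariate dimension
  array[N] vector[P] x;           // covariates
  vector[N] y;                    // response
}
parameters {
  real<lower=0> rho;              // length-scale
  real<lower=0> alpha;            // marginal SD of f
  real<lower=0> sigma;            // noise SD
  vector[N] eta;                  // standardized latent GP values
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho)
                     + diag_matrix(rep_vector(1e-9, N));  // jitter for stability
    f = cholesky_decompose(K) * eta;                      // non-centered GP
  }
}
model {
  rho ~ inv_gamma(5, 5);          // placeholder priors
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  eta ~ std_normal();
  y ~ normal(f, sigma);           // the iid N(0, sigma^2) noise assumption
}
```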

I notice in the Gaussian process regression literature that the noise is often assumed to be iid N(0, sigma^2); however, this assumption may be unrealistic in practice. It is my understanding that if this noise assumption is violated, then any testing I do on f will likely have poor frequentist coverage rates (i.e., inflated Type I error rates).

Therefore, my question is as follows: is anyone aware of any literature that discusses the properties of the Gaussian process regression model under model misspecification (e.g., when the noise assumption is incorrect)? In particular, is it still possible to conduct valid inferences on f even when my noise assumption is violated? If not, is there any way to modify the GP model (perhaps through some sort of robust sandwich covariance estimator) to obtain more robust inferences on f given non-iid noise?

Thanks!

I doubt very much that this is a well-posed problem, unless you have a very, very specific form for the covariance function used to generate f.

Note that it’s always possible to define a function f*(x_i) such that f*(x_i) = y_i at the specific sample points you have, provided they’re distinct. The question comes down to whether sample paths that go exactly through those points are favorable or unfavorable under your covariance function. If you force these functions to be very smooth, by confining your covariance function to long length-scales, or making it periodic, or something similar that makes sense in your application, you stand a much better chance of success.
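To make this concrete: for a strictly positive-definite kernel k and distinct inputs, the noise-free GP posterior mean is exactly such an interpolant (a standard identity, stated here as a sketch):

$$
f^*(x) = \sum_{i=1}^{n} \alpha_i \, k(x, x_i), \qquad \alpha = K^{-1} y, \qquad K_{ij} = k(x_i, x_j),
$$

so that $f^*(x_j) = (K K^{-1} y)_j = y_j$ at every training input. Whether such sample paths are probable or improbable is governed entirely by the length-scale and smoothness your covariance function encodes.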

I don’t understand your concern (although, granted, I am very new to GP regression). I have not yet decided which kernel function I will use, but I am pretty sure there are many simple kernel functions that would give reasonable results for my application.

If I am understanding you correctly, your concern seems to be that the GP model could potentially overfit the data. However, it is my understanding that one of the main benefits of Bayesian GP models is that they “automatically account for the trade-off in model complexity and model fit” without having to use computationally expensive methods like cross-validation. Specifically, see Carl Rasmussen’s chapter on Gaussian Processes in Machine Learning in this book, which says the following:

Due to the fact that the Gaussian process is a non-parametric model, the marginal likelihood behaves somewhat differently to what one might expect from experience with parametric models… Indeed, the log marginal likelihood consists of three terms: The first term is a complexity penalty term, which measures and penalizes the complexity of the model. The second term is a negative quadratic, and plays the role of a data-fit measure (it is the only term which depends on the training set output values y). The third term is a log normalization term, independent of the data, and not very interesting…

Note that the tradeoff between penalty and data-fit in the GP model is automatic. There is no weighting parameter which needs to be set by some external method such as cross validation. This is a feature of great practical importance, since it simplifies training.
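For reference, here is the decomposition being described, written out (this is the standard identity for a zero-mean GP with training covariance $K_y = K + \sigma^2 I$ over $n$ points, not a verbatim quote from the chapter):

$$
\log p(y \mid X, \theta)
= \underbrace{-\tfrac{1}{2}\log\lvert K_y \rvert}_{\text{complexity penalty}}
\;\underbrace{-\,\tfrac{1}{2}\, y^{\top} K_y^{-1} y}_{\text{data fit}}
\;\underbrace{-\,\tfrac{n}{2}\log 2\pi}_{\text{normalization}}
$$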

If I’m understanding Rasmussen correctly, he seems to be saying that the Bayesian GP model automatically accounts for the trade-off between model complexity and model fit (i.e., it will avoid overfitting). Therefore, overfitting doesn’t seem to be an issue here. However, the issue I’ve raised still remains: can inferences about f be invalid if my iid noise assumption is violated? How robust are inferences about f under a misspecified noise model?

I’m not sure what Rasmussen was trying to say, but it’s very much not correct that Bayesian GP models automatically account for this trade-off. (As things that aren’t true go, it’s somewhere between “everyone is good at karaoke” and “the moon is made of cheese”.)

In the next few days/weeks, @betanalpha will publish a case study about how Bayesian inference with GPs requires some thought (as well as how to do it).

So I’m not sure that GPs are an appropriate tool for what you’re trying to do. They can probably do it, but it’s “expert level” GP work.

<TECHNICAL BIT>
For example, if you want to know if a GP is above a certain level (like 2/sqrt(n), which you would typically use for a pointwise test), it’s not enough to just compute Z-scores (or their equivalent) because you actually have to look at the function at an infinite number of points. The multiple testing correction for this is difficult to work out (the standard method involves very hard geometry). There are more mathematical delicacies as well.
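(One pointer for the “hard geometry,” stated as a sketch rather than a recipe: a standard approach controls the error over all inputs at once via the expected Euler characteristic of the excursion set of the field,

$$
\Pr\!\Big(\sup_{x \in T} f(x) \ge u\Big) \;\approx\; \mathbb{E}\Big[\chi\big(\{x \in T : f(x) \ge u\}\big)\Big],
$$

in the sense of Adler and Taylor’s random-field theory.)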
</TECHNICAL BIT>

Model misspecification makes all of this stuff even harder.

But there’s nothing fundamental about iid Gaussian noise in GP regression. So if you want to use a more appropriate observation model, use it. Mathematically, you can put a GP into almost any position in almost any statistical model. Practically, you need to be sure that there is a lot of information from the data flowing into the bit of the model with the GP and you need to be very careful to avoid overfitting.
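For example, here is a hedged sketch of the same latent-GP regression as in the first post, with the normal observation model swapped for a heavier-tailed Student-t one (the kernel and priors are again placeholders, not recommendations):

```stan
data {
  int<lower=1> N;
  int<lower=1> P;
  array[N] vector[P] x;
  vector[N] y;
}
parameters {
  real<lower=0> rho;              // length-scale
  real<lower=0> alpha;            // marginal SD of f
  real<lower=0> sigma;            // noise scale
  real<lower=1> nu;               // Student-t degrees of freedom
  vector[N] eta;                  // standardized latent GP values
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho)
                     + diag_matrix(rep_vector(1e-9, N));  // jitter
    f = cholesky_decompose(K) * eta;                      // non-centered GP
  }
}
model {
  rho ~ inv_gamma(5, 5);          // placeholder priors
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  nu ~ gamma(2, 0.1);
  eta ~ std_normal();
  y ~ student_t(nu, f, sigma);    // heavier tails than iid normal noise
}
```

Nothing about the GP prior on f changes; only the observation model does.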


This.

The entire motivation of Stan is to be able to build models that capture all of the structure you want, in order to minimize the threat of misspecification. You will still have to take great care to ensure that the model is well-posed and that the computation is smooth enough to yield accurate fits, and you will still have to vigorously check predictive performance to quantify possible misfit, but building the right model will give you a massive head start towards getting the valid insights you’re after.


I think what Rasmussen was saying is that the possibility of avoiding overfitting is built into the method and doesn’t require extra machinery bolted on, kind of like how the possibility of jumping your motorcycle across the Grand Canyon is built into the physics of projectile motion… but it’s not automatically a safe stunt to perform.

The way you’d go about avoiding overfitting is to think up some very specific covariance functions, narrowly constructed to give you a particular family of functions that makes sense when put in the position of f(x). The notion of “making sense in the position of f(x)” needs to be something you already have pretty strong opinions about, or you’re likely out of luck.

@anon75146577 Are there any papers or resources you’d recommend for learning how to avoid overfitting with Bayesian GP regression models? If I simply put vague/flat priors on the hyperparameters of my kernel function, would this be enough to prevent overfitting?

@dlakelan
Let me try to explain what I think Rasmussen meant about avoiding overfitting. Rasmussen was saying that we can estimate the hyperparameters of our kernel function by maximizing the marginal log-likelihood, i.e., by taking its derivative with respect to the hyperparameters. Essentially this is an empirical Bayes approach to Gaussian process regression. Rasmussen seems to argue that the resulting hyperparameter estimates will resist overfitting because the marginal log-likelihood contains a built-in penalty term for model complexity.
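For concreteness, the gradient in question is the standard one (stated here from memory as a sketch, with $K_y$ the training covariance and $\theta_j$ a kernel hyperparameter):

$$
\frac{\partial}{\partial \theta_j} \log p(y \mid X, \theta)
= \tfrac{1}{2}\, y^{\top} K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} y
\;-\; \tfrac{1}{2}\, \operatorname{tr}\!\Big( K_y^{-1} \frac{\partial K_y}{\partial \theta_j} \Big).
$$

The trace term is the derivative of the complexity penalty and the quadratic term is the derivative of the data fit, which is how the penalty enters the hyperparameter estimates.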

That being said, I would prefer to use a fully Bayesian implementation of GP regression by putting flat priors on my hyperparameters. Therefore, I have the following questions:

  1. If I put vague flat priors on the hyperparameters of the kernel function, will this help prevent overfitting?
  2. Given that I am new to GP regression, are there any papers or resources you’d recommend that explain how to prevent overfitting with GP regression models?

Thanks!

A vague prior is a very bad thing. I am literally at Columbia this week trying to wrap up a paper about exactly this (more pertinently, @betanalpha has an almost-complete case study on this that will answer all of your questions). A partial answer is here, but for the rest you’ll have to wait a (very short) while.

Yes, vague is the opposite of what I was suggesting. You need a very strong prior that restricts your function so it can’t wiggle around enough to simply pass through all the data.
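For instance, a prior predictive check in Stan makes “strong enough” concrete: simulate functions from the prior alone (run with the fixed_param algorithm) and see whether they can wiggle through arbitrary data. The priors and numbers below are placeholders to be tuned to the scale of your x:

```stan
data {
  int<lower=1> N;
  array[N] real x;        // 1-D inputs for easy plotting
}
generated quantities {
  // placeholder informative priors -- NOT flat
  real rho = inv_gamma_rng(5, 5);       // keeps the length-scale away from 0 and infinity
  real alpha = lognormal_rng(0, 0.5);   // marginal SD of f
  vector[N] f;
  {
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho)
                     + diag_matrix(rep_vector(1e-9, N));  // jitter
    f = multi_normal_cholesky_rng(rep_vector(0, N), cholesky_decompose(K));
  }
}
```

If draws with a tiny length-scale are common, the prior is weak enough to let the function pass through every data point; a flat prior makes exactly those draws routine.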