Gaussian process regression

Hello,

I have a GPR model with a squared exponential kernel. The design matrix is 5000 x 14 (5000 observations of 14 continuous variables). Is it possible to fit such a model in Stan, or is it better to use GPstuff?

Maybe @avehtari could help?

Thank you


There's not enough information for a complete answer. In special cases it is possible, e.g. (1) a normal observation model for the target with a single length scale (it might even work with several length scales if done carefully), or (2) an additive model with basis functions; these are even faster with optimization. In the general case, autodiff may require too much memory, or dynamic HMC integrating over the latent values can be very slow.


Are the continuous predictors on a grid at all? If so, see the Kronecker product trick from @avehtari and colleagues here.
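(A rough sketch of the idea, with my own notation rather than the paper's: if the inputs form a full Cartesian grid over the D dimensions and the kernel is a product over dimensions, as the squared exponential is, the covariance matrix factorizes as a Kronecker product

K = K_1 \otimes K_2 \otimes \cdots \otimes K_D,

where K_d is the n_d \times n_d covariance matrix over the d-th axis of the grid. Factorizations then only need to be done on the small K_d, so solves and log-determinants cost on the order of \sum_d n_d^3 instead of (\prod_d n_d)^3.)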


Thank you. I would like to implement (2). Is there an example I can look at? Do I need a GPU?


There is an additive GP with basis functions in the case study Birthdays (there's also a link there to the paper describing the approach in detail).
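Very roughly, the approach approximates each additive GP component with a modest number of basis functions (this is a loose sketch of the construction, not a quote from the case study):

f(x) \approx \sum_{j=1}^{m} \sqrt{S_\theta\!\left(\sqrt{\lambda_j}\right)} \, \phi_j(x) \, \beta_j, \qquad \beta_j \sim \text{Normal}(0, 1),

where the \phi_j and \lambda_j are fixed eigenfunctions and eigenvalues of the Laplacian on an interval and S_\theta is the spectral density of the squared exponential kernel. The latent GP thereby becomes a linear model in the \beta_j, which is what makes it fast.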

No, a GPU is not needed.


Thank you. Is there a way to do something similar to ARD in this case? The problem I have is really explainable AI (XAI). One approach is to use SHAP values to establish the importance of features. I would like to design something Bayesian using a Gaussian process.

We’ve done feature importance estimation using Gaussian processes, see Uncertainty-aware Sensitivity Analysis Using Rényi Divergences (and an older version, Gaussian processes via sensitivity analysis of the posterior predictive distribution). Were you thinking of something like that, or something else?


I guess. I was thinking about having a Gaussian process model for my system and using ARD to determine the important variables. But it seems my problem is too large for a latent Gaussian process, so the second paper might be a good solution. Do you have Stan or R code to share?

Question. Do you scale the data? Let's say the first column of your design matrix ranges from 0 to 1 while the second column ranges from 0 to 100. Do you want to scale all columns to a common scale, e.g. -1 to 1, when calculating distances for the radial basis function kernel?

It depends on the data, but usually I would normalize continuous covariates to have mean = 0 and sd = 1, so that it's easier to define the priors for the lengthscales. Depending on additional domain-specific information, other scalings can be better and can make the covariates easier to compare.
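As a minimal illustration (my own sketch, not from any of the linked material), the standardization can also be done inside Stan's transformed data block so that the lengthscale priors always see covariates with mean 0 and sd 1:

```stan
data {
  int<lower=1> N;          // number of observations, e.g. 5000
  int<lower=1> D;          // number of continuous covariates, e.g. 14
  matrix[N, D] X;          // raw design matrix
}
transformed data {
  matrix[N, D] X_std;      // standardized covariates
  for (d in 1:D)
    X_std[, d] = (X[, d] - mean(X[, d])) / sd(X[, d]);
}
```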

Thank you. The original problem was to find important variables, and my intention was to use ARD for this. However, for 5000 observations and 14 variables with a binomial likelihood, ARD was too slow.
My understanding is that the algorithms in both articles require estimating the posterior predictive distribution (PPD). If I have a squared exponential kernel with different length scales, I gain nothing with respect to computation time. Perhaps if I instead use a radial basis function kernel to estimate the PPD at the training points with Stan, I could reduce the computational burden, and then apply the ideas from the older paper. Right? Or do you have any other suggestions?

ARD is also a poor measure of relevance, as it mostly measures non-linearity; see [1510.04813] Projection predictive model selection for Gaussian processes (published as Projection predictive model selection for Gaussian processes | IEEE Conference Publication | IEEE Xplore).

With that ratio of observations to variables, there is no need for MCMC. Use something like GPy, GPflow, or, in R, CRAN - Package gplite.

Sorry, I don’t understand what you are saying.

If you want to model all interactions, with a binomial observation model, 5000 observations, and 14 variables, use something other than Stan.

If you are happy with an additive model, with a binomial observation model, 5000 observations, and 14 variables, you can also try the basis-function-approximated GP in Stan (it's probably good to use the glm compound functions for a speedup); see the sketch below.
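For concreteness, here is a rough sketch of such an additive basis-function GP with a binomial outcome, adapted from the Hilbert-space construction used in the Birthdays case study. The function name diag_spd_se, the number of basis functions M, the boundary L, and all priors are illustrative assumptions rather than a tested implementation, and it uses plain binomial_logit rather than a glm compound function:

```stan
functions {
  // sqrt of the spectral density of the squared exponential kernel
  // evaluated at the basis function frequencies
  vector diag_spd_se(real alpha, real rho, vector sqrt_lambda) {
    return alpha * sqrt(sqrt(2 * pi()) * rho)
           * exp(-0.25 * square(rho * sqrt_lambda));
  }
}
data {
  int<lower=1> N;                    // observations, e.g. 5000
  int<lower=1> D;                    // covariates, e.g. 14
  matrix[N, D] X;                    // standardized covariates
  array[N] int<lower=0, upper=5> y;  // binomial counts out of 5 trials
  int<lower=1> M;                    // basis functions per covariate, e.g. 30
  real<lower=0> L;                   // boundary, e.g. a multiple of max |X|
}
transformed data {
  vector[M] sqrt_lambda;             // basis function frequencies
  array[D] matrix[N, M] PHI;         // eigenfunctions evaluated per covariate
  for (m in 1:M)
    sqrt_lambda[m] = m * pi() / (2 * L);
  for (d in 1:D)
    for (m in 1:M)
      PHI[d][, m] = sqrt(1 / L) * sin(sqrt_lambda[m] * (X[, d] + L));
}
parameters {
  real intercept;
  vector<lower=0>[D] rho;            // per-covariate lengthscales (ARD-like)
  vector<lower=0>[D] alpha;          // per-covariate magnitudes
  array[D] vector[M] beta;           // basis function weights
}
model {
  vector[N] f = rep_vector(intercept, N);
  for (d in 1:D) {
    // additive component for covariate d as a linear model in beta[d]
    f += PHI[d] * (diag_spd_se(alpha[d], rho[d], sqrt_lambda) .* beta[d]);
    beta[d] ~ std_normal();
  }
  rho ~ lognormal(0, 1);
  alpha ~ normal(0, 1);
  intercept ~ normal(0, 1.5);
  y ~ binomial_logit(5, f);          // latent values on the logit scale
}
```

The per-covariate lengthscales rho give an ARD-like parameterization, with the caveat mentioned earlier in the thread about interpreting lengthscales as relevance.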

Can you please explain why?

Can you please provide a link to the basis function approximated GP?

Based on the information you gave about the model, the conditional posterior of the latent values is well approximated with a normal distribution, so the latent variables can be efficiently integrated out with Laplace, EP, or VI. The remaining marginal posterior of the covariance function parameters is low dimensional and close to normal, and for variable relevance purposes, finding the maximum of that marginal posterior with optimization would be sufficient. The packages I mentioned do this kind of inference efficiently. If you had a more complicated model or much smaller data, then MCMC could produce more accurate inference.
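For reference, the Laplace version of what is described above approximates the log marginal likelihood of the covariance function parameters \theta as (following Chapter 3 of Gaussian Processes for Machine Learning, with notation only sketched here):

\log p(y \mid \theta) \approx \log p(y \mid \hat{f}) - \tfrac{1}{2} \hat{f}^\top K^{-1} \hat{f} - \tfrac{1}{2} \log \left| I + W^{1/2} K W^{1/2} \right|,

where \hat{f} is the mode of p(f \mid y, \theta), K = K(x \mid \theta), and W is the negative Hessian of the log likelihood at \hat{f}. This low-dimensional objective is what the GP packages optimize (or explore) directly.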

Yes. See the earlier message in this thread: Gaussian process regression - #5 by avehtari

The original problem can be posed as a latent variable GP:
y_i \sim \text{Binomial}(5, p_i)
p \sim \text{MultivariateNormal}(0, K(x \mid \theta))
I would like to estimate the posterior of p as well as determine the important variables. There are 5000 observations and the vector x has 14 components.
I'm still struggling with how to approximate the binomial outcome with a normal outcome.

The problem I have can be posed as a latent variable GP:
y_i \sim \text{Binomial}(5, p_i), \quad i = 1, \dots, 5000
p \sim \text{MultivariateNormal}(0, K(x \mid \theta))
I would like to estimate the posterior of p as well as determine the important variables. The vector x has 14 components.
Can you please explain how to integrate out the latent variables with variational inference to get the marginal posterior of the covariance function parameters? Can I use Stan to perform the integration?
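For concreteness, here is a minimal Stan sketch of that latent-variable formulation, with the latent values f placed on the logit scale rather than modelling p directly, per-covariate lengthscales, and illustrative priors and jitter (it assumes the gp_exp_quad_cov signature that takes a vector of lengthscales). It also shows where the cost comes from: every gradient evaluation needs a Cholesky factorization of a 5000 x 5000 matrix.

```stan
data {
  int<lower=1> N;                    // 5000 observations
  int<lower=1> D;                    // 14 covariates
  array[N] vector[D] x;              // covariate vectors
  array[N] int<lower=0, upper=5> y;  // binomial counts out of 5 trials
}
parameters {
  real<lower=0> alpha;               // marginal standard deviation
  vector<lower=0>[D] rho;            // one lengthscale per covariate (ARD)
  vector[N] eta;                     // standardized latent values
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);
    for (n in 1:N)
      K[n, n] += 1e-8;               // jitter for numerical stability
    f = cholesky_decompose(K) * eta; // non-centered: f ~ MVN(0, K)
  }
}
model {
  alpha ~ normal(0, 1);
  rho ~ lognormal(0, 1);
  eta ~ std_normal();
  y ~ binomial_logit(5, f);          // latent values on the logit scale
}
```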

This is what I expected. This is the very basic form of a Gaussian process, and there are many software packages that can handle it much faster than Stan.

See Chapter 3 of Gaussian Processes for Machine Learning and, e.g., Scalable Gaussian process inference using variational methods.

There are many GP specific software packages implementing these.

You can currently use Stan MCMC to perform the integration, but it is much slower than GP-specific software (in the future it will be less slow, but still slower than GP-specific software).

There is also a branch of Stan that implements an embedded Laplace approximation to integrate over the latent values, see Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond, but that may require some work to get running, and since your model is this simple, Stan doesn't provide any additional benefit over faster GP-specific software.

When you refer to “many software packages” doing faster GP inference, are you talking, for example, about INLA, or do you have other software in mind?


INLA is good for Gaussian Markov random field priors, but not for Gaussian process priors beyond 2 dimensions. I did list these packages (GPy, GPflow, gplite) above, but there are several others, too.
