Applied Gaussian Processes in Stan, Part 1. A Case Study

Hey all - More specifically, hey GP people -

Writing up some stuff I’ve done over the summer at Aalto (although a bit late…). Aiming for an applied, pedagogical intro to applied Gaussian Processes in Stan, using a variety of open source data sets.

This is Part 1: Intro to GP notation, motivation behind GPs, some properties of kernels, GP regression and ARD.

Some things I could use help with:

  1. Hyperparameter priors: I could at least use a length scale that uses the scale of the data, as in Betancourt’s case study.

  2. Out of sample: Using Loo package instead of a naive RMSE.

  3. I was trying to show the smoothness changes in different cases of Matern kernel, but the sample size to high/vizualizations too small to show the “wiggliness” of the kernels.

Part 2 is coming. It has some different models including GP classificaiton, GP time series (modeling directly and using a GP prior in GARCH), and then a survival model. I split it up because the case study was becoming too long.

applied_gaussian_processes_in_stan.pdf (419.7 KB)


I noticed that you are using some functions, such as add_diag and gp_exp_quad_cov, that do not appear to be defined in the functions block and also do not appear in the documentation for Stan. Are these undocumented features in the latest CmdStan release, or are they from a development version?

1 Like

good question.

I think add_daig is in 2.18 math library, but it hasn’t been added to the language yet. There are covariance functions such as the matern32 and matern52 that I’m using locally, and still pushing through PRs to get them merged into the Stan library. Good call on the docs, I need to update the stan docs to show that at least add_diag has been included. Besides these technical issues, do you have any comments about modeling issues or the pedagogy of the doc in general? I know it’s basic, but I’m trying to show users how to build up from nothing!

1 Like

I am curious about how identifiable the length scales in ARD are. I’ve had issues with that myself in dealing with GPs, and I’m curious as to how much of that is at my end and how much is just inherent to GP regression.

yeah, that warrants another case study. I could trivially make the real<lower=0>length_scale[D] an ordered vector, but then we would be assuming that the length_scale1 < length_scale2 < length_scale3 < … and I don’t think this is a good assumption. I could run a model a bunch of times w/ different random seeds and see where they tend to fall, and re-order the dataset in that way but it’s a hack… I’ll put some actually though into it, thank you.

Really fantastic work @anon79882417 I took in a lot of lessons from your methodology and presentation.

That is a good question. I find there is often an interplay between the length scales and the variance, so that it may not make sense to treat the two as independent hyper-parameters.

Hey @anon79882417, thanks for putting all these together. By any chance, have you prepared the part related to classification? I am trying to understand the function gp_pred_rng(…) and I thought asking your help/write-up. Thanks

@anon79882417 Yes I am a real person with a real question. Considering that the likelihood is not Guassian in the GP classification, I am trying to understand how did you come up with the posterior predictive for the test data in function gp_pred_rng(…)? I am basically referring to this post that you have for classification.

The integral were evaluation for a non Gaussian likelihood is this:

\int \pi(y|f, X, \theta) \pi(f | X, \theta) \pi(x|\theta)

In Gaussian likelihood case the \pi(f | •) gets marginalized out, but here it does not. Instead you generate latent f from:

  1. Computer covariance function K
  2. Compute cholesky decomp K=LL’
  3. Then use L to sample from MV Gaussian and f is a realization from that MV Gaussian so: f = \rho * L, \rho \sim N(0, 1)
  4. And then use that f in your likelihood which I depends on the model you specify. f could be a prior for a parameter in a linear model, you could use it directly in the logistic or multi nominal likelihood, etc.

A good reference is GPML, and I think most of this is in the Classification chapter except step 4.

The equations for ‘gp_pred_rng’ are in the document, I believe on the first page. Keywords are “posterior predictive”

1 Like

Yes, apologies. Please let me know if you have any other questions, and feel free to open another thread with code if you have questions on implementation details. There are better references than that document.


See Section 3.3 in for an example of how to implement a non-Gaussian observational model. If you replace the Poisson distributions (for integer-valued observations) with Bernoulli distributions (for binary-valued observations) then you get a Gaussian process model amenable to classification. In other words you end up modeling the logit probability with the Gaussian process.

1 Like

Hi! Cool stuff, thanks! Did you ever find the time to write that second part? :)