Scalable Gaussian process inference

We (JP Onnela and I) have been working on a library to scale up Gaussian process inference in Stan using two methods:

  1. Sparse approximations using directed acyclic graphs to encode which dependencies to retain. The method is quite flexible, e.g., nearest-neighbor GPs are a special case of this approximation.
  2. Fourier methods to evaluate the likelihood exactly if the observation points are on a regular grid (in 1D and 2D), e.g., for regularly sampled time series.
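To make the first method concrete, here is a minimal numpy sketch (illustrative only, not the gptools implementation) of the DAG-structured likelihood factorization: each observation is conditioned only on its retained predecessors, and nearest-neighbor GPs correspond to choosing the nearest predecessors as parents.

```python
import numpy as np

def vecchia_log_likelihood(y, x, kernel, parents):
    """Approximate GP log likelihood log p(y) ~ sum_i log p(y_i | y_parents(i)),
    where the DAG (the `parents` lists) encodes which dependencies to retain."""
    total = 0.0
    for i in range(len(y)):
        idx = parents[i]  # retained predecessors of node i
        k_ii = kernel(x[i], x[i])
        if len(idx) == 0:
            mean, var = 0.0, k_ii
        else:
            K_pp = kernel(x[idx][:, None], x[idx][None, :])
            k_ip = kernel(x[i], x[idx])
            w = np.linalg.solve(K_pp, k_ip)
            mean = w @ y[idx]        # conditional mean given the parents
            var = k_ii - w @ k_ip    # conditional variance given the parents
        total += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return total

# Squared-exponential kernel with unit length scale and amplitude.
kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2)

# Nearest-neighbor parents: retain at most the two nearest predecessors.
x = np.array([0.0, 0.7, 1.5, 2.2, 3.1, 4.0])
y = np.array([0.3, -0.1, 0.5, 0.2, -0.4, 0.1])
parents = [np.arange(max(0, i - 2), i) for i in range(len(x))]
approx = vecchia_log_likelihood(y, x, kernel, parents)
```

Retaining all predecessors recovers the exact multivariate normal likelihood; pruning parents trades a little accuracy for the reduced cost of many small linear solves.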

Both methods are available as “centered” (evaluating or approximating the likelihood of a realization directly) and “non-centered” (transforming white noise to a realization) parameterizations for performant inference in “strong” and “weak” data regimes, respectively.
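The distinction between the two parameterizations can be illustrated with a small numpy sketch (illustrative only, using a dense Cholesky factor rather than the library's sparse or Fourier transforms):

```python
import numpy as np

x = np.linspace(0, 1, 5)
# Squared-exponential covariance with a small jitter for numerical stability.
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3 ** 2) + 1e-9 * np.eye(5)

# Centered: the realization f itself is the parameter and the model evaluates
# log p(f) under MultivariateNormal(0, K) directly.

# Non-centered: white noise z is the parameter and f = L @ z is a deterministic
# transform, where L is a Cholesky factor of K, so Cov(f) = L L^T = K.
L = np.linalg.cholesky(K)
z = np.random.default_rng(0).standard_normal(5)  # white-noise parameter
f = L @ z  # GP realization with covariance K
```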

Extensive documentation is available at https://gptools-stan.readthedocs.io, the source code can be found at https://github.com/onnela-lab/gptools (Gaussian processes on graphs and lattices in Stan and pytorch), and an accompanying preprint with details and examples is available at https://arxiv.org/abs/2301.08836 (Scalable Gaussian Process Inference with Stan).

We’d love to get feedback on the library and preprint from members of the community.

Some highlights (or at least things we think are interesting):

  • Typical runtime for 10k observations is ~15 seconds on my laptop using Fourier methods (without inferring the length scale and amplitude of the kernel). But this of course depends on the details of the model.
  • We discuss the importance of choosing parameterizations in some depth. For sampling-based inference, this is (primarily) a performance concern. But for variational inference, picking an appropriate parameterization is even more important because it affects the quality of the approximation (see figure 2).
  • We provide some guidance for practitioners on which method (standard GP inference with a full covariance matrix, sparse approximation, or Fourier methods) and parameterization (centered or non-centered) to choose (see figure 5).
  • Because GPs with squared-exponential kernels are very smooth, their Fourier coefficients at high frequencies are small. By using non-centered Fourier methods and discarding high-frequency terms, we can often reduce the number of parameters substantially without significantly affecting the GP (see figure 1). This is effectively just a low-pass filter on the process.
  • The library is pip-installable as pip install gptools-stan for use with cmdstanpy. We’re keen to also publish a version to CRAN for use with cmdstanr but my R-skills are ≈ 0.
  • The library implements real FFTs in one and two dimensions to remove redundant complex-conjugate Fourier coefficient pairs (although it uses the full FFT under the hood rather than a more performant dedicated implementation).
  • A pytorch version is also available for use with pyro in the same repository.
  • As part of developing the documentation, we also published sphinx-stan, a Sphinx extension to automatically document Stan code. It supports inline function definitions, loading documentation from .stan files, and cross-referencing, including the handling of overloaded functions.
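The high-frequency truncation mentioned in the highlights can be sketched with numpy (illustrative spectral density, not the gptools code): because the squared-exponential spectrum decays very rapidly, zeroing every coefficient above a cutoff barely changes the realization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, ell = 256, 0.05
k = np.arange(n // 2 + 1)  # non-negative frequencies of the real FFT
# The squared-exponential spectrum decays like exp(-(2 pi k ell)^2 / 2), so
# high-frequency coefficients of a smooth GP are negligible.
scale = np.exp(-0.5 * (2 * np.pi * k * ell) ** 2)
coef = scale * (rng.standard_normal(k.size) + 1j * rng.standard_normal(k.size))
f = np.fft.irfft(coef, n)  # smooth periodic GP sample on the grid

# Low-pass filter: discard every coefficient above a cutoff frequency.
cutoff = 24
truncated = coef.copy()
truncated[cutoff:] = 0
f_truncated = np.fft.irfft(truncated, n)
# f and f_truncated are nearly identical, but only `cutoff` complex
# coefficients remain as parameters instead of n // 2 + 1.
```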

Thank you to @WardBrian for help with Fourier transforms, include directives, and general advice on cmdstanpy.

16 Likes

Super cool. And thanks for linking paper and doc with extensive practical guidance. 10k observations in 15s is amazing. Usually inferring an extra length scale isn’t too much of a computational burden as long as it’s not in a hierarchical setting where the geometry would be highly variable.

I hadn’t seen the high-frequency pruning trick before. As you note, we only have the full complex FFT implemented in Stan (we’re building on top of Eigen). Would it be worth our implementing a specifically real-valued FFT?

I’m tagging @pgree, who’s been working on similar ideas. And @mitzimorris, who is maintaining CmdStanPy (along with @WardBrian) and also working on scalable spatial models.

Having a real-valued FFT in Stan would be nice. I imagine the forward pass would not become much quicker, but the gradient evaluation would likely benefit from a native rFFT because assembling the coefficients produces a nasty set of chain rules (see here for the 1D version and here for the 2D version).
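For reference, the redundancy in question is the conjugate symmetry of the FFT of a real signal, which numpy's rfft exploits; a quick check:

```python
import numpy as np

x = np.random.default_rng(1).standard_normal(8)
full = np.fft.fft(x)   # n complex coefficients
half = np.fft.rfft(x)  # only the n // 2 + 1 non-redundant ones

# For real input, X[k] = conj(X[n - k]): the second half of the spectrum
# carries no extra information, which is what a real FFT removes.
assert np.allclose(full[1:], np.conj(full[:0:-1]))
assert np.allclose(half, full[: len(half)])
```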

Do you know whether there are native functions for evaluating complex conjugates? I’ve currently implemented them here.

Yes, the math library calls that conj (see the Complex special functions section of the Stan Functions Reference).

1 Like

I quickly read through the paper but did not dig into the code. Is there a way to predict new data with functions like gp_*_cov_rfft2_rng?

Cool work!

It would be nice to also have guidance on which method to favor in comparison to “Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming” (Statistics and Computing), which is available in brms (although I’m a co-author of that HSGP paper, I’m also happy if you can show that the methods you implemented are always favorable).

Give my greetings to Jukka-Pekka!

1 Like

Hi @tillahoffmann, cool paper. As @Bob_Carpenter mentioned, I’ve been working on some related stuff. A couple of collaborators and I just put out a paper on using Fourier representations for GP regression — “Equispaced Fourier representations for efficient Gaussian process regression from a billion data points” (https://arxiv.org/abs/2210.10210). We generalize to non-equispaced data in 1, 2, and 3 dimensions using the non-uniform FFT and have thus far implemented the algorithms in MATLAB.

We haven’t yet worked on any integration with Stan, but I think that would be a great future direction. It might also not require too much on top of all that you’ve already done. I think the main obstacle would be that there’s no non-uniform FFT in Stan. For our project we used Flatiron’s implementation, the Flatiron Institute Nonuniform Fast Fourier Transform (FINUFFT). They do have a C++ version, but I’m not sure what would be required to get that into Stan.

I have not implemented a gp_*_cov_rfft2_rng method yet. But one could probably use normal_rng together with the gp_inv_rfft2 transform to generate samples (see here for details).
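As a rough sketch of that approach (illustrative numpy, not the gptools gp_inv_rfft2 function, and ignoring the special handling the zero and Nyquist frequencies need for an exact parameterization): scale complex white noise by the square root of the spectral density and invert the real FFT.

```python
import numpy as np

def sample_grid_gp(spectral_density, shape, rng):
    """Draw an (approximate) periodic GP sample on a 2D grid from white noise."""
    ny, nx = shape
    fy = np.fft.fftfreq(ny)[:, None]   # frequencies along the first axis
    fx = np.fft.rfftfreq(nx)[None, :]  # non-negative frequencies along the second
    scale = np.sqrt(spectral_density(fy, fx))
    noise = rng.standard_normal(scale.shape) + 1j * rng.standard_normal(scale.shape)
    return np.fft.irfft2(scale * noise, s=shape)

# Illustrative squared-exponential-like spectral density.
density = lambda fy, fx: np.exp(-0.5 * (fy ** 2 + fx ** 2) / 0.05 ** 2)
sample = sample_grid_gp(density, (32, 48), np.random.default_rng(3))
```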

Are you thinking of filling in missing values (e.g., interpolation), forecasting (e.g., in a time series setting), or a different problem? I can add something to the list of examples if of interest.

It looks like the main difference is the choice of boundary condition for the basis functions (Dirichlet in the linked paper and periodic with the FFT). But I need to read the Hilbert space approximation paper more carefully.

Will do!

Ah, I missed this one. I wasn’t aware of the non-equispaced fast Fourier transform and will have a look into your paper.

Getting the NUFFT into Eigen and then wrapping in Stan might be an option. But that is probably a more involved long-term project.

Interesting. Are you interested in making things even easier for the user by automating the centering / non-centering?

Hm, good question. It seems tricky at first sight, because determining the right parameterization probably requires already knowing something about the posterior geometry. But there is the publication “Automatic Reparameterisation of Probabilistic Programs”, which sounds promising. cc @mgorinova and @matthewdhoffman
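The idea in that paper (the "variationally inferred parameterization") interpolates between the two parameterizations with a weight lambda that can itself be optimized; a minimal sketch, assuming a scalar Gaussian parameter:

```python
def partially_centered(z, mu, sigma, lam):
    """Partial centering in the style of Gorinova et al.: lam = 1 is the
    centered parameterization, lam = 0 the non-centered one. The sampled
    parameter is tilde ~ Normal(lam * mu, sigma ** lam), written here in
    terms of standard normal noise z."""
    tilde = lam * mu + sigma ** lam * z
    # Recover the model parameter; algebraically this equals mu + sigma * z
    # for every lam, so all parameterizations define the same distribution.
    return mu + sigma ** (1 - lam) * (tilde - lam * mu)

theta_centered = partially_centered(0.7, 2.0, 1.5, 1.0)     # lam = 1
theta_noncentered = partially_centered(0.7, 2.0, 1.5, 0.0)  # lam = 0
```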

My understanding is that FINUFFT uses FFTW, which is only freely available under a GPL license. This would preclude it from being directly included in Stan. I think supporting other FFT engines as backends is an existing but low-priority wishlist item for them.
Alternatively, one could consider adding support for it but requiring the end user to supply the installation of FFTW, which would mean only Stan programs compiled with those features would need to follow FFTW’s terms. This gets tricky, though.

Interesting bit from the FFTW FAQ:

Question 1.4. What is this about non-free licenses?
The non-free licenses are for companies that wish to use FFTW in their products but are unwilling to release their software under the GPL (which would require them to release source code and allow free redistribution). Such users can purchase an unlimited-use license from MIT. Contact us for more details.
We could instead have released FFTW under the LGPL, or even disallowed non-Free usage. Suffice it to say, however, that MIT owns the copyright to FFTW and they only let us GPL it because we convinced them that it would neither affect their licensing revenue nor irritate existing licensees.

Yep, that’s how things like MATLAB are able to use FFTW without being open source themselves. My guess is that wouldn’t be possible for Stan, but I can talk to some of the FINUFFT developers about what their thoughts are.

Definitely yes! This is my focus: spatial interpolation of missing values. For example: with a network of sensors, identify the faulty sensors and reconstruct the expected value that would be measured at each of those points as a function of the other measurements.

If you do explicit inference on the latent process (rather than marginalizing over it), you can model both the missing and non-missing locations and then simply index into the non-missing ones for the likelihood. But you can also achieve this in generated quantities; both gptools and Aki’s HSGP demo code show how to do the latter.
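A minimal numpy sketch of the conditioning step (illustrative, not the gptools API), assuming a noiseless latent GP and a boolean mask marking the working sensors:

```python
import numpy as np

# Squared-exponential kernel with length scale 0.5.
kernel = lambda a, b: np.exp(-2.0 * (a[:, None] - b[None, :]) ** 2)

x = np.linspace(0, 4, 9)  # sensor locations on a line
observed = np.array([True, True, False, True, True, False, True, True, True])
y_obs = np.sin(x[observed])  # measurements from the working sensors

# Condition the GP on the observed sensors and predict the missing ones:
# posterior mean = K_mo K_oo^{-1} y_obs.
K_oo = kernel(x[observed], x[observed]) + 1e-9 * np.eye(observed.sum())
K_mo = kernel(x[~observed], x[observed])
y_missing = K_mo @ np.linalg.solve(K_oo, y_obs)  # close to sin at x = 1.0, 2.5
```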

1 Like

Yes, that’s definitely possible. In this example, we fit a latent GP to count data in different quadrats. We used 80% of the quadrats for fitting and then predicted the density at the remaining 20% of held-out quadrats.

@mike-lawrence, I just remembered that this example also includes a bit on padding, as we discussed in issue #1 (“Recommendations for padding?”) on the gptools GitHub repository (see the figure with the posterior mean and posterior variance of the latent Gaussian process).

1 Like

Thanks @tillahoffmann !

@tillahoffmann Not to get off topic, but looking at your docs I noticed you had really nice integration between Sphinx and Stan functions and found your project sphinx-stan (https://github.com/tillahoffmann/sphinx-stan), a Sphinx domain for Stan, the probabilistic programming language.

This is quite the buried lede! Are you planning on making another thread specifically about that project?

4 Likes

I hadn’t planned on it, but happy to if it’s useful for the community?

1 Like

I’ll be digging into this a bit more over time - just getting started with Stan. But the stats foundations look good to me at first glance. I’ve been working with the sparse general Vecchia approximation (Katzfuss, Matthias, and Joseph Guinness. “A general framework for Vecchia approximations of Gaussian processes.” Statistical Science 36.1 (2021): 124-141.), which is arguably a more general form of the NNGP and predictive processes. I’ll have to see how this variational GP fits in.