Adding Gaussian Process Covariance Functions


#42

The issue here is that the algebraic structure of covariance kernels means that you can define the “basis kernels” in a number of ways, including with or without multiplicative constants which for most of these kernels has the interpretation of a marginal variance or marginal standard deviation, depending on how you parameterize it.

The additive \sigma nominally comes from considering a GP prior and a Gaussian observation model with measurement variability \sigma in which case the kernel of the GP posterior becomes the sum of the prior kernel and the additive term. Consequently people often ignore it when talking about the possible kernels themselves. That said, in practice a small diagonal term is helpful for stabilizing calculations even when the GP is latent in the generative model, so it should definitely be added. In this case the additive term is known as a “nugget”.

I would recommend following the convention established with the exp_quad kernel, where we include a nugget and a marginal standard deviation in addition to the kernel a length scale. In particular, it should be very easy switch between these two kernels without changing the hyperparameters.

Might also want to get @Daniel_Simpson’s opinion.


#43

I have PRs for the matern 3/2 and matern 5/2. exponential I took down because I’m implementing the rev part.

Having an issue though: I’ve looked in the stan/math library and section 5 of the Autodiff paper: it looks like to allocate in stan’s memory arena we can use ChainableStack::memalloc_. In cov_exp_quad someone uses ChainableStack::context().memalloc_, which I could compile, and passed all my tests locally. In the PR for gp_periodic_cov.hpp they use ChainableStack::memalloc_, but both I couldn’t get working on jenkins. How do I make sure my compiler will throw the same errors as on jenkins? That way, I can debug myself.

Also - can someone please verify that the members of the GP library that need be deprecated are:
cov_exp_quad.hpp
multi_gp_cholesky_log.hpp
multi_gp_cholesky_lpdf.hpp
multi_gp_log.hpp
multi_gp_lpdf.hpp


#44

Maybe not worth including in the PR, but I’d thought I’d mention: For the special case of d=1 and \nu = \frac 1 2 case, the inverse of the Gram matrix has a closed form symmetric tri-diagonal structure which can really speed up the likelihood calculations.


#45

@aaronjg interesting! can you send a reference?


#46

I haven’t seen that particular result, but isn’t there something about the inverse covariance matrix of a Gaussian encoding the conditional independencies? I.e. the C^{-1}_{ij}=0 if i and j are conditionally independent. Since Matérn kernels describe Markov processes in 1D (per Simo Sarkkas whole Kalman-filtering-for-GPs line of work), it seems reasonable that they should all be reducible to tri-diagonal matrices if the inputs are ordered, as that would correspond to Markov structure, no? Maybe it’s derivable from Simo’s papers.


#47

@Bonnevie, that’s exactly right. I first saw it in Finley et. al 2007 but I believe the result is originally from Rybicki 1994 or Rybecki & Press 1995. I used this result in my StanCon talk as well.


#48

Is this the result where you can represent the GP with a tridiagonal covariance matrix, the the matrix is extremely poorly conditioned so the \mathcal{O}(N) calculations end up being numerically fragile? @Daniel_Simpson, can you correct me here?


#49

I believe this is correct. I remember Finn comparing this method https://onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9469.2008.00610.x
with the exact Markov representation (given in Rue and Held’s book) and finding that the “aproximation” was much more exact than the exact formula. It should be easy to check.


#50

No, this is a result about the precision matrix.


#51

Isn’t it the inverse that’s tridiagonal? Or am I thinking of something different?


#52

Sure, inverse/precision. The relevant point is that while you have a structured inverse there some subtle numerical instabilities that can cause problems, per Dan’s comment.


#53

The link that @Daniel_Simpson mentioned has a tridiagonal covariance matrix and a dense precision matrix, so this is different from the Rybecki result. The Rybecki result is an exact solution, and not an approximation, and it is the \Sigma^{-1} that matters for the likelihood calculation, so having an exact solution makes things more numerically stable. The fact that it’s tridiagonal is another win, since that takes it from \mathcal O(N^2) to \mathcal O(N)


#54

Nope. Dense covariance (it’s an OU process. Maybe it’s intrinsic limit)


#55

Hi - so @avehtari and I had a conversation about the covariance functions I want to bring up here, so that I can include them on this iteration of implementations:

  1. Marginal standard deviation: Including \sigma as a parameter (gp_exp_quad_cov(x, sigma, l), or just including this in the stan model: sigma * gp_exp_quad_cov(x, l). I know it will be faster in rev to have the parameter included w/ precomputed gradients, thoughts? I’m leaning on including it, @syclik brought this up on the pull request (PR) so I wanted to bring it up here, I followed @betanalpha’s suggestion in the more recent PR.

  2. Jitter: including an overloaded gp_exp_quad_cov(x, l, sigma, jitter) specification. The only issue with this is that for summation of covariance functions:

gp_exp_quad_cov(x, l, sigma) + gp_exp_quad_cov(x, l, sigma, jitter)

users might include more than one. Previously on betancourt’s case study, he used diag_matrix(rep_vector(1e-10, N)); but then we have the issue of summing many 0 elements, so probably best to include the overloaded version.

  1. Should I support mixed eigen vectors and row vectors? i.e.
vector[N1] x1;
row_vector[N2] x2;
matrix[N1, N2] cov = gp_exponential_cov(x1, x2, sigma, l);



This item I won’t get to on this iteration, but warrants discussion:

  1. Different distance metrics: Have an overloaded option that, instead of calculating the squared_distance within the function, we can have a specification gp_function_cov(distances, sigma, l) which includes pre-calculated distances, or we can use a functional approach for users to plug in whatever distance function users want. I’m not sure how the functional approach would affect autodiff.

@Bob_Carpenter


#56

Thanks much for reporting back. It helps to keep all of our development open and get feedback when it’s important. Just keep in mind that the feedback can be totally overwhelming and you’ll have to try to navigate a coherent path through it.

  1. Analytic gradients also save space, which is also important.

I think it’s best to keep things like this indendent because otherwise the cross-product of features can get out of control. You also make a good argument for not including it in each function.

That’s very inefficient. We need an add_diag(matrix a, vector b) function and maybe even a broadcasting version add_diag(matrix a, real b). You could always add one of those to Stan if you need it. Of course, it’d be better if we could control all this with expression templates, but that’s not coming any time soon.

I’d be OK insisting everything is a vector unless there’s a strong argument for supporting row vectors (or arrays).

This is more one for the GP users. Would you use this? The problem is that it blows out into a big Jacobian if the distances can be parameters (they probably won’t be).

That’s what we do with ODEs and the algebraic solver. It’s a bit awkward to code in the parser as we don’t really have functional type inference yet, so it’s handled on a case-by-case basis.

All the functions in Stan are differentiable so everything will compose naturally.


#57

There is unavoidable tension between a design optimized for specifying GPs with a single kernel and a design optimized for mixing and kernels. My opinion is that for the near future our priority should be the former as most users will not be considering adding or multiplying more than one kernel together. Those that are considering it can achieve what they want by setting the marginal standard deviations to unity and the nuggets (not jitter, from what I understand about the accepted nomenclature) to zero which isn’t so much extra work to be onerous.

Similarly I think that a conversation about functionals to accept different distance metrics is way too premature at the moment. Because we don’t have a good sense of how the “GPs in Stan” will develop we don’t know just how much flexibility we’ll need, and a super flexible design just in case will just add unnecessary maintenance burden for the time being.


#58

Sounds like we’re getting somewhere.

My suggestions:

  1. include sigma in the function signature. The reason I brought it up was because I was checking the math against the reference included and it didn’t match. It wasn’t because the term was in there. I’ll still insist that it’s explicitly written that it’s not the Rasmuseun and Williams definition. (That’ll just confuse someone in the future.)
  2. Include the nugget term. @betanalpha: it’s called jitter in the wiki page on “Adding a Gaussian Process Covariance Function.” If we now prefer the term “nugget,” we should update it.
  3. I think we should support the types we support for cov_exp_quad() in the Stan language. That’s std::vector<T>, Eigen::Matrix<T, Eigen::Dynamic, 1>,Eigen::Matrix<T, 1, Eigen::Dynamic>, whereTisdoubleorstan::math::var, but not mixed types. If you're going to restrict it more, I'd drop thestd::vector`.

Thoughts?


#59

I strongly disagree. “Most” users probably will use GPs via rstanarm and brms type interfaces, but to implement useful interesting models we need flexibility and speed.

I disagree on this. First we shouldn’t use nugget as it is rarely used outside spatial statistics. Second nugget refers to independent noise term, which we often add as sigma. It is useful to make the separation between jitter and noise terms, where jitter is added mainly due to the computational reasons, although it can be considered also as the lower bound on noise. The difference comes in some equations where jitter is still included but noise term is removed. Also in non-normal likelihood case jitter is usually that small that it has negligible effect on predictive distribution, but stabilizes necessary computations.

I think it is good to discuss design early, even if decide now to do something simpler.

I have a good idea about the flexibility I want, I don’t know about the maintenance burden, but I know that the implementation cost in coding time is currently big and some future changes will make things easier.

@drezap had made a mistake, and he will fix that so it matches R&W

I very strongly prefer jitter with the reasons mentioned above.


#60

There are a lot of really solid interfaces for Gaussian processes with single kernels, so I’d argue that it would be worth more if Stan played to its strengths, namely its general hierarchical sampling capabilities and its automatic differentiation (although the latter is shared with GPflow), which is an argument for complex parameterized kernels I think.

Something I have been missing is the ability to easily model linear transformations of Gaussian processes, which basically corresponds to linear transformations of the kernels. Derivatives and integrals, in particular, are useful in e.g. probabilistic numerics.


#61

Let me rephrase – I do not think that our immediate priority for Stan development is support for a rich, closed kernel algebra for users who want to build sophisticated Gaussian processes. At the same time I don’t think that a “Gaussian process library” in brms/rstanarm that mimics GAMM functionality will be all that user-friendly given the very careful and bespoke hyperparameter specifications that would be needed to ensure identifiability and reasonable computation.

Instead I believe that our best option right now is to facilitate the use of single-kernel GPs with principled hyperpriors within more complex Stan programs, providing a powerful new tool to model builders. In other words I want to focus our initial support existing users and not an external GP community that’s just looking for a new GP tool.

The latter makes implementing more complex GPs only marginally more challenging (dropping 1s and 0s in some of the functions) and so doesn’t impose a significant burden on those who want to specify more sophisticated composite kernels.

We’re gonna get stuck in another spiral of differing opinions here so I suggest that ultimately @drezap decide how they want to move forwards.