Adding Gaussian Process Covariance Functions


#62

Fortunately, I don’t think we’re spiralling.

@drezap, we (you and I) can work together to do what makes sense. You’re not on the hook (alone) to anticipate the maintenance burden of your decisions now.

I think this answering this question will solve most of the remaining, outstanding points: will you match the definition in R&W? Yes and we’re done. No and we’ll have to decide on sigma, jitter, or (sigma and jitter).


#63

A quick note on jitter: we can have that be double only. That addition is cheap relative to everything else and we don’t need to have it templated.


#64

The place to start then is with a use case. How do you want to call it in Stan?

Sure, all things being equal we’d do that. But what we don’t want to do is design something too complicated to implement and maintain, especially initially.

Like I said to Aki above, examples of how you’d like to do combinations would be helpful.

If we know what this single-kernel use should look like, then I don’t see a problem with building that first.

Also keep in mind that people who want to build more complicated models tend to be able to deal with more complexity—it’s self selecting. So slight burdens for combination wouldn’t be too bad, and arguably better than putting a slight burden on single-kernel usage.

As far as I know, there isn’t even a proposal on the table yet for combining kernels in an interesting way. If you want to think of kernels as functions, it may work out better when we have functional types in a year or so (now I’m being optimistic!).


#65

As Aki said: nugget and jitter are two very different things. If for no other reason than one is random and one is deterministic.

Nugget only exists in spatial stats because the old guard don’t like hierarchical models and focussed strongly in Gaussian observation noise. It has no role in modern life (ie it should be somewhere else in the model rather than the covariance function)


#66

Got it. I was using @betanalpha’s nomenclature without realizing that there was a difference between the two terms. It’s correct on the wiki page and that’s what matters.


#67

Excellent point.

We do – see for example all of the examples in my GP case studies,
https://betanalpha.github.io/assets/case_studies/gp_part1/part1.html
https://betanalpha.github.io/assets/case_studies/gp_part2/part2.html
https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html

These use cases motivated the original design spec on the wiki.

There isn’t a full proposal but there is the immediate decision of “do we pack together scaling hyperparameters and jitter/nuggets which makes single kernels slightly easier to implement” or “do we define kernels independent of these hyperparameters which makes it easier to combine kernels together and hence facilitating whatever kernel algebra library might be introduced in the future”.

When I was writing the GP case studies I was definitely yelled at to refer to “nugget” as the small diagonal contribution to stabilize the calculations even when there isn’t a conjugate Gaussian observation. Regardless I’m happy to go with jitter if it won’t upset the spatial stats people.

Regardless, if the diagonal term is templated out then it can used for jitter (by dumping a constant in) or unknown measurement variability (by dumping a parameter in).


#68

You can safely ignore those spatial people. You’re talking about two completely different things. Jitter is an ML hack to fix the terrible condition number of GP covariance matrices that’s basically unknown in classical spatial stats.

I don’t think templating the jitter is a good idea. Including a nugget discontinuity in the covariance function is bad for two reasons. Firstly it’s combining two modelling terms that don’t always work together (imagine you observe the same realisation of the GP twice in the same place. Then the two observations should have different error. A nugget can’t do this. It’s a notational convenience for explaining why observed Variograms don’t go through zero, rather than a useful construction for Stan.)

The second is that it makes non-centring that model a little more difficult as you’d have to put the marginal sd to one, the nugget to zero, AND add a new gausssian term to replace the nuggets (but in non-centred form)


#69

For background, @drezap is currently at Aalto in my group for 3.5 months. In addition there are approximately 5 other people using Stan for GP research in my group. We will make case studies of different types of flexible use of GPs during the following months and should have a few of them ready before StanCon in Helsinki.

Summary for the questions @drezap asked and @syclik’s comments :

  • 1+2. It seems no-one objects this
gp_exp_quad_cov(x, l, sigma) + gp_exp_quad_cov(x, l, sigma, jitter)

so we go with that now.

  • 3 . We follow @syclik’s suggestion for types.

  • We do nothing yet for distance metrics.

  • @drezap will change the pull request to follow R&W.

I’m committed to get better GP support in Stan. I think I have a quite good understanding of the capabilities and restrictions of current Stan and what we are doing the following months is based on that. I know the Stan roadmap includes some useful features coming in the future (e.g. functional types and Stan 3 sub-models), but it will be helpful to do sophisticated GP models already before these features are available. We might consider at some point functions in similar way as in ode, but we can already do a lot with what @drezap has now implemented.

I agree on this and I’m sorry if my writing is so confusing that it gave some other impression. You and I do disagree on what supporting existing users mean, but I’ll let the forthcoming case studies to speak. I’m looking forward for more research in principled priors for single-kernel, multi-kernel, hierarchical and chained GPs.


#70

Just a quick clarification: I think @syclik’s issue was that I had RW in the docs, but I added an extra parameter (as we do in cov_exp_quad) I believe he just want it updated so that we mention we are using the marginal standard deviation in our implementation. If I change it to match RW, then I’d also have to change the currently cov_exp_quad, and remove the marginal standard deviation, so all the special-case Matern kernels are consistent. Is this what we want? Keep sigma or take sigma out?


#71

We want (x, l, sigma, jitter). In the book R&W do not have function arguments, but in their GPML software they include sigma. In Ch 4 they don’t have magnitude sigmas in equations, but in Ch 5 they have , see e.g. 5.4.3


#72

Thanks for adding the clarification. That’s correct.

Imagine trying to understand the function in the future (and you haven’t been looking at GPs over the last week). I can imagine wanting clarity around: how is sigma parameterized? Is it standard deviation? Is it variance? Is it precision (for computational convenience as used in lots of BUGS parameterization and some exponential family literature)? It says it’s the R&W definition that’s been implemented. If you look there, the relevant terms aren’t in the book definition. (I hope that gives a little motivation for having the doc be correct – I’m not trying to make work.)

Great! Can you or @drezap update the issue to reflect that this is the form that’s desired?


#73

Sigma is in standard deviation scale.

The relevant terms are in the next chapter. I agree that it’s not clear as you have to look from two different places, but it still follows R&W definition. It’s better to document the parameterization explicitly instead of just referring to R&W.


#74

That’s great! If that’s the case, then the doc just needs to be updated to point to the correct chapter.

I’m going to guess you didn’t take a look at the actual pull request in question. If not, see here: https://github.com/stan-dev/math/pull/859/files#diff-bc4fde617b2b49e21d28b9ecd2fe6fb3R22

* See Rausmussen & Williams et al 2006 Chapter 4.


#75

@syclik could this be the standard for the math library? It’s great to reference stuff but at some point you’re going to have to tell the C++ API user what parameterization you’re using and I don’t see why the Stan manual (a different project than the math library to some extent) should be the primary place that happens.

For simple functions that’s not too bad for the special functions there’s some real obscure stuff.


#76

It already is! And has been for a really, really long time. There’s a lot of doxygen documentation that even includes the math and the explicit parameterizations.

When I’m looking at the C++, I look at the raw doxygen doc that goes with the code (not the generated html that’s really useful: mc-stan.org/math). When I’m looking at implementing something in the Stan language, I go to the manual, which is the right place to go. The C++ has different types and may be more general than what’s in the Stan language. For example, the C++ might be templated to allow for mixed vector types just because we were good with templating. That might not be legal in the Stan language, though.


#77

It’s actually one of the reasons that we request doxygen doc on functions, even functions in anonymous namespaces. First off, the way we compile the code, you can’t really hide functions from view (yay C++). Second, we’ve had so much duplicated code because we don’t know what a function does. Or we’ve changed it and it breaks things that work. It really, really helps to even state what’s being done and when it’s ok to use the function (and when it’s not).

Anyway, that’s all off-topic. If you want to continue this discussion, let’s start a new thread.

Back to our normally scheduled program. @drezap: do you feel like you have enough info to move forward on the pull request?


#78

@syclik yep


#79

Hi,

I’m just asking but is the idea that l and sigma are going to be only real or can I define that one or both could be GPs, some functions or known vectors?


#80

l can be a vector with different elements for different columns of x
I guess you didn’t mean this but that l and sigma would be functions of x.

sigma as function of x can be handled by pseudocode

g ~ GP(x,l,sigma=1)
h ~ GP(x,l,sigma)
f = g*exp(h)

l as function of x requires special covariance function to get valid covariane function (covariance between x1 and x2 depend both on l(x1) and l(x2)

Instead of having a GP with l as a function of x, it is possible to get similar behavior by transforming x with a non-linear function, which can be a GP and you have then two-layer GP- Stacking enough layers, you will eventually then have deep-GP.


#81

I kind of meant that too. Basically one could create a vector of sigma values, which could be a function of x (and maybe some other parameters).