Sharing a to-me-neat parameterization for Gaussian Processes

Nothing spectacular here, but finally got around to implementing an idea I had and wanted to share in case others find it useful. When doing GPs where you don’t “integrate-out” the latent GP, you’re left with two parameters that compete: the GP-amplitude parameter and the parameter encoding residual measurement noise. So it occurred to me that it might make sense to actually incorporate this trade-off explicitly with a parameter that encodes the proportion of variance observed in the data that is attributable to noise, the inverse of which then determines the GP amplitude. Attached to this post is an implementation of this, along with R code to play and get pretty visuals like this:

One thing I’ll note is that it’s possibly questionable that I technically do not do any inference on the magnitude of variation expected from the sum of the GP and measurement noise, effectively encoding a belief that the observed variance perfectly estimates this quantity. I’m not sure what aspects of inference this compromises however; if anyone has thoughts on this, I’d appreciate the feedback!

pvgp_demo.r (2.1 KB) pvgp_generate.stan (1.6 KB) pvgp_sample.stan (3.9 KB)


I think what you describe is a good reparametrization technique - I’ve used it in a similar case (but with a Kalman filter) where I’ve also had trouble that there are two sources of variance that are hard to distinguish. I’ve worked with a parametrization that seems very similar to yours, but I explicitly modelled both the total variance and the proportion as (code not tested):

parameters {
   real<lower=0> sd_total;
   real<lower=0, upper=1> prop_sd1;

transformed parameters {
   real<lower=0> sd1;
   real<lower=0> sd2;
       real total_variance = square(sd_total);
       sd1 = sqrt(prop_sd1 * total_variance);
       sd2 = sqrt((1 - prop_sd1) * total_variance);

Which I think underlines how your parametrization is going to alter inference: you are making an additional assumption that sd_total (in my notation above) is always 1 (after your scaling to sd of the data). That does not sound as necessarily reasonable.

However, allowing sd_total to vary in your model leads to correlation between logit_pvn and sd_total and some divergences so I am actually not sure how to parametrize such a model well without making an assumption about the total variance.

Does that make sense?


Another version of this trick is encountered in @mitzimorris’s amazing spatial models tutorial, where the BYM2 model incorporates the combined spatial and non-spatial variances as a single parameter, with a separate parameter to allocate the variance components to the spatial and nonspatial random effects.

I’ve encountered models with fairly vanilla-looking nested random effects where this parameterization also seems to help (can even define a simplex of proportions to allocate variance among multiple nested levels of a random effect). Sometimes these models are even nicely interpretable: for example if area is nested in region and species is nested in genus, then we can speak of “spatial” and “taxonomic” variances that are decomposed into their constituent levels.