As the title notes, I assume this is a silly idea since I’ve never seen it implemented, but to set the stage, consider the following Gaussian Process model:
data{
  int<lower=2> n_xy ;
  vector[n_xy] y ;
  array[n_xy] real x ;
}
parameters{
  real grand_mean ;
  real<lower=0> noise_amplitude ;
  real<lower=0> gp_amplitude ;
  real<lower=0> gp_lengthscale ;
  vector[n_xy] gp_helper ;
}
transformed parameters{
  // gp: latent (noiseless) GP
  vector[n_xy] gp = (
    cholesky_decompose(
      add_diag(
        gp_exp_quad_cov(
          x
          , gp_amplitude
          , gp_lengthscale
        )
        , gp_amplitude * 1e-5 // jitter
      )
    )
    * gp_helper
  ) ;
}
model{
  // Priors:
  grand_mean ~ ... ;
  noise_amplitude ~ ... ;
  gp_amplitude ~ ... ;
  gp_lengthscale ~ ... ;
  // gp_helper must be ~ std_normal()
  gp_helper ~ std_normal() ;
  // Likelihood:
  y ~ normal( grand_mean + gp , noise_amplitude ) ;
}
So we have some data y measured at known locations x in some 1D space, and we want to account for both Gaussian measurement noise and a smooth effect of measurement location. The magnitudes of these two sources of variation are represented by the *_amplitude parameters, the smoothness of the latter is represented by the gp_lengthscale parameter, and the possibility of a non-zero grand mean for the measurements is captured by an explicit grand_mean parameter.
While the above computes gp by supplying the amplitude parameter directly to gp_exp_quad_cov(), an equivalent (but probably slightly less efficient) model is:
...
transformed parameters{
  // gp_: latent (noiseless) GP with unit amplitude
  vector[n_xy] gp_ = (
    cholesky_decompose(
      add_diag(
        gp_exp_quad_cov(
          x
          , 1.0
          , gp_lengthscale
        )
        , 1e-5 // jitter
      )
    )
    * gp_helper
  ) ;
  // gp: latent (noiseless) GP
  vector[n_xy] gp = gp_ * gp_amplitude ;
}
...
What becomes clear in this latter formulation is that there’s an implicit assumption that, for gp_amplitude to play its role in scaling the latent vector, gp_ should have a consistent variance, ideally unit variance, since then we get to interpret gp_amplitude as the standard deviation of the latent GP. Presumably the std_normal() prior on gp_helper in the model block is intended to help ensure a consistent variance for gp_, but in my playing the sample variance of gp_ can nonetheless vary quite a bit across draws even when all parameters other than gp_helper are held fixed.
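To illustrate what I mean by “playing”, here’s a minimal simulation-only sketch (the data inputs and the fixed_param setup are my own assumptions, not part of the model above) that repeatedly draws the unit-amplitude latent GP with everything except gp_helper held fixed and records its sample mean and sample sd, which wander noticeably from draw to draw:
// run with the fixed_param sampler; gp_lengthscale is supplied as data and held fixed
data{
  int<lower=2> n_xy ;
  array[n_xy] real x ;
  real<lower=0> gp_lengthscale ;
}
generated quantities{
  // fresh standard-normal helper on each draw
  vector[n_xy] gp_helper = to_vector( normal_rng( rep_vector( 0.0 , n_xy ) , 1.0 ) ) ;
  // unit-amplitude latent GP, constructed as in the second formulation above
  vector[n_xy] gp_ = (
    cholesky_decompose(
      add_diag(
        gp_exp_quad_cov( x , 1.0 , gp_lengthscale )
        , 1e-5 // jitter
      )
    )
    * gp_helper
  ) ;
  // sample moments of gp_; these vary across draws, increasingly so for longer lengthscales
  real gp_mean = mean( gp_ ) ;
  real gp_sd = sd( gp_ ) ;
}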
So it occurs to me then that one can easily enforce a unit variance by:
...
transformed parameters{
  // gp_: latent (noiseless) GP with unit amplitude
  vector[n_xy] gp_ = ...;
  // rescale to enforce unit sample variance
  gp_ = gp_ / sd( gp_ ) ;
  // gp: latent (noiseless) GP
  vector[n_xy] gp = gp_ * gp_amplitude ;
}
And similarly, one could additionally enforce a zero mean via:
...
transformed parameters{
  // gp_: latent (noiseless) GP with unit amplitude
  vector[n_xy] gp_ = ...;
  // center and rescale to enforce zero mean & unit sample variance
  gp_ = ( gp_ - mean( gp_ ) ) / sd( gp_ ) ;
  // gp: latent (noiseless) GP
  vector[n_xy] gp = gp_ * gp_amplitude ;
}
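Putting the fragments together, the full constrained block would look something like this (a sketch only; it assumes the same data, parameters, and model blocks as the first model above, just with gp_exp_quad_cov() called at unit amplitude):
transformed parameters{
  // unit-amplitude latent GP, before standardization
  vector[n_xy] gp_ = (
    cholesky_decompose(
      add_diag(
        gp_exp_quad_cov( x , 1.0 , gp_lengthscale )
        , 1e-5 // jitter
      )
    )
    * gp_helper
  ) ;
  // enforce zero mean & unit sample variance
  gp_ = ( gp_ - mean( gp_ ) ) / sd( gp_ ) ;
  // gp: latent (noiseless) GP at its intended magnitude
  vector[n_xy] gp = gp_ * gp_amplitude ;
}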
With gp_ constrained to a zero-mean, unit-variance vector in this way, I feel like the model is more “straightforward” as a data-generating process: the GP part speaks solely to the wiggliness of the effect of x on y, while gp_amplitude deterministically encodes the magnitude of that effect, and the zero mean keeps it deterministically isolated from other factors in the model (e.g. grand_mean). We lose the ability for the GP to diverge from our explicit model formulation (e.g. I don’t think the constrained version would fit well if there were additionally a linear effect of x; we’d have to add that effect explicitly, as sketched below, whereas the unconstrained version could find regions of the parameter space that implicitly reflect a linear effect of x). But that’s arguably a benefit, in that it forces us (if doing a proper workflow) to detect and add the missing pieces as explicit structure rather than letting the unconstrained GP’s flexibility try to find them.
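For concreteness, explicitly adding such a linear effect might look like this (a hypothetical sketch; the x_slope parameter and its prior are mine, not part of the models above):
parameters{
  // ... as before, plus:
  real x_slope ; // hypothetical explicit linear effect of x
}
model{
  // ... priors and gp_helper ~ std_normal() as before, plus:
  x_slope ~ ... ;
  // Likelihood now includes the explicit linear term:
  y ~ normal( grand_mean + x_slope * to_vector(x) + gp , noise_amplitude ) ;
}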
Not constraining gp_ in this way means that, for example, regions of the parameter space where gp_ has a non-zero mean are less well identified from regions where gp_ has a zero mean but grand_mean is non-zero. Similarly, without the constraints, regions of the parameter space where gp_ happens to have a smaller variance are less well identified from regions where gp_amplitude is smaller.
So why don’t folks do this with GPs? Does the enforcement of zero-mean-&-unit-variance of the latent process induce some sort of pathology? So far in my playing I haven’t encountered anything obvious, but possibly I’m not exploring properly?
Oh! After thinking a bit, I wonder if the constrained parameterization doesn’t eliminate the non-identifiability I described above, but instead shoves it all into gp_? For example, the zero-mean constraint means that whenever some elements of gp_ are high, the others must be low, and likewise for the unit-variance constraint. But I wonder if this is a more appropriate/tractable kind of non-identifiability, since the specific shape of the data should be strongly informative in that regard?
Or maybe the constraining transform on gp_ creates identifiability issues between gp_ and gp_helper?
Tagging some GP-knowledgeable folks that might weigh in to set me straight; @avehtari @rtrangucci @rok_cesnovar @betanalpha @spinkney @Niko