As the title notes, I assume this is a silly idea since I’ve never seen it implemented, but to set the stage, consider the following Gaussian Process model:

```
data{
  int<lower=2> n_xy ;
  vector[n_xy] y ;
  array[n_xy] real x ;
}
parameters{
  real grand_mean ;
  real<lower=0> noise_amplitude ;
  real<lower=0> gp_amplitude ;
  real<lower=0> gp_lengthscale ;
  vector[n_xy] gp_helper ;
}
transformed parameters{
  // gp: latent (noiseless) GP
  vector[n_xy] gp = (
    cholesky_decompose(
      add_diag(
        gp_exp_quad_cov(
          x
          , gp_amplitude
          , gp_lengthscale
        )
        , gp_amplitude * 1e-5 //jitter
      )
    )
    * gp_helper
  ) ;
}
model{
  // Priors:
  grand_mean ~ ... ;
  noise_amplitude ~ ... ;
  gp_amplitude ~ ... ;
  gp_lengthscale ~ ... ;
  // gp_helper must be ~ std_normal()
  gp_helper ~ std_normal() ;
  // Likelihood:
  y ~ normal( grand_mean + gp , noise_amplitude ) ;
}
```

So we have some data `y` measured at known locations on some 1D space `x`, and we want to account for both Gaussian measurement noise and a smooth effect of measurement location, with the magnitudes of these two factors represented by the `*_amplitude` parameters and the smoothness of the latter represented by the `gp_lengthscale` parameter. The possibility of a non-zero grand mean for the measurements is captured by an explicit `grand_mean` parameter.
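To make the generative story concrete, here is a quick NumPy sketch (not Stan; the `exp_quad_cov` helper is just my stand-in for `gp_exp_quad_cov()`) showing that multiplying the Cholesky factor of the covariance by a standard-normal vector yields a draw from the GP:

```python
import numpy as np

rng = np.random.default_rng(0)

def exp_quad_cov(x, amplitude, lengthscale):
    # My stand-in for Stan's gp_exp_quad_cov(): squared-exponential kernel
    d = x[:, None] - x[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / lengthscale) ** 2)

n_xy = 50
x = np.linspace(0, 1, n_xy)
gp_amplitude, gp_lengthscale = 2.0, 0.2
K = exp_quad_cov(x, gp_amplitude, gp_lengthscale)
L = np.linalg.cholesky(K + gp_amplitude * 1e-5 * np.eye(n_xy))  # jitter, as in the model

# Non-centered parameterization: gp_helper ~ std_normal() implies gp ~ MVN(0, K)
gp_helper = rng.standard_normal(n_xy)
gp = L @ gp_helper

# Empirical check: the covariance across many such draws approaches K
draws = L @ rng.standard_normal((n_xy, 100_000))
emp_cov = np.cov(draws)
print(np.abs(emp_cov - K).max())  # small relative to K's diagonal of 4.0
```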

While the above computes `gp` using `gp_exp_quad_cov()` with the amplitude parameter supplied directly, an equivalent (but probably slightly less efficient) model is:

```
...
transformed parameters{
  // gp_: latent (noiseless) GP with unit amplitude
  vector[n_xy] gp_ = (
    cholesky_decompose(
      add_diag(
        gp_exp_quad_cov(
          x
          , 1.0
          , gp_lengthscale
        )
        , 1e-5 //jitter
      )
    )
    * gp_helper
  ) ;
  // gp: latent (noiseless) GP
  vector[n_xy] gp = gp_ * gp_amplitude ;
}
...
```
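The equivalence (up to the slightly different jitter term, which should be negligible) follows from the identity `cholesky(a^2 * K) = a * cholesky(K)`, which is easy to check numerically (NumPy sketch; helper names are mine):

```python
import numpy as np

def exp_quad_cov(x, amplitude, lengthscale):
    # Stand-in for Stan's gp_exp_quad_cov()
    d = x[:, None] - x[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / lengthscale) ** 2)

x = np.linspace(0, 1, 30)
amp, ls = 1.7, 0.3
K_unit = exp_quad_cov(x, 1.0, ls) + 1e-5 * np.eye(30)

L_unit = np.linalg.cholesky(K_unit)             # second formulation (unit amplitude)
L_scaled = np.linalg.cholesky(amp**2 * K_unit)  # first formulation (amplitude inside)

# The two Cholesky factors differ only by the scalar amplitude
print(np.abs(L_scaled - amp * L_unit).max())  # ~ machine precision
```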

What becomes clear in this latter formulation is that there’s an assumption of sorts that, for `gp_amplitude` to play its role in scaling the latent vector, `gp_` is expected to have a consistent variance, ideally unit variance, as we then get to interpret `gp_amplitude` as the standard deviation of the latent GP. And presumably the `std_normal()` prior on `gp_helper` in the model block is intended to contribute to ensuring a consistent variance for `gp_`, but in my experimentation `gp_` can nonetheless have a rather variable sample variance even when all parameters other than `gp_helper` are fixed.
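To illustrate, here is a NumPy sketch (helper names are mine) showing that the sample SD of a single unit-amplitude draw varies from draw to draw even with the lengthscale held fixed, and much more so (relative to its mean) at longer lengthscales:

```python
import numpy as np

def exp_quad_cov(x, amplitude, lengthscale):
    # Stand-in for Stan's gp_exp_quad_cov()
    d = x[:, None] - x[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_sds(lengthscale, n=50, n_draws=2000, seed=1):
    # Sample SD of each of n_draws unit-amplitude GP realizations
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, n)
    L = np.linalg.cholesky(exp_quad_cov(x, 1.0, lengthscale) + 1e-5 * np.eye(n))
    return np.std(L @ rng.standard_normal((n, n_draws)), axis=0)

short = sample_sds(0.05)  # wiggly draws: per-draw sample SD stays near 1
long_ = sample_sds(0.5)   # smooth draws: per-draw sample SD is smaller and much more variable
print(short.mean(), short.std())
print(long_.mean(), long_.std())
```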

So it occurs to me that one can easily enforce a unit sample variance by:

```
...
transformed parameters{
  // gp_: latent (noiseless) GP with unit amplitude
  vector[n_xy] gp_ = ...;
  gp_ = gp_ / sd(gp_) ;
  // gp: latent (noiseless) GP
  vector[n_xy] gp = gp_ * gp_amplitude ;
}
```

And similarly, one could additionally enforce a zero mean via:

```
...
transformed parameters{
  // gp_: latent (noiseless) GP with unit amplitude
  vector[n_xy] gp_ = ...;
  gp_ = ( gp_ - mean(gp_) ) / sd(gp_) ;
  // gp: latent (noiseless) GP
  vector[n_xy] gp = gp_ * gp_amplitude ;
}
```
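As a sanity check, here is the same standardization in NumPy (just the transform, not the full model; `gp_raw` is a stand-in for the Cholesky-factor-times-helper product):

```python
import numpy as np

rng = np.random.default_rng(2)

# gp_raw stands in for cholesky_decompose(...) * gp_helper; any vector works here
gp_raw = 3.0 * rng.standard_normal(50) + 1.5

# The proposed constraint (Stan's sd() uses the n-1 denominator, hence ddof=1)
gp_ = (gp_raw - gp_raw.mean()) / gp_raw.std(ddof=1)

gp_amplitude = 0.7
gp = gp_ * gp_amplitude

# mean(gp_) is now exactly 0 and sd(gp) exactly equals gp_amplitude
print(gp_.mean(), gp_.std(ddof=1), np.std(gp, ddof=1))
```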

With `gp_` constrained to a zero-mean, unit-variance vector in this way, I feel like the model is more “straightforward” as a data-generating process: the GP part speaks solely to the wiggliness of the effect of `x` on `y`, while `gp_amplitude` deterministically encodes the magnitude of this effect, and the zero mean deterministically isolates it from other factors in the model (ex. `grand_mean`). We lose the ability of the GP to diverge from our explicit model formulation (ex. I don’t think the constrained version would fit well if there were additionally a linear effect of `x`; we’d have to add that effect explicitly, whereas the unconstrained version could find regions of the parameter space that implicitly reflect a linear effect of `x`), but that’s arguably a benefit in forcing us (if doing a proper workflow) to detect and add the missing pieces as explicit structure rather than letting the unconstrained GP’s flexibility try to find them.

Not constraining in this way means that, for example, areas of the parameter space where `gp_` has a non-zero mean are less well identified from areas where `gp_` has a zero mean but `grand_mean` is non-zero. Similarly, without the constraints, areas of the parameter space where `gp_` has a lower sample variance are less well identified from areas where `gp_amplitude` is smaller.

So why don’t folks do this with GPs? Does enforcing zero mean and unit variance on the latent process induce some sort of pathology? So far in my experimentation I haven’t encountered anything obvious, but possibly I’m not exploring properly?

Oh! After thinking a bit, I wonder if the constrained parameterization doesn’t eliminate the non-identifiability as I conveyed above, but instead shoves it all into `gp_`? For example, the zero-mean constraint means that whenever some elements of `gp_` are high, the others must be low, and ditto for the unit-variance constraint. But I wonder if this is a more appropriate/tractable kind of non-identifiability, as the specific shape of the data should inform strongly in that regard?
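At least the centering half of that intuition is easy to verify numerically: centering each draw of an iid vector induces correlations of about -1/n among its elements (NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 5
z = rng.standard_normal((200_000, n))   # iid draws, like gp_helper
c = z - z.mean(axis=1, keepdims=True)   # per-draw zero-mean constraint
C = np.cov(c, rowvar=False)
print(C.round(2))
# diagonal ~ 1 - 1/n, off-diagonals ~ -1/n:
# raising one element forces the rest down
```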

Or maybe the constraining transform on `gp_` makes for identifiability issues between `gp_` and `gp_helper`?

Tagging some GP-knowledgeable folks that might weigh in to set me straight; @avehtari @rtrangucci @rok_cesnovar @betanalpha @spinkney @Niko