Likely-silly question on Gaussian Processes: can/should they be scale()'d?

As the title notes, I assume this is a silly idea since I’ve never seen it implemented, but to set the stage, consider the following Gaussian Process model:

data{
	int<lower=2> n_xy ;
	vector[n_xy] y ;
	array[n_xy] real x ;
}
parameters{
	real grand_mean ;
	real<lower=0> noise_amplitude ;
	real<lower=0> gp_amplitude ;
	real<lower=0> gp_lengthscale ;
	vector[n_xy] gp_helper ;
}
transformed parameters{
	// gp: latent (noiseless) GP
	vector[n_xy] gp = (
		cholesky_decompose(
			add_diag(
				gp_exp_quad_cov(
					x
					, gp_amplitude
					, gp_lengthscale
				)
				, square(gp_amplitude) * 1e-5 //jitter (on the variance scale, matching the rest of the covariance)
			)
		)
		* gp_helper // non-centered parameterization: implies gp ~ multi_normal(0, K)
	) ;
}
model{
	// Priors:
	grand_mean ~ ... ;
	noise_amplitude ~ ... ;
	gp_amplitude ~ ... ;
	gp_lengthscale ~ ... ;
	// gp_helper must be ~ std_normal()
	gp_helper ~ std_normal() ;
	// Likelihood:
	y ~ normal( grand_mean + gp , noise_amplitude ) ;
}

So we have some data y measured at known locations in a 1D space x, and we want to account for both Gaussian measurement noise and a smooth effect of measurement location. The magnitudes of these two factors are represented by the *_amplitude parameters, and the smoothness of the latter factor by the gp_lengthscale parameter. The possibility of a non-zero grand mean for the measurements is captured by an explicit grand_mean parameter.

While the above computes gp using gp_exp_quad_cov() and supplying an amplitude parameter, an equivalent (but probably slightly less efficient) model is:

...
transformed parameters{

	// gp_: latent (noiseless) GP with unit-amplitude
	vector[n_xy] gp_ = (
		cholesky_decompose(
			add_diag(
				gp_exp_quad_cov(
					x
					, 1.0
					, gp_lengthscale
				)
				,  1e-5 //jitter
			)
		)
		* gp_helper
	) ;

	// gp: latent (noiseless) GP
	vector[n_xy] gp = gp_ * gp_amplitude ;

}
...

What becomes clear in this latter formulation is that there’s an implicit assumption that, for gp_amplitude to play its role in scaling the latent vector, gp_ should have a consistent variance, ideally unit variance, since then we get to interpret gp_amplitude as the standard deviation of the latent gp. Presumably the std_normal() prior on gp_helper in the model block is intended to help keep the variance of gp_ consistent, but in my playing around the realized (sample) variance of gp_ can nonetheless vary quite a bit from draw to draw, even when all parameters other than gp_helper are held fixed.
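
To illustrate what I mean (a minimal simulation sketch, separate from the model above; since it has no parameters it can be run with algorithm=fixed_param), here are draws of a unit-amplitude, zero-mean GP at a fixed lengthscale, recording the realized mean and sd of each draw:

data{
	int<lower=2> n_xy ;
	array[n_xy] real x ;
	real<lower=0> gp_lengthscale ;
}
generated quantities{
	// one unit-amplitude, zero-mean GP realization per iteration
	vector[n_xy] gp_ = multi_normal_cholesky_rng(
		rep_vector(0.0, n_xy)
		, cholesky_decompose(
			add_diag(
				gp_exp_quad_cov( x , 1.0 , gp_lengthscale )
				, 1e-5 //jitter
			)
		)
	) ;
	real realized_mean = mean(gp_) ; // fluctuates around 0 across draws
	real realized_sd = sd(gp_) ; // varies from draw to draw, typically below 1 when the lengthscale is large relative to the span of x
}

Even though each element of gp_ has marginal mean 0 and marginal sd 1, the within-draw mean and sd wander around, which is the variability I’m describing above.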

So it occurs to me then that one can easily enforce a unit variance by:

...
transformed parameters{

	// gp_: latent (noiseless) GP with unit-amplitude
	vector[n_xy] gp_ = ...;
	gp_ = gp_ / sd(gp_) ;

	// gp: latent (noiseless) GP
	vector[n_xy] gp = gp_ * gp_amplitude ;

}

And similarly, one could additionally enforce a zero mean via:

...
transformed parameters{

	// gp_: latent (noiseless) GP with unit-amplitude
	vector[n_xy] gp_ = ...;
	gp_ = ( gp_ - mean(gp_) ) / sd(gp_) ;

	// gp: latent (noiseless) GP
	vector[n_xy] gp = gp_ * gp_amplitude ;

}

With gp_ constrained to a zero-mean-&-unit-variance vector in this way, I feel like the model is more “straightforward” as a data-generating process: the GP part speaks solely to the wiggliness of the effect of x on y, gp_amplitude deterministically encodes the magnitude of that effect, and the zero-mean centering deterministically isolates it from other factors in the model (ex. grand_mean). We do lose the ability for the GP to diverge from our explicit model formulation. For example, I don’t think the constrained version would fit well if there were additionally a linear effect of x; we’d have to add that effect explicitly, whereas the unconstrained version could find regions of the parameter space that implicitly reflect a linear effect of x. But that’s arguably a benefit, in that it forces us (if doing a proper workflow) to detect and add the missing pieces as explicit structure rather than letting the unconstrained GP’s flexibility try to absorb them.
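
For concreteness, adding such a linear effect explicitly to the constrained version might look something like this (just a sketch; x_slope is a hypothetical new parameter and everything elided is as above):

parameters{
	// ... as above, plus:
	real x_slope ; // explicit linear effect of x
}
...
model{
	// ... priors as above, plus:
	x_slope ~ ... ;
	// Likelihood with the linear trend as explicit structure:
	y ~ normal( grand_mean + x_slope * to_vector(x) + gp , noise_amplitude ) ;
}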

Not constraining in this way means that, for example, regions of the parameter space where gp_ has a non-zero realized mean are poorly distinguished from regions where gp_ has a zero realized mean but grand_mean is non-zero. Similarly, without the constraints, regions where gp_ has a lower realized variance are poorly distinguished from regions where gp_amplitude has a lower value.
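
One way to eyeball these trade-offs in the unconstrained model would be to track the realized moments of the latent vector, e.g. (a diagnostic sketch added to the first model above):

generated quantities{
	// realized (sample) moments of the latent GP, for checking how much posterior
	// mass trades off against grand_mean and gp_amplitude respectively
	real gp_realized_mean = mean(gp) ;
	real gp_realized_sd = sd(gp) ;
}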

So why don’t folks do this with GPs? Does the enforcement of zero-mean-&-unit-variance of the latent process induce some sort of pathology? So far in my playing I haven’t encountered anything obvious, but possibly I’m not exploring properly?

Oh! After thinking a bit, I wonder if the constrained parameterization doesn’t actually eliminate the non-identifiability I described above, but instead shoves it all into gp_? For example, the zero-mean constraint means that whenever some elements of gp_ are high, the others must be low, and ditto for the unit-variance constraint. But I wonder if this is a more appropriate/tractable kind of non-identifiability, since the specific shape of the data should be strongly informative in that regard?

Or maybe the constraining transform on gp_ creates identifiability issues between gp_ and gp_helper?

Tagging some GP-knowledgeable folks that might weigh in to set me straight; @avehtari @rtrangucci @rok_cesnovar @betanalpha @spinkney @Niko


This may or may not be helpful for a given model, but what it isn’t is a Gaussian process. Recall that, by definition, every finite-dimensional projection of a Gaussian process is specified by a multivariate normal density function. Introducing those constraints modifies those finite-dimensional projections, and hence the implicit infinite-dimensional object being represented.
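
Concretely, writing $m$ for the mean function and $k$ for the covariance function, for any finite collection of inputs $x_1, \ldots, x_n$ a Gaussian process $f$ satisfies

$$
\left( f(x_1), \ldots, f(x_n) \right) \sim \text{multi-normal}(\mu, K),
\qquad
\mu_i = m(x_i), \quad K_{ij} = k(x_i, x_j),
$$

and the constrained construction no longer satisfies this.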

An immediate practical consequence of this is that models over different numbers of input points will no longer be consistent with each other. Adding a new variable requires either changing the constraint on the initial variables, and hence their probabilistic behavior, or forcing the new variable to a fixed value. Similarly, removing a variable requires modifying the constraint, and hence the behavior of the other variables.

Math aside, is introducing constraints like this a good idea? I don’t personally think so. The mean and variance implied by a Gaussian process determine ensemble expectations, not the behavior of any one functional realization. A set of output values from a function modeled with a zero-mean Gaussian process prior will not in general average out to zero. Indeed, concentration-of-measure arguments tell us that typical realizations will exhibit fluctuations that pull the average away from zero. Again, the mean and variance are relevant to the ensemble expectation over many functional realizations, not to any one of them.

Of course one can always implement these constraints as a new, heuristic prior model that deviates from a Gaussian process, but then one has to argue that the new model is appropriate for the given analysis. These kinds of constraints often seem superficially helpful (indeed they’re often recommended as a hack for multi-factor modeling), but I find that they cause more trouble than they’re worth.
