Using 2D Gaussian process predictions within model

I don’t yet fully understand what degree of hierarchy you have in your data, but starting from the non-hierarchical case: a given lengthscale-and-amplitude combination uniquely defines a covariance matrix K, which should be computed only once in a model. If you have a model that necessitates multiple realizations from the multivariate normal implied by K, then (taking the non-centered case and referring to the variable names of my example) you need a separate gp_helper variable for each realization, which is most easily achieved by making gp_helper a matrix. To illustrate, let’s say you have two observed-data vectors in y that you want to model as separate GP realizations but with a shared lengthscale & amplitude:

data{
	int<lower=2> n_xy ;
	array[n_xy] real x ;
	matrix[n_xy,2] y ;
}
parameters{
	real grand_mean ;
	real<lower=0> noise_amplitude ;
	real<lower=0> gp_amplitude ;
	real<lower=0> gp_lengthscale ;
	matrix[n_xy,2] gp_helper ;
}
transformed parameters{
	// gp_L: cholesky-decomposed gp covariance
	matrix[n_xy,n_xy] gp_L = cholesky_decompose(
		add_diag(
			gp_exp_quad_cov(
				x
				, gp_amplitude
				, gp_lengthscale
			)
			, square(gp_amplitude) * 1e-5 //jitter, scaled to the marginal variance
		)
	) ;

	// gp: latent (noiseless) GP
	matrix[n_xy,2] gp ;
	gp[,1] = gp_L * gp_helper[,1] ;
	gp[,2] = gp_L * gp_helper[,2] ;
}
model{
	// Priors:
	grand_mean ~ ... ;
	noise_amplitude ~ ... ;
	gp_amplitude ~ ... ;
	gp_lengthscale ~ ... ;
	// gp_helper must be ~ std_normal()
	to_vector(gp_helper) ~ std_normal() ;
	// Likelihood:
	y[,1] ~ normal( grand_mean + gp[,1] , noise_amplitude ) ;
	y[,2] ~ normal( grand_mean + gp[,2] , noise_amplitude ) ;
}

(Note that the code is pedagogically aimed; a speedup could be achieved by vectorizing the likelihood, as sketched below.)
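For example, a minimal sketch of that vectorization, assuming the declarations above (to_vector flattens a matrix column-wise, so the two columns of y and gp stack into matching long vectors):

model{
	...
	// vectorized likelihood: one sampling statement covering both columns
	to_vector(y) ~ normal( grand_mean + to_vector(gp) , noise_amplitude ) ;
}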

Again, the above instantiates a model whereby the latent GPs are independent realizations of a common latent multivariate normal, so K only needs to be computed once. K also needs to be computed only once if the two latent GPs share a lengthscale but differ in amplitude: gp_exp_quad_cov(x,amp,ls) equals square(amp) times the unit-amplitude covariance, and the Cholesky factor of square(amp) * K is amp times the Cholesky factor of K, so

cholesky_decompose(gp_exp_quad_cov(x,amp,ls))*helper

is equal to

(cholesky_decompose(gp_exp_quad_cov(x,1.0,ls))*helper)*amp

So in the case of multiple amplitudes but a common lengthscale you’d do:

...
parameters{
...
	vector<lower=0>[2] gp_amplitude ;
...
}
transformed parameters{
	// gp_L: cholesky-decomposed gp covariance
	matrix[n_xy,n_xy] gp_L = cholesky_decompose(
		add_diag(
			gp_exp_quad_cov(
				x
				, 1.0
				, gp_lengthscale
			)
			, 1e-5 //jitter
		)
	) ;

	// gp: latent (noiseless) GP
	matrix[n_xy,2] gp ;
	gp[,1] = (gp_L * gp_helper[,1]) * gp_amplitude[1] ;
	gp[,2] = (gp_L * gp_helper[,2]) * gp_amplitude[2] ;
}
...

If the lengthscales are different though, you have to compute the covariance matrix anew for each:

...
parameters{
...
	vector<lower=0>[2] gp_lengthscale ;
...
}
transformed parameters{
	// gp_L: cholesky-decomposed gp covariance
	array[2] matrix[n_xy,n_xy] gp_L ;
	for( i in 1:2){
		gp_L[i] = cholesky_decompose(
			add_diag(
				gp_exp_quad_cov(
					x
					, 1.0
					, gp_lengthscale[i]
				)
				, 1e-5 //jitter
			)
		) ;
	}
	// gp: latent (noiseless) GP
	matrix[n_xy,2] gp ;
	gp[,1] = (gp_L[1] * gp_helper[,1]) * gp_amplitude[1] ;
	gp[,2] = (gp_L[2] * gp_helper[,2]) * gp_amplitude[2] ;
}
...

Now, hierarchy can be handled in different ways. One way is to impose a hierarchy on the GP parameters (amplitude & lengthscale) alone, in which case you have the above but with a few hyperparameters:

parameters{
	...
	real<lower=0> gp_lengthscale_mean ;
	real<lower=0> gp_lengthscale_sd ;
	// nb: a lower bound can't be combined with offset/multiplier in Stan
	vector<offset=gp_lengthscale_mean,multiplier=gp_lengthscale_sd>[2] gp_lengthscale ;
	real<lower=0> gp_amplitude_mean ;
	real<lower=0> gp_amplitude_sd ;
	vector<offset=gp_amplitude_mean,multiplier=gp_amplitude_sd>[2] gp_amplitude ;
}
...
model{
	...
	// Priors on hyperparameters
	gp_lengthscale_mean ~ ... ;
	gp_lengthscale_sd ~ ... ;
	gp_amplitude_mean ~ ... ;
	gp_amplitude_sd ~ ... ;
	// gp parameters as hierarchical
	gp_lengthscale ~ normal(gp_lengthscale_mean,gp_lengthscale_sd) ;
	gp_amplitude ~ normal(gp_amplitude_mean,gp_amplitude_sd) ;
}

(Notes: with a hierarchical structure imposed on positive-continuous variables like these, it usually works best to model them & their hyperparameters as normal on the log scale and then exponentiate where they’re used; a side benefit is that this avoids the lower=0 bound, which Stan won’t let you combine with offset/multiplier anyway. Also, I’ve non-centered using the offset/multiplier syntax here, but see here for code to automate switching between centered & non-centered.)

So if y had many columns, each reflecting observations from a uniquely-identifiable unit of observation (ex. a human participant in an experiment), the above implements a hierarchy on the GP parameters and will achieve inference on each individual’s latent GP function. Inference on the across-individuals average GP function might be achieved by averaging the individual functions in each sample from the posterior, but I prefer to instead explicitly model an average latent GP from which the individual GPs are deviations:

data{
	int<lower=2> n_xy ;
	array[n_xy] real x ;
	int<lower=2> n_Ids ;
	matrix[n_xy,n_Ids] y ;
}
parameters{
	real grand_mean ;
	real<lower=0> noise_amplitude ;
	matrix[n_xy,n_Ids+1] gp_helper ; //+1 for the group-level GP

	real log_gp_lengthscale_group ;
	real log_gp_lengthscale_mean ;
	real<lower=0> log_gp_lengthscale_sd ;
	vector<offset=log_gp_lengthscale_mean,multiplier=log_gp_lengthscale_sd>[n_Ids] log_gp_lengthscale ;

	real log_gp_amplitude_group ;
	real log_gp_amplitude_mean ;
	real<lower=0> log_gp_amplitude_sd ;
	vector<offset=log_gp_amplitude_mean,multiplier=log_gp_amplitude_sd>[n_Ids] log_gp_amplitude ;
}
transformed parameters{
	// concatenate the group-level parameters with the individual-level parameters & exponentiate
	vector[n_Ids+1] gp_lengthscale = exp( append_row( log_gp_lengthscale_group, log_gp_lengthscale ) ) ;
	vector[n_Ids+1] gp_amplitude = exp( append_row( log_gp_amplitude_group, log_gp_amplitude ) ) ;
	// gp: latent (noiseless) GP
	matrix[n_xy,n_Ids+1] gp ; // first is the group-level gp
	for( i in 1:(n_Ids+1)){
		gp[,i] = (
			cholesky_decompose(
				add_diag(
					gp_exp_quad_cov(
						x
						, 1.0
						, gp_lengthscale[i]
					)
					, 1e-5 //jitter
				)
			)
			* gp_helper[,i] 
			* gp_amplitude[i]
		) ;
	}
}
model{
	// Priors:
	grand_mean ~ ... ;
	noise_amplitude ~ ... ;
	log_gp_amplitude_group ~ ... ;
	log_gp_lengthscale_group ~ ... ;
	// Priors on hyperparameters
	log_gp_lengthscale_mean ~ ... ;
	log_gp_lengthscale_sd ~ ... ;
	log_gp_amplitude_mean ~ ... ;
	log_gp_amplitude_sd ~ ... ;
	// gp parameters as hierarchical
	log_gp_lengthscale ~ normal(log_gp_lengthscale_mean,log_gp_lengthscale_sd) ;
	log_gp_amplitude ~ normal(log_gp_amplitude_mean,log_gp_amplitude_sd) ;
	// gp_helper must be ~ std_normal()
	to_vector(gp_helper) ~ std_normal() ;

	// Likelihood:
	for( i in 1:n_Ids){
		y[,i] ~ normal( grand_mean + gp[,1] + gp[,i+1], noise_amplitude ) ;
	}
}

(Note that this time I did the hierarchies on the log scale, as mentioned in the notes above.)
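If you additionally want each individual's full latent function available for posterior summaries, a minimal sketch (the name f is my own) would append a generated quantities block:

generated quantities{
	// f: each individual's latent function (grand mean + group GP + individual deviation)
	matrix[n_xy,n_Ids] f ;
	for( i in 1:n_Ids){
		f[,i] = grand_mean + gp[,1] + gp[,i+1] ;
	}
}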

And of course you could model the noise and maybe the grand mean hierarchically too. And while I’ve implemented the hierarchies independently above, you could use a multivariate normal structure to instead capture correlations (ex. maybe people with relatively high amplitudes also have relatively long lengthscales); a sketch of that follows.
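Here's a minimal sketch of that last idea, building on the log-scale model above (the names gp_par_mean, gp_par_sd, gp_par_cor & gp_par_helper are my own; these pieces would replace the separate lengthscale & amplitude hierarchies):

parameters{
	...
	vector[2] gp_par_mean ; // means of [log-lengthscale, log-amplitude]
	vector<lower=0>[2] gp_par_sd ; // sds of [log-lengthscale, log-amplitude]
	cholesky_factor_corr[2] gp_par_cor ; // correlation between log-lengthscale & log-amplitude
	matrix[2,n_Ids] gp_par_helper ; // non-centered helper
}
transformed parameters{
	// each column holds [log-lengthscale, log-amplitude] for one individual
	matrix[2,n_Ids] log_gp_pars = (
		rep_matrix(gp_par_mean,n_Ids)
		+ diag_pre_multiply(gp_par_sd,gp_par_cor) * gp_par_helper
	) ;
	...
}
model{
	...
	gp_par_cor ~ lkj_corr_cholesky(1) ;
	to_vector(gp_par_helper) ~ std_normal() ;
	...
}

You'd then exponentiate the rows of log_gp_pars wherever the individual-level lengthscales & amplitudes are used above.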
