Using 2D Gaussian process predictions within model

Vanessa_Brown · January 27, 2022, 2:17pm

Hi all,

I’m trying to build a model in stan for a task similar to the one in this paper - there are tiles on a 2D grid that are each worth a certain number of points. The point values are spatially correlated on the grid according to a gaussian process, and work has shown that people engage in a GP-like generalization to infer values of hidden tiles. The model has two steps: 1) based on the current and past observed values for chosen tiles on the current grid, estimate the value of each tile (including unseen/unchosen ones) using a GP, and 2) based on the values inferred from the GP and a softmax choice rule, make choices that will maximize point values. I can find examples of GP predictions after model fitting but was having trouble figuring out how to do this appropriately within the model itself. I have a version of the model that runs but it is running very slowly, even with simulated data (hours when it should take a few minutes).

Below is my stan code, R code to run the model in rstan, and attached simulated data.
sim_calc_gp_softmax_sim_beta10_LS2_s20.Rdata (3.0 MB)
First of all, is my implementation of the GP (including the translations between 2D and 1D data) correct? And if so, are there ways to speed this up? Thanks in advance for your help.

Model code:

data {
  int<lower=1> nS; //number of participants
  int<lower=1> nB; //number of blocks per participant
  int<lower=1> nT; //total possible trials per block
  int<lower=1> nL; //number of tiles in grid row/column (assume square)
  
  int<lower=1> b_length [nS,nB]; //trials per block per participant
  
  int selected_x [nS,nB,nT];
  int selected_y [nS,nB,nT];
  int<lower=0,upper=nL*nL> selected_row [nS,nB,nT]; //X+Y coord of selected tile
  //calculate as (X-1)*nL+Y
  real points [nS,nB,nT]; //outcome of each choice
  
  int all_x[nL*nL]; //X coordinates of all tiles
  int all_y[nL*nL]; //Y coordinates of all tiles
}

transformed data {
  real delta = 1e-9;
  real sigma = 0.01; //assume variance
  
  //transformed coordinates
  row_vector[2] all_coords[nL*nL];
  for (ll in 1:(nL*nL)) {
     all_coords[ll,1] = all_x[ll];
     all_coords[ll,2] = all_y[ll];
  }
}

parameters {
  real<lower=0> length_scale; //aka rho or generalization parameter
  // real<lower=0> sigma; //variance
  real<lower=0> beta; //inverse temperature for choice function
}

  
model {
  //set up variables
  vector[nL*nL] pred_f_mu;
  matrix[nL*nL,nL*nL] cov_pred_f;
  matrix[nL*nL,nL*nL] diag_delta;
  vector[nL*nL] pred_f;
  
  //priors
  length_scale ~ inv_gamma(5,5);
  // sigma ~ std_normal();
  beta ~ normal(0,10);
  
  //loop through subjects, blocks, & trials
  for (s in 1:nS) {
    for (b in 1:nB) {
      row_vector[2] selected_coords[nT];
      
      for (t in 1:(b_length[s,b])) {
          
        matrix [t,t] K; //cov kernel for selected tiles
        matrix [t,t] L; //cholesky transformed cov kernel for selected tiles
        vector [t] f; //means of selected tiles
        vector [t] curr_points; //temp vector for type agreement
        
        vector[t] K_div_y1;
        vector[t] K_div_y1_2;
        matrix[t,nL*nL] k_x1_x2;
        matrix[t,nL*nL] v_pred;
        
        vector[t] mu = rep_vector(0,t);
        
        selected_coords[t,1] = selected_x[s,b,t];
        selected_coords[t,2] = selected_y[s,b,t];
        
        //predict values from GP  
        K = cov_exp_quad(selected_coords[1:t,],1.0,length_scale);
        for (tt in 1:(t)) {
          K[tt, tt] = K[tt, tt] + square(sigma);
        }
        L = cholesky_decompose(K);
        
        //fit GP to observed outcomes
        curr_points=to_vector(points[s,b,1:t]);
        curr_points ~ multi_normal_cholesky(mu,L);
        
        //predict all tile values from GP
        K_div_y1 = mdivide_left_tri_low(L,to_vector(points[s,b,1:t]));
        K_div_y1_2 = mdivide_right_tri_low(K_div_y1',L)';
        k_x1_x2 = cov_exp_quad(selected_coords[1:t,],all_coords,1.0,length_scale);
        pred_f_mu = (k_x1_x2' * K_div_y1_2);
      
        //softmax predicting next tile chosen
        if (t<b_length[s,b]) {
          selected_row[s,b,t+1] ~ categorical_logit(beta*pred_f_mu);
        }
        
        
      }
    }
  }
}

r code to run model in rstan:

in_data=load(data_path)

#set up data
nS=length(unique(in_data$ID))
nB=max(in_data$block_num)
nT=max(in_data$trial_block)
nL=11
all_coords=expand.grid(1:nL,1:nL)
all_x=all_coords[,2]
all_y=all_coords[,1]

b_length=array(data=0,dim=c(nS,nB))
block_type=array(data=0,dim=c(nS,nB))
selected_x=array(data=0,dim=c(nS,nB,nT))
selected_y=array(data=0,dim=c(nS,nB,nT))
selected_row=array(data=0,dim=c(nS,nB,nT))
points=array(data=0,dim=c(nS,nB,nT))

in_data$norm_points=(in_data$points-50)/100

for (s in 1:nS) {
  subj_data=in_data[in_data$ID==unique(in_data$ID)[s],]
  for (b in 1:nB) {
    block_data=subj_data[subj_data$block_num==b,]
    b_length[s,b]=dim(block_data)[1]
    block_type[s,b]=block_data$block_risky[1]
    
    for (t in 1:b_length[s,b]) {
      selected_x[s,b,t]=block_data$choice_x[t]+1
      selected_y[s,b,t]=block_data$choice_y[t]+1
      selected_row[s,b,t]=nL*(block_data$choice_x[t])+block_data$choice_y[t]+1
      points[s,b,t]=block_data$norm_points[t]
    }
  }
}


#run stan model
data_toest=list(nS=nS,nB=nB,nT=nT,nL=nL,b_length=b_length,block_type=block_type,
  selected_row=selected_row,points=points,all_x=all_x,all_y=all_y,selected_x=
    selected_x,selected_y=selected_y)
model_out=stan(file=model_name,data=data_toest,verbose=FALSE,save_warmup=FALSE,
    iter=1000,chains=1) #few iters and 1 chain for testing

Bob_Carpenter · February 1, 2022, 10:20pm

This is a complicated model, and I’m afraid I’m not enough of a GP expert to be able to intuit its structure just by looking at it. This one’s especially tricky because of the nested GP covariance.

There’s an example in the user’s guide of how to generate predictions from a GP. We usually do that in generated quantities, so I’m not 100% sure what you mean by doing it in the model. The RNGs could be moved up into the model block if something depended on them, but that wouldn’t then be generating just predictions but would be part of the model. The section to look for is on predictive inference with a Gaussian process.

Vanessa_Brown · February 1, 2022, 11:00pm

Thanks for taking a look at this! Yeah, I cribbed parts of the gp_pred_rng function from that section, but I assumed *_rng functions wouldn’t work in the model block so I tried to preserve the distribution rather than taking random draws from it. I need the predicted values in the model block rather than in generated quantities because they are then used in this line:

selected_row[s,b,t+1] ~ categorical_logit(beta*pred_f_mu);

along with the parameter beta to predict choices from those predicted values.

Let me know if there’s anything else I can clear up about what I’m trying to do.

Vanessa_Brown · February 14, 2022, 2:36pm

hi all,

I’m still stuck on this so any insights would be helpful (tagging a few people: @mike-lawrence @avehtari, any one else?). Besides running slowly, the above model can’t recover the parameters, particularly the length-scale, at all. I tried re-writing it based on Hierarchical spatial GP with uneven sampling coordinates to first calculate all values and then fit to the values of the observed tiles, see below, but this version runs even slower.

data {
  int<lower=1> nS; //number of participants
  int<lower=1> nB; //number of blocks per participant
  int<lower=1> nT; //total possible trials per block
  int<lower=1> nL; //number of tiles in grid row/column (assume square)
  
  int<lower=1> b_length [nS,nB]; //trials per block per participant
  int<lower=0,upper=1> block_type [nS,nB]; //0=safe,1=risky
  
  int selected_x [nS,nB,nT];
  int selected_y [nS,nB,nT];
  int<lower=0,upper=nL*nL> selected_row [nS,nB,nT]; //X+Y coord of selected tile
  //calculate as (X-1)*nL+Y
  real points [nS,nB,nT]; //outcome of each choice
  
  int all_x[nL*nL]; //X coordinates of all tiles
  int all_y[nL*nL]; //Y coordinates of all tiles
}

transformed data {
  real delta = 1e-9;
  real sigma = 0.01; //assume variance
  
  //transformed coordinates
  row_vector[2] all_coords[nL*nL];
  for (ll in 1:(nL*nL)) {
     all_coords[ll,1] = all_x[ll];
     all_coords[ll,2] = all_y[ll];
  }
}

parameters {
  real<lower=0> length_scale; //aka rho or generalization parameter
  // real<lower=0> sigma; //variance
  real<lower=0> beta; //inverse temperature for choice function
 
  vector [nL*nL] eta; //latent variable GP
}
  
model {
  
  //priors
  
  length_scale ~ inv_gamma(5,5);
  // sigma ~ std_normal();
  beta ~ normal(0,10);
  
 eta ~ std_normal();
  
  //loop through subjects, blocks, & trials
  for (s in 1:nS) {
    for (b in 1:nB) {
      if (block_type[s,b]==0) { //only calculate safe blocks for now to help with debugging
        
        //first set up GP-inferred values for all tiles in grid
        matrix[nL*nL,nL*nL] K_all;
        matrix[nL*nL,nL*nL] L_all;
        vector[nL*nL] f_all;
        K_all = cov_exp_quad(all_coords,1,length_scale);
        K_all = add_diag(K_all, 1e-9);
        L_all = cholesky_decompose(K_all);
        f_all = L_all * eta;
        
        for (t in 1:(b_length[s,b])) { //loop through trials in block
          
          //predict points for current trial from GP-inferred values
          points[s,b,t] ~ normal(f_all[selected_row[s,b,t]],sigma);
        
          //softmax predicting next tile chosen
          if (t<b_length[s,b]) {
            selected_row[s,b,t+1] ~ categorical_logit(beta*f_all); 
          }
          
        }
      }
    }
  }
}

mike-lawrence · February 14, 2022, 5:27pm

Over on Slack @avehtari and I have a thread where we recently discussed a path to speed-ups for cases like yours where there is high redundancy in the covariance matrix (in your case thanks to the grid layout of your isotropic space). Code implementing my unique-distances trick is here (though n.b. the proportion-of-variation approach would need tweaking for more appropriate real-world usage), and here’s a plot of the speedups over gp_exp_quad_cov() I get:

Note that the above plot only shows the timings for just the covariance computation, and doesn’t include the more expensive cholesky decomposition time; with that added to both, the max speedup is only about 1.2x. But @avehtari suggests in the Slack thread that the circulant-matrix approach will make the cholesky decomposition (or whatever is used in it’s place; I haven’t clicked through to the examples he provided) should be faster too.

Note that the code at the repo linked above handles the 1D case, but should be easily generalized to your 2D isotropic case.

mike-lawrence · February 14, 2022, 5:41pm

Oh! I also think I see something wrong with your model as you show in your last post. In the model block you’re looping over blocks nested in a loop over subjects, and at the innermost level of the loops you compute K_all/L_all/F_all. But none of those computations involve variables that change across iterations of the loops, so you’re just repeating the same computation nS\times nB times unnecessarily.

Additionally, I presume you intend f_all to be a realization of a latent noiseless GP, but for this to be the case you’d need to multiply by a std_normal() variate. That is, the usual pattern is (here I’m just copying code for a simple 1D GP):

data{
	int<lower=2> n_xy ;
	vector[n_xy] y ;
	array[n_xy] real x ;
}
parameters{
	real grand_mean ;
	real<lower=0> noise_amplitude ;
	real<lower=0> gp_amplitude ;
	real<lower=0> gp_lengthscale ;
	vector[n_xy] gp_helper ;
}
transformed parameters{
	// gp: latent (noiseless) GP
	vector[n_xy] gp = (
		cholesky_decompose(
			add_diag(
				gp_exp_quad_cov(
					x
					, gp_amplitude
					, gp_lengthscale
				)
				, gp_amplitude * 1e-5 //jitter
			)
		)
		* gp_helper
	) ;
}
model{
	// Priors:
	grand_mean ~ ... ;
	noise_amplitude ~ ... ;
	gp_amplitude ~ ... ;
	gp_lengthscale ~ ... ;
	// gp_helper must be ~ std_normal()
	gp_helper ~ std_normal() ;
	// Likelihood:
	y ~ normal( grand_mean + gp , noise_amplitude ) ;
}

Note the parameter gp_helper, which has a std_normal() “prior” and is multiplied by the output of the cholesky decomposition to instantiate the noiseless-GP realization gp. In your first model you employ a centered parameterization that doesn’t have/need that helper variable/multiplication, but I suspect you’ll want to use non-centered as is in your second model given you’re probably in a data-sparse regime.

Finally, when it comes to the later linking of one (or more, if you’re doing a hierarchical model) latent GPs to the likelihood/data, I suspect that indexing will let you skip some loops to instead vectorize (which also opens the avenue for using reduce-sum).

Vanessa_Brown · February 14, 2022, 10:41pm

thank you! much appreciated. I will take a look at your speed up code and try to implement it. Can anyone join the stan slack? I get a ‘need to sign in’ page but can set up an account with that workspace if it’s allowed.

I will also add the std_normal() variate - thanks for catching. GPs are new to me and going between the centered/noncentered version on top of everything else is messing with my head.

For the [K/L/f]_all variables, conceptually I get that they only need to be initialized once. Where I get lost, though, is where the information gained from actually observing the outcomes comes in - how are the predictions for unobserved tiles ([K/L/f]_all variables) getting informed by the observed tiles if those predictions are only initialized once for all grids? I feel like I’m missing something about how this all connects.

avehtari · February 15, 2022, 11:18am

You could use Hilbert space basis function approximation for 2D GP. See details and example in [2004.11408] Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. As HS-GP is presented as linear combination of basis functions, it is very easy to use as part of more elaborate models. See, e.g. Gaussian process demonstration with Stan and Birthdays

mike-lawrence · February 15, 2022, 3:47pm

I don’t yet understand fully what degree of hierarchy you have in your data, but starting from a non-hierarchical case, a given lengthscale and amplitude combination uniquely define a covariance matrix K, which should be computed only once in a model. If you have a model that necessitates multiple realizations from the multivariate normal implied by K, then (taking the non-centered case and referring to variable names of my example) you need a separate gp_helper variable for each realization, which can most easily be achieved by making gp_helper a matrix. To illustrate, let’s say you have two observed-data vectors in y that you want to model as separate GP realizations but with a shared lengthscale & amplitude:

data{
	int<lower=2> n_xy ;
	array[n_xy] real x ;
	matrix[n_xy,2] y ;
}
parameters{
	real grand_mean ;
	real<lower=0> noise_amplitude ;
	real<lower=0> gp_amplitude ;
	real<lower=0> gp_lengthscale ;
	matrix[n_xy,2] gp_helper ;
}
transformed parameters{
	// gp_L: cholesky-decomposed gp covariance
	matrix[n_xy,n_xy] gp_L = cholesky_decompose(
		add_diag(
			gp_exp_quad_cov(
				x
				, gp_amplitude
				, gp_lengthscale
			)
			, gp_amplitude * 1e-5 //jitter
		)
	) ;

	// gp: latent (noiseless) GP
	matrix[n_xy,2] gp ;
	gp[,1] = gp_L * gp_helper[,1] ;
	gp[,2] = gp_L * gp_helper[,2] ;
}
model{
	// Priors:
	grand_mean ~ ... ;
	noise_amplitude ~ ... ;
	gp_amplitude ~ ... ;
	gp_lengthscale ~ ... ;
	// gp_helper must be ~ std_normal()
	to_vector(gp_helper) ~ std_normal() ;
	// Likelihood:
	y[,1] ~ normal( grand_mean + gp[,1] , noise_amplitude ) ;
	y[,2] ~ normal( grand_mean + gp[,2] , noise_amplitude ) ;
}

(note that code is pedagogically-aimed; speedup could be achieved with indexing to vectorize the likelihood)

Again, the above instantiates a model whereby the latent GPs are independent realizations of a common latent multivariate normal, so K only needs to be computed once. K also only needs to be computed once if the two latent GPs share a lengthscale but differ in amplitude, since

cholesky_decompose(gp_exp_quad_cov(x,amp,ls))*helper

is equal to

(cholesky_decompose(gp_exp_quad_cov(x,1.0,ls))*helper)*amp

So in the case of multiple amplitudes but common lengthscale you’d do:

...
parameters{
...
	vector<lower=0>[2] gp_amplitude ;
...
}
transformed parameters{
	// gp_L: cholesky-decomposed gp covariance
	matrix[n_xy,n_xy] gp_L = cholesky_decompose(
		add_diag(
			gp_exp_quad_cov(
				x
				, 1.0
				, gp_lengthscale
			)
			,  1e-5 //jitter
		)
	) ;

	// gp: latent (noiseless) GP
	matrix[n_xy,2] gp ;
	gp[,1] = (gp_L * gp_helper[,1]) * gp_amplitude[1] ;
	gp[,2] = (gp_L * gp_helper[,2]) * gp_amplitude[2] ;
}
...

If the lengthscales are different though, you have to compute the covariance matrix anew for each:

...
parameters{
...
	vector<lower=0>[2] gp_lengthscale ;
...
}
transformed parameters{
	// gp_L: cholesky-decomposed gp covariance
	array[2] matrix[n_xy,n_xy] gp_L ;
	for( i in 1:2){
		gp_L[i] = 	cholesky_decompose(
				add_diag(
					gp_exp_quad_cov(
						x
						, 1.0
						, gp_lengthscale[i]
					)
					,  1e-5 //jitter
				)
			) ;
	}
	// gp: latent (noiseless) GP
	matrix[n_xy,2] gp ;
	gp[,1] = (gp_L[1] * gp_helper[,1]) * gp_amplitude[1] ;
	gp[,2] = (gp_L[2] * gp_helper[,2]) * gp_amplitude[2] ;
}
...

Now, hierarchy can be handled different ways. One way is to impose a hierarchy on the GP parameters (amplitude & lengthscale) alone, in which case you have the above but with a few hyper-parameters:

parameters{
	...
	real<lower=0> gp_lengthscale_mean ;
	real<lower=0> gp_lengthscale_sd ;
	vector<lower=0,offset=gp_lengthscale_mean,multiplier=gp_lengthscale_sd>[2] gp_lengthscale ;
	real<lower=0> gp_amplitude_mean ;
	real<lower=0> gp_amplitude_sd ;
	vector<lower=0,offset=gp_amplitude_mean,multiplier=gp_amplitude_sd>[2] gp_amplitude ;
}
...
model{
	...
	// Priors on hyperparameters
	gp_lengthscale_mean  ~ ... ;
	gp_lengthscale_sd ~ ...;
	gp_amplitude_mean ~ ...;
	gp_amplitude_sd ~ ... ;
	// gp parameters as hierarchical
	gp_lengthscale ~ normal(gp_lengthscale_mean,gp_lengthscale_sd) ;
	gp_amplitude ~ normal(gp_amplitude_mean,gp_amplitude_sd) ;
}

(notes: with a hierarchical structure imposed on positive-continuous variables like these, it usually works best to model them & their hyperparameters as normal on the log scale then exponentiate when they’re used; also, I’ve non-centered using the offset/multiplier syntax here, but see here for code to automate switching between centered/non-centered)

So if y had many columns, each reflecting observations from a uniquely-identifiable unit of observation (ex. human participant in an experiment), the above implements a hierarchy on the GP parameters, and will achieve inference on each individual’s latent GP function. Inference on the across-individual average GP function might be achieved by averaging the individual functions in each sample from the posterior, but I prefer to instead explicitly model an average latent GP from which the individual GPs are deviations:

data{
	int<lower=2> n_xy ;
	array[n_xy] real x ;
	int<lower=2> n_Ids ;
	matrix[n_xy,n_Ids] y ;
}
parameters{
	real grand_mean ;
	real<lower=0> noise_amplitude ;
	matrix[n_xy,n_Ids+1] gp_helper ; //+1 for the group-level GP

	real log_gp_lengthscale_group ;
	real log_gp_lengthscale_mean ;
	real<lower=0> log_gp_lengthscale_sd ;
	vector<lower=0,offset=log_gp_lengthscale_mean,multiplier=log_gp_lengthscale_sd>[n_Ids] gp_lengthscale ;

	real log_gp_amplitude_group ;
	real log_gp_amplitude_mean ;
	real<lower=0> log_gp_amplitude_sd ;
	vector<lower=0,offset=log_gp_amplitude_mean,multiplier=log_gp_amplitude_sd>[n_Ids] log_gp_amplitude ;
}
transformed parameters{
	// concatenate the group-level parameters with the individual level parameters & exponentiate
	vector[n_Ids+1] gp_lengthscale = exp( append_row( log_gp_lengthscale_group, log_gp_lengthscale ) ) ;
	vector[n_Ids+1] gp_amplitude = exp( append_row( log_gp_amplitude_group, log_gp_amplitude ) ) ;
	// gp: latent (noiseless) GP
	matrix[n_xy,n_Ids+1] gp ; // first is the group-level gp
	for( i in 1:(n_Ids+1)){
		gp[,i] = (
			cholesky_decompose(
				add_diag(
					gp_exp_quad_cov(
						x
						, 1.0
						, gp_lengthscale[i]
					)
					,  1e-5 //jitter
				)
			)
			* gp_helper[,i] 
			* gp_amplitude[i]
		) ;
	}
}
model{
	// Priors:
	grand_mean ~ ... ;
	noise_amplitude ~ ... ;
	log_gp_amplitude_group ~ ... ;
	log_gp_lengthscale_group ~ ... ;
	// Priors on hyperparameters
	log_gp_lengthscale_mean  ~ ... ;
	log_gp_lengthscale_sd ~ ...;
	log_gp_amplitude_mean ~ ...;
	log_gp_amplitude_sd ~ ... ;
	// gp parameters as hierarchical
	log_gp_lengthscale ~ normal(log_gp_lengthscale_mean,log_gp_lengthscale_sd) ;
	log_gp_amplitude ~ normal(log_gp_amplitude_mean,log_gp_amplitude_sd) ;
	// gp_helper must be ~ std_normal()
	to_vector(gp_helper) ~ std_normal() ;

	// Likelihood:
	for( i in 1:n_Ids){
		y[,i] ~ normal( grand_mean + gp[,1] + gp[,i+1], noise_amplitude ) ;
	}
}

(note this time I did the log-scale hierarchies as I mentioned above)

And of course you could model noise and maybe the grand mean hierarchically too. And while I’ve implemented the hierarchies independently above, you could use a multivariate normal structure to instead capture correlations (ex. maybe people with relatively high amplitudes also have relatively long lengthscales).

Vanessa_Brown · February 15, 2022, 4:48pm

Thank you Aki! I will take a look at this paper and the case studies to see if I can implement the HS-GP.

Vanessa_Brown · February 15, 2022, 4:55pm

ahhh that was the piece that was missing for me. So I will want to implement multiple realizations of the same covariance matrix but pulling from a different gp_helper. There are actually two semi-hierarchical setups here - I’m assuming the length_scale and beta parameters will differ by person (which is not yet implemented in this version of the code), but each person also sees different grids that I’m assuming have the same length-scale and beta parameters (within each person) but are different realizations of the GP. I’m pretty familiar with hierarchical models outside of GPs so I think I understand this well enough to use now. I’ll try out these suggestions and report back with the (hopefully) functioning version. thanks very very much!

avehtari · February 18, 2022, 7:36pm

In case of HSGPS, when the lengthscale changes, you would need to recompute only the spectral densities which is a vector of length m, where m is the number of basis functions. This is much faster than constructing covariance matrices.

If by the grid you mean different values of x, there is a bit of additional book keeping, but should be doable

Topic		Replies	Views
Could you please help to find out why my GP model got slower Modeling gaussian-process	12	1227	April 7, 2021
Speeding up gaussian process model for spatial prediction Modeling fitting-issues , performance , gaussian-process , spatial	4	1842	July 2, 2020
Gaussian process with hierarchical Gaussian process prior Modeling gaussian-process	15	3007	February 27, 2019
Scalable Gaussian process inference Publicity gaussian-process	27	2397	March 25, 2024
Gaussian Process on HPC \| Issues and speeding up Modeling fitting-issues , performance , gaussian-process , simulation-based-calibration	18	1613	November 25, 2020

Using 2D Gaussian process predictions within model

Related topics