Gaussian process with hierarchical Gaussian process prior


#1

I have a data set containing multiple individuals within N2 groups, where the individuals are related by the covariance matrix A. I would like to model y as a GP in which each group has its own error variance (s2_e), shared by all individuals in that group, and to model the vector of group variances s2_e itself as a GP. When I run the following model, I get a very low n_eff. Is there a reason for this? Is it a problem with the hacky use of errorvec as an indicator of which s2_e to use?

data {
  int N;
  int N2;
  vector[N] y;
  matrix[N,N] A;
  matrix[N2,N2] A2;
  int errorvec[N]; //input in R as a vector of integers
}
parameters {
  vector<lower=0>[N2] s2_e;
  real<lower=0> s2_e2;
  real<lower=0> mu;
}
model {
  matrix[N, N] K;
  matrix[N2, N2] K2;
  
  // K is A plus a group-specific error variance on the diagonal
  K = A;
  for (n in 1:N)
    K[n, n] = A[n, n] + s2_e[errorvec[n]];

  // K2 is A2 plus a common error variance on the diagonal
  K2 = A2;
  for (n in 1:N2)
    K2[n, n] = A2[n, n] + s2_e2;
  
//removed cholesky for readability
  y ~ multi_normal(rep_vector(0, N), K); 
  mu ~ normal(0,1);
  s2_e ~ multi_normal(rep_vector(mu, N2), K2);
  s2_e2 ~ normal(0,1);
  
}

#2

You might improve your n_eff by using a non-centered parameterization (described in the manual in the section “Cholesky Factored and Transformed Implementation”), but I think your model might also have a couple of issues.
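
To make the non-centered idea concrete, here's a minimal sketch of what it could look like for your s2_e prior (just the pieces that change, reusing your data block; the name s2_e_tilde is mine, and this ignores the <lower=0> constraint on s2_e, so in practice you'd probably put the GP on log(s2_e) instead):

parameters {
  real<lower=0> s2_e2;
  real<lower=0> mu;
  vector[N2] s2_e_tilde;                // standard normal "raw" parameters
}
transformed parameters {
  vector[N2] s2_e;
  {
    matrix[N2, N2] K2 = A2;
    for (n in 1:N2)
      K2[n, n] = A2[n, n] + s2_e2;
    // shifting and scaling the standard normals implies
    // s2_e ~ multi_normal(rep_vector(mu, N2), K2)
    s2_e = rep_vector(mu, N2) + cholesky_decompose(K2) * s2_e_tilde;
  }
}
model {
  s2_e_tilde ~ normal(0, 1);
  // ... rest of the model as before ...
}

The sampler then sees (roughly) independent standard normals instead of a correlated vector, which is usually where the n_eff gains come from.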

I think a hierarchical GP usually means you have one vector (call it z) that you put a GP prior on, with zero mean and some covariance, and then a bunch of other vectors (call them q, and let there be N of them) that you give GP priors with mean equal to z and some covariance. Something like:

z \sim \text{multi_normal}(0, K_z)
q_i \sim \text{multi_normal}(z, K_q)
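
In Stan-ish terms the structure is something like the sketch below (the names and sizes are placeholders, and the covariances are passed in as fixed data just to keep it short):

data {
  int<lower=1> N;        // number of individual-level vectors
  int<lower=1> M;        // length of each vector
  matrix[M, M] K_z;      // group-level covariance
  matrix[M, M] K_q;      // individual-level covariance
}
parameters {
  vector[M] z;           // shared GP-distributed mean vector
  vector[M] q[N];        // N vectors centered on z
}
model {
  z ~ multi_normal(rep_vector(0, M), K_z);
  for (i in 1:N)
    q[i] ~ multi_normal(z, K_q);
}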

I could be wrong here though. Check this out: http://mc-stan.org/events/stancon2017-notebooks/stancon2017-trangucci-hierarchical-gps.pdf

I’m also wondering about the s2_e2 variables. That's quite a few extra degrees of freedom, and even if they're okay mathematically, they can make a model like this really difficult to fit. It might be worth removing them while you're developing the rest of the model and adding them back later if you want.


#3

s2_e2 is a single parameter. I removed the Cholesky factoring for readability, as mentioned in the original post.


#4

I think multi_normal_cholesky will give you a speed increase. Usually, though, low n_eff is a result of poor model specification.
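
For the likelihood in your model, that would look something like this (a sketch; K is filled in exactly as in your original code):

model {
  matrix[N, N] K;
  matrix[N, N] L_K;
  // ... fill in K as before ...
  L_K = cholesky_decompose(K);
  y ~ multi_normal_cholesky(rep_vector(0, N), L_K);
}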

Let’s think about HMC for a second. The sampler uses gradients to move toward regions of high posterior mass with respect to the parameters in your model. If the model is poorly specified, it can end up exploring some tail of the distribution where, conditioned on your model, there is essentially nothing there, and it can't generate samples well.

Regarding GPs - we generate a covariance matrix with respect to parameters, and we approximate the posterior distributions of those parameters. Then we can generate predictive distributions from the kernel we have fit. But having a covariance matrix doesn't by itself imply a Gaussian process: you can compute a covariance between anything and define any covariance function, and it could be meaningless.

Here, your covariance matrix is fixed: matrix[N,N] A;. For a Gaussian process, one would usually define a covariance function like the one below. It's called the squared exponential kernel by the Stan people, and the radial basis function (RBF) by almost everyone else:

k(x, x') = \sigma^2 \exp\left(-\frac{d(x, x')^2}{2l^2}\right)

So at each iteration your covariance matrix is generated with respect to the parameters: the magnitude \sigma^2 and the length scale l. The magnitude controls how far the function deviates vertically (for a 1-D GP), and the length scale controls how strongly two inputs x_1 and x_2 are related along that axis. I can't see any way in which your current implementation does this.
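
In Stan this is what cov_exp_quad (called gp_exp_quad_cov in newer releases) computes for you; written out by hand for 1-D inputs it's roughly the following (sq_exp_cov is just my name for the helper):

functions {
  // squared exponential / RBF kernel:
  // K[i, j] = sigma^2 * exp(-(x[i] - x[j])^2 / (2 * l^2))
  matrix sq_exp_cov(vector x, real sigma, real l) {
    int N = rows(x);
    matrix[N, N] K;
    for (i in 1:N)
      for (j in 1:N)
        K[i, j] = square(sigma) * exp(-square(x[i] - x[j]) / (2 * square(l)));
    return K;
  }
}

Because K is rebuilt from sigma and l at every iteration, the covariance structure is learned from the data rather than fixed up front.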

Another note: prior distributions on GP hyperparameters are still something of an open problem. Are you sure this is a good idea? I mean, do whatever you want, but your computational issues could be indicating something. I think Andrew has a saying about that somewhere.

Also, deep GPs are “a thing”. Composite functions, essentially.

With that said, here's a proper implementation of a Gaussian process regression, which can be found in Betancourt's case studies (recommended), among other places.


functions {
  vector gp_pred_rng(vector[] x_pred,
                     vector y1, vector[] x,
                     real magnitude, real length_scale,
                     real sigma) {
    int N = rows(y1);
    int N_pred = size(x_pred);
    vector[N_pred] f2;
    {
      // marginal covariance of the observed data (noise variance on the diagonal)
      matrix[N, N] K = add_diag(gp_exp_quad_cov(x, magnitude, length_scale),
                                square(sigma));
      matrix[N, N] L_K = cholesky_decompose(K);
      vector[N] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
      vector[N] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';
      matrix[N, N_pred] k_x_x_pred = gp_exp_quad_cov(x, x_pred, magnitude, length_scale);
      // posterior predictive mean and covariance of the latent function
      vector[N_pred] f2_mu = k_x_x_pred' * K_div_y1;
      matrix[N, N_pred] v_pred = mdivide_left_tri_low(L_K, k_x_x_pred);
      matrix[N_pred, N_pred] cov_f2 = gp_exp_quad_cov(x_pred, magnitude, length_scale)
                                      - v_pred' * v_pred;
      // draw latent function values; the small jitter keeps cov_f2 positive definite
      f2 = multi_normal_rng(f2_mu, add_diag(cov_f2, 1e-10));
    }
    return f2;
  } 
}
data {
  int<lower=1> N;
  int<lower=1> D;
  vector[D] x[N];
  vector[N] y;

  int<lower=1> N_pred;
  vector[D] x_pred[N_pred];
}
transformed data {
  vector[N] mu;
  mu = rep_vector(0, N);
}
parameters {
  real<lower=0> magnitude;
  real<lower=0> length_scale;
  real<lower=0> sigma;
}
transformed parameters {
  matrix[N, N] L_K;
  {
    matrix[N, N] K = gp_exp_quad_cov(x, magnitude, length_scale);
    K = add_diag(K, square(sigma));
    L_K = cholesky_decompose(K);
  }
}
model {
  magnitude ~ normal(0, 3);
  length_scale ~ inv_gamma(5, 5);
  sigma ~ normal(0, 1);
  
  y ~ multi_normal_cholesky(mu, L_K);
}
generated quantities {
  vector[N_pred] f_pred = gp_pred_rng(x_pred, y, x, magnitude, length_scale, sigma);
  vector[N_pred] y_pred;
  
  for (n in 1:N_pred) y_pred[n] = normal_rng(f_pred[n], sigma);
}

Also, here’s a shameless plug for GPs, one of my many unfinished projects:

Reading this should help with what you’re actually modeling a bit.

Hope this helps!

EDIT: on another note, I was playing with GPs with different noises the other day and HMC/Stan had a lot of trouble with identifiability and some divergences. Terrible posterior geometry.

idk bro good luck


#5

Hey!
Sorry, do you mean, for example, a GP with noise that is not Gaussian?
Because I am trying to fit a model based on this logic (my random noise is Student-t), and although I think
the model is right, the resulting plots do not correspond to what I expect.


#6

Hi char, can you please post the model that you’re working with and the data, either real data or a simulated dataset? And the plots?


#7

I tried to implement an algorithm in R which generates the target values with a bit of noise, based on an example in the paper "Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification" (Neal, 1997).

Target Distribution with adjusted deviations

f.x = function(x){
  s = sample(c(1, 2), size = 1, replace = TRUE, prob = c(0.95, 0.05))
  ifelse(s == 1,
         0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1 + x^2) + rnorm(1, 0, 0.1),
         0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1 + x^2) + rnorm(1, 0, 1))
}
f.x = Vectorize(f.x)

#Generate train and test set

#set.seed(123)
x = seq(-3,3,length=200)
targets = f.x(x)
dataset = data.frame(cbind(x,targets))
train = (data.frame(cbind(train=dataset$x[1:100],targets=dataset$targets[1:100])))
test = (data.frame(cbind(test=dataset$x[101:200],targets=dataset$targets[101:200])))

I assumed Student-t distributed noise, and my model looks like this:

functions{
// radial basis, or squared exponential, covariance function
matrix rbf(int Nx, vector x, int Ny, vector y, real eta, real rho){
  matrix[Nx, Ny] K;
  for (i in 1:Nx) {
    for (j in 1:Ny) {
      K[i, j] = eta * exp(-rho * pow(x[i] - y[j], 2));
    }
  }
  return K;
}

vector post_pred_rng(real dof,real s,real eta, real rho, real sn, int No, vector xo, int Np, vector xp, vector yobs){
		matrix[No,No] Ko;
		matrix[Np,Np] Kp;
		matrix[No,Np] Kop;
		matrix[Np,No] Ko_inv_t;
		vector[Np] mu_p;
		matrix[Np,Np] Tau;
		matrix[Np,Np] L2;
		vector[Np] f_predic;
		//vector[Np] Yp;
// use of matrix multiplication for the sn noise
		Ko = rbf(No, xo, No, xo, eta, rho) + diag_matrix(rep_vector(1, No))*sn ;
		Kp = rbf(Np, xp, Np, xp, eta, rho) ;
		Kop = rbf(No, xo, Np, xp, eta, rho) ;
		Ko_inv_t = Kop' / Ko;
		mu_p = Ko_inv_t * yobs;
		Tau = Kp - Ko_inv_t * Kop;
		L2 = cholesky_decompose(Tau);
		f_predic = mu_p + L2*rep_vector(normal_rng(0,1), Np);
		return f_predic;
}

}
data{
int<lower=1> N1;
int<lower=1> N2;
vector[N1] X;
vector[N1] Y;
vector[N2] Xp ;
int<lower=1, upper=N2> observed_idx[N1];
}
transformed data{
vector[N1] mu;
for(n in 1:N1) mu[n] = 0;
}
parameters{
real<lower=0> eta_sq;
real<lower=0> inv_rho_sq;
real<lower=5> dof; // degrees of freedom for the Student-t noise
real<lower=0> s; // scale parameter for the Student-t noise
vector[N1] f_tilde; // standard normal draws, one per training point
}
transformed parameters{
real<lower=0> rho_sq;
rho_sq = inv(inv_rho_sq);

}
model{
  matrix[N1,N1] Sigma;
  matrix[N1,N1] L_S;
  vector[N1] f_predict;
  // priors
  eta_sq ~ normal(0,1);
  inv_rho_sq ~ cauchy(0,5);
  s ~ student_t(4,0,1);
  dof ~ normal(4.5,0.1);
  // non-centered GP: f_predict = L_S * f_tilde implies f_predict ~ multi_normal(0, Sigma)
  f_tilde ~ normal(0, 1);
  Sigma = rbf(N1, X, N1, X, eta_sq, rho_sq) + diag_matrix(rep_vector(1, N1))*s;
  L_S = cholesky_decompose(Sigma);
  f_predict = L_S * f_tilde;
  // Student-t observation noise around the latent GP
  Y ~ student_t(dof, f_predict, s);
}
generated quantities{
vector[N2] f_pred;
vector[N2] Ypred;
f_pred = post_pred_rng(dof,s,eta_sq, rho_sq, s,N1, X, N2, Xp, Y);
for (n in 1:N2)
Ypred[n] = student_t_rng(dof, f_pred[n], s);

}

This is the resulting plot for the Student-t noise: