Gaussian process with hierarchical Gaussian process prior

I have a data set containing multiple individuals within N2 groups that are related by the covariance matrix A. I would like to model y as a GP in which each group has a separate error variance (s2_e) that is the same within a group, and to model s2_e itself as a GP. When I run the following model, I get a very low n_eff. Is there a reason for this? Is it a problem with the hacky use of errorvec as an indicator of which s2_e to use?

data {
  int N;
  int N2;
  vector[N] y;
  matrix[N,N] A;
  matrix[N2,N2] A2;
  int errorvec[N]; //input in R as a vector of integers
}
parameters {
  vector<lower=0>[N2] s2_e;
  real<lower=0> s2_e2;
  real<lower=0> mu;
}
model {
  matrix[N, N] K;
  matrix[N2, N2] K2;

  K = A;   // start from the relatedness matrix so the off-diagonals are defined
  for (n in 1:N)
    K[n, n] = A[n, n] + s2_e[errorvec[n]];

  K2 = A2; // same for the group-level covariance
  for (n in 1:N2)
    K2[n, n] = A2[n, n] + s2_e2;
  
//removed cholesky for readability
  y ~ multi_normal(rep_vector(0, N), K); 
  mu ~ normal(0,1);
  s2_e ~ multi_normal(rep_vector(mu, N2), K2);
  s2_e2 ~ normal(0,1);
  
}

You might improve your n_eff by using a non-centered parameterization (described in the manual in the section “Cholesky Factored and Transformed Implementation”), but I think your model might also have a couple of issues.
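
For concreteness, a non-centered (Cholesky-factored) multivariate normal prior looks roughly like the sketch below. The names (s2_e_tilde) are made up, K2 would have to be built in transformed parameters, and since s2_e is a variance you would really want to do this on the log scale:

parameters {
  vector[N2] s2_e_tilde;   // standard normal "raw" parameters
}
transformed parameters {
  // implies s2_e ~ multi_normal(rep_vector(mu, N2), K2),
  // but the sampler explores the better-behaved s2_e_tilde
  vector[N2] s2_e = rep_vector(mu, N2) + cholesky_decompose(K2) * s2_e_tilde;
}
model {
  s2_e_tilde ~ normal(0, 1);
}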

I think a hierarchical GP usually means you have one vector (call it z) that you put a GP prior on, with zero mean and some covariance, and then you have a bunch of other vectors (call them q, and let there be N of them) that you give GP priors with mean equal to z and some covariance. Something like:

z \sim \text{multi_normal}(0, K_z)
q_i \sim \text{multi_normal}(z, K_q)

I could be wrong here though. Check this out: http://mc-stan.org/events/stancon2017-notebooks/stancon2017-trangucci-hierarchical-gps.pdf
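
In Stan, that structure would look roughly like this sketch (M, K_z, and K_q are assumed names, and the covariance matrices would be built from kernels as usual):

parameters {
  vector[M] z;       // shared, group-level GP
  vector[M] q[N];    // N individual GPs centered on z
}
model {
  z ~ multi_normal(rep_vector(0, M), K_z);
  for (i in 1:N)
    q[i] ~ multi_normal(z, K_q);
}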

I’m also wondering about the s2_e2 variables. That’s quite a few extra degrees of freedom, and even if they’re okay mathematically they can make a model like this really difficult to fit. It might be worth removing them while you’re developing the rest of the model and adding them back later if you want.

s2_e2 is a single parameter. I removed the cholesky factoring for readability as mentioned in the original post.


I think multi_normal_cholesky will give you a speed increase. Usually low n_eff is a result of poor specification, though.
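
Concretely, the swap is just this (a sketch, assuming you already build K somewhere):

matrix[N, N] L_K = cholesky_decompose(K);
// y ~ multi_normal(rep_vector(0, N), K);            // re-factorizes K internally
y ~ multi_normal_cholesky(rep_vector(0, N), L_K);    // reuses the factorization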

Let’s think about HMC for a second. We first use gradients to move to an area of high posterior mass with respect to the parameters in your model. If the sampler ends up exploring some tail of the distribution where, conditioned on your model, there is essentially no mass, it won’t be able to generate samples well, because, well, there’s nothing there.

Regarding GPs: we generate a covariance matrix with respect to parameters, and we approximate distributions for those parameters. Then we can generate predictive distributions from the kernel we have constructed. Having a covariance matrix doesn’t by itself imply a Gaussian process. You can look at the covariance between anything and define any covariance function, which could be meaningless.

Here, your covariance matrix is fixed: matrix[N,N] A. For a Gaussian process, one might instead define a covariance function like the one below. It is called the squared exponential kernel by the Stan people, and the radial basis function (RBF) by almost everyone else:

k(x, x') = \sigma^2 \exp\left(-\frac{d(x, x')^2}{2l^2}\right)

So at each iteration your covariance matrix is generated with respect to parameters: the magnitude \sigma^2 and the length scale l. The magnitude controls the vertical deviation (for a 1D GP), and the length scale controls how strongly points are related along a given input axis, i.e., how quickly the correlation between x_1 and x_2 decays with distance. I can’t see any sense in which the implementation above is doing this.
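
In Stan terms, “generated with respect to parameters” means the kernel matrix is rebuilt from the current draws of the hyperparameters at every iteration, rather than being passed in as data. A minimal sketch (names assumed):

model {
  // K depends on the parameters sigma_f (magnitude) and l (length scale),
  // so it changes every iteration, unlike a fixed data matrix A
  matrix[N, N] K = add_diag(gp_exp_quad_cov(x, sigma_f, l), 1e-9);
  y ~ multi_normal(rep_vector(0, N), K);
}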

Another note: prior distributions on GP hyperparameters are still an open problem. Are you sure this is a good idea? I mean, do whatever you want, but your computational issues could be indicating something. I think Andrew has a saying about that somewhere.

Also, deep GPs are “a thing”. Composite functions, essentially.

With that said, here’s a proper implementation of a Gaussian process regression, which can be found in Betancourt’s case studies (recommended), among other places.


functions {
  // Draw from the posterior predictive distribution of the latent GP at x_pred,
  // conditional on observations y1 at x and the kernel hyperparameters.
  vector gp_pred_rng(vector[] x_pred,
                     vector y1, vector[] x,
                     real magnitude, real length_scale,
                     real sigma) {
    int N = rows(y1);
    int N_pred = size(x_pred);
    vector[N_pred] f2;
    {
      matrix[N, N] K = add_diag(gp_exp_quad_cov(x, magnitude, length_scale),
                                square(sigma));  // observation noise on the diagonal
      matrix[N, N] L_K = cholesky_decompose(K);
      vector[N] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
      vector[N] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';
      matrix[N, N_pred] k_x_x_pred = gp_exp_quad_cov(x, x_pred, magnitude, length_scale);
      vector[N_pred] f2_mu = k_x_x_pred' * K_div_y1;                  // predictive mean
      matrix[N, N_pred] v_pred = mdivide_left_tri_low(L_K, k_x_x_pred);
      matrix[N_pred, N_pred] cov_f2 =
        gp_exp_quad_cov(x_pred, magnitude, length_scale) - v_pred' * v_pred
        + diag_matrix(rep_vector(1e-10, N_pred));                     // jitter for stability
      f2 = multi_normal_rng(f2_mu, cov_f2);                           // predictive draw
    }
    return f2;
  }
}
data {
  int<lower=1> N;
  int<lower=1> D;
  vector[D] x[N];
  vector[N] y;

  int<lower=1> N_pred;
  vector[D] x_pred[N_pred];
}
transformed data {
  vector[N] mu;
  mu = rep_vector(0, N);
}
parameters {
  real<lower=0> magnitude;
  real<lower=0> length_scale;
  real<lower=0> sigma;
}
transformed parameters {
  matrix[N, N] L_K;
  {
    matrix[N, N] K = gp_exp_quad_cov(x, magnitude, length_scale);
    K = add_diag(K, square(sigma));
    L_K = cholesky_decompose(K);
  }
}
model {
  magnitude ~ normal(0, 3);
  length_scale ~ inv_gamma(5, 5);
  sigma ~ normal(0, 1);
  
  y ~ multi_normal_cholesky(mu, L_K);
}
generated quantities {
  vector[N_pred] f_pred = gp_pred_rng(x_pred, y, x, magnitude, length_scale, sigma);
  vector[N_pred] y_pred;
  
  for (n in 1:N_pred) y_pred[n] = normal_rng(f_pred[n], sigma);
}

Also, here’s a shameless plug for GPs, one of my many unfinished projects:

Reading this should help with what you’re actually modeling a bit.

Hope this helps!

EDIT: on another note, I was playing with GPs with different noises the other day and HMC/Stan had a lot of trouble with identifiability and some divergences. Terrible posterior geometry.

idk bro good luck


Hey!
Sorry, do you mean, for example, a GP with noise that is not Gaussian?
Because I am trying to fit a model based on this logic (my random noise is Student-t), and although I think the model is right, the resulting plots do not correspond to what I expect.

Hi char, can you please post the model that you’re working with and the data, either real data or a simulated dataset? And the plots?

I tried to implement an algorithm in R which generates the target values with a bit of noise (based on an example in the paper “Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification”, Neal, 1997).

Target Distribution with adjusted deviations

f.x = function(x){
  # with prob. 0.95 draw a low-noise (sd 0.1) target, otherwise a high-noise (sd 1) one
  s = sample(c(1, 2), size = 1, replace = TRUE, prob = c(0.95, 0.05))
  ifelse(s == 1,
         0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1 + x^2) + rnorm(1, 0, 0.1),
         0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1 + x^2) + rnorm(1, 0, 1))
}
f.x = Vectorize(f.x)

#Generate train and test set

#set.seed(123)
x = seq(-3,3,length=200)
targets = f.x(x)
dataset = data.frame(cbind(x,targets))
train = (data.frame(cbind(train=dataset$x[1:100],targets=dataset$targets[1:100])))
test = (data.frame(cbind(test=dataset$x[101:200],targets=dataset$targets[101:200])))

I assumed Student-t distributed noise and my model looks like this:

functions{
  // radial basis, or square exponential, function
  matrix rbf(int Nx, vector x, int Ny, vector y, real eta, real rho){
    matrix[Nx,Ny] K;
    for(i in 1:Nx){
      for(j in 1:Ny){
        K[i,j] = eta * exp(-rho * pow(x[i] - y[j], 2));
      }
    }
    return K;
  }

  vector post_pred_rng(real dof, real s, real eta, real rho, real sn,
                       int No, vector xo, int Np, vector xp, vector yobs){
    matrix[No,No] Ko;
    matrix[Np,Np] Kp;
    matrix[No,Np] Kop;
    matrix[Np,No] Ko_inv_t;
    vector[Np] mu_p;
    matrix[Np,Np] Tau;
    matrix[Np,Np] L2;
    vector[Np] f_predic;
    vector[Np] z;
    // use of matrix multiplication for the sn noise
    Ko = rbf(No, xo, No, xo, eta, rho) + diag_matrix(rep_vector(1, No)) * sn;
    Kp = rbf(Np, xp, Np, xp, eta, rho);
    Kop = rbf(No, xo, Np, xp, eta, rho);
    Ko_inv_t = Kop' / Ko;
    mu_p = Ko_inv_t * yobs;
    Tau = Kp - Ko_inv_t * Kop + diag_matrix(rep_vector(1e-9, Np)); // jitter for stability
    L2 = cholesky_decompose(Tau);
    // draw an independent standard normal for each prediction point
    // (rep_vector(normal_rng(0,1), Np) would repeat a single draw Np times)
    for (i in 1:Np) z[i] = normal_rng(0, 1);
    f_predic = mu_p + L2 * z;
    return f_predic;
  }

}
data{
int<lower=1> N1;
int<lower=1> N2;
vector[N1] X;
vector[N1] Y;
vector[N2] Xp ;
int<lower=1, upper=N2> observed_idx[N1];
}
transformed data{
vector[N1] mu;
for(n in 1:N1) mu[n] = 0;
}
parameters{
real<lower=0> eta_sq;
real<lower=0> inv_rho_sq;
real<lower=5> dof; // Degrees of freedom for t
real<lower=0> s; // scale parametr for T
vector[N1] f_tilde; // standard normal draws; must match L_S, which is N1 x N1
}
transformed parameters{
real<lower=0> rho_sq;
rho_sq = inv(inv_rho_sq);

}
model{
matrix[N1,N1] Sigma;
matrix[N1,N1] L_S;
vector[N1] f_predict;
// chol dec
f_tilde ~ normal(0, 1);
Sigma = rbf(N1, X, N1, X, eta_sq, rho_sq) + diag_matrix(rep_vector(1, N1))*s;
L_S = cholesky_decompose(Sigma);
f_predict = L_S * f_tilde;
s ~ student_t(4,0,1);
dof ~ normal(4.5,0.1);
Y ~ student_t(dof,f_predict, s);
eta_sq ~ normal(0,1);
inv_rho_sq ~ cauchy(0,5);
}
generated quantities{
vector[N2] f_pred;
vector[N2] Ypred;
f_pred = post_pred_rng(dof,s,eta_sq, rho_sq, s,N1, X, N2, Xp, Y);
for (n in 1:N2)
Ypred[n] = student_t_rng(dof, f_pred[n], s);

}

This is the resulting plot for the Student-t noise model:

I’m not a developer, but a few things: we have our own RBF called cov_exp_quad, so you can just use that instead of implementing your own. You can then copy-paste the posterior predictive from Betancourt’s case studies.

What are you plotting here? Which parameter? Ypred? The noise in the GP here is your f_tilde parameter.

I really haven’t tried your simulation, but I’m not seeing anything wrong. With the Student-t likelihood we’re allowing heavier tails than with a Gaussian likelihood. You can do a quick simulation to compare the two likelihoods.
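
In your model, the only line that changes between the two observation models is the likelihood statement (parameter names as in your code):

// Y ~ normal(f_predict, s);         // Gaussian observation noise
Y ~ student_t(dof, f_predict, s);    // heavier-tailed Student-t observation noise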

Remember, we’re doing p(\theta | x) \propto p(x | \theta) p(\theta). Some people call p(x | \theta) the noise/observation model, not to be confused with “noise” in a linear model or whatever other context.


Thanks for the reply!
So, in the end, when I use student_t_rng to get the predictions, it’s like generating samples from the joint posterior, which is Student-t and has the transformed mean f_pred, right? I am asking because I can’t see clearly how HMC is being performed internally.

@char whoops, I’m wrong… the noise would be the part where you add diag_matrix. We sometimes define noise as \sim normal(0, \sigma_i^2), but you’ve defined a Student-t noise, and it’s not independent; it’s the same for each observation, I believe. Changing this to be i.i.d. could help with model fit.

@Kon7 the end result is the posterior predictive. I think in BDA3 they use a tilde to represent the posterior predictive. I’m not sure what you mean by transformed mean. And no, it’s not clear; we hide the sampler from you (a lot of software hides implementations, and this bothers me too). It lives in the stan-dev/stan repo if you want to sift through the code. It’s an adaptive HMC that was, I think, mostly Betancourt’s brainchild.
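
For the record, the posterior predictive that BDA3 writes with a tilde is

p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta,

and drawing \theta^{(s)} from the posterior and then \tilde{y}^{(s)} from p(\tilde{y} \mid \theta^{(s)}) in generated quantities (which is what the _rng calls do) gives you draws from exactly this distribution.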

@char

something like

Sigma = rbf(N1, X, N1, X, eta_sq, rho_sq) + diag_matrix(rep_vector(s, N1));

where s \sim normal(0, \sigma_s^2), the same s for all observations. This is an assumed measurement error. If the covariance matrix is taking up too much memory and you’d like to take it out of scope, throw curly brackets around it:

{
  Sigma = ...
}

but really:

{
   Sigma = cov_exp_quad(x, eta, rho)
}

but eta and rho shouldn’t be squared, as that’s done implicitly in cov_exp_quad, which is implemented as:

k(x, x') = \eta^2 \exp\left(-\frac{d(x, x')^2}{2\rho^2}\right).

No need to do the transformation. Just please check that I didn’t mix up the length scale and the marginal SD/magnitude parameters.
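
Putting that together, the covariance construction would look something like this sketch (eta and rho are the unsquared marginal SD and length scale; X is converted to an array because cov_exp_quad expects one):

Sigma = cov_exp_quad(to_array_1d(X), eta, rho)
        + diag_matrix(rep_vector(s, N1));
L_S = cholesky_decompose(Sigma);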

@anon79882417 the part I cannot understand is how student_t_rng(dof, f_pred[n], s) is supposed to be the posterior predictive. I mean, there is a GP prior and a Student-t likelihood, right?
So how do we get that the posterior predictive is again a Student-t distribution?
Because I read that there are different kinds of approximations for GPs with non-Gaussian likelihoods, and I cannot see where (or whether) they appear in the code…
I mean, in the end it seems that it is sampling from the posterior predictive, which appears to be Student-t, although the posterior is not conjugate.

Yep, just about. The GP prior is on the mean parameter, whose hyperparameters are the magnitude and the length scale.

Sure, so consider this. You’re modeling a binary outcome. What’s the likelihood function? We’ll use a Bernoulli. When we want to do predictions out of sample, we’re not going to switch to a gamma distribution; that wouldn’t make sense. We’re still doing predictions for the binary outcome, so we’d use the Bernoulli distribution again. I’d think of the likelihood function as the shape your data takes. If it’s heavily zero-inflated, we might consider a Tweedie. To choose priors, we need to take into account how the parameters interact with the likelihood, and this changes by distribution. For example, consider the Tweedie: there’s a mean parameter and a dispersion parameter, which, well, control the mean and the dispersion. We can set informative priors to make the model more realistic.
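
As a sketch of that point with a latent GP (hypothetical names; L_K would be the Cholesky factor of the kernel built from the magnitude and length scale):

transformed parameters {
  vector[N] f = L_K * f_tilde;              // latent GP values (non-centered)
}
model {
  f_tilde ~ normal(0, 1);
  y ~ bernoulli_logit(f);                   // Bernoulli likelihood for binary y
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = bernoulli_logit_rng(f[n]);   // predictions use the same Bernoulli
}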

You’re right, it’s not conjugate. Conjugate means the prior is in the same family of distributions as the posterior, so the posterior can be derived analytically. With a Student-t or other non-Gaussian likelihood we need a sampler that can draw from analytically intractable posteriors, like Metropolis-Hastings (or, our favourite, adaptive HMC).

EDIT: there were some typos, I didn’t check for accuracy. I corrected the parameter names in the first response.


Thank you for the response.
To see if I understood correctly: the internal algorithm of Stan (in the case above) takes samples directly from the ‘likelihood’ for unobserved cases, while at the same time updating the posterior of the hyperparameters?
So there is no need to write down any posterior approximation for non-conjugate cases?

Yep, exactly. Have you studied the Metropolis-Hastings algorithm? The gist: we use, say, a Gaussian proposal distribution and take the ratio of the posterior density at the proposal to that at the last draw. If the proposal is more likely under the joint posterior, we accept it; if not, we accept it with probability equal to that ratio, and otherwise keep the current draw. The resulting chain converges (asymptotically) to the joint posterior, even in an N-dimensional space. With HMC, we add a couple of steps: gradients and a leapfrog integrator. There’s a great paper called “A Conceptual Introduction to Hamiltonian Monte Carlo”, but I prefer to look at the algorithm written out, or at code for a simple model. All a matter of preference.
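
For reference, with a symmetric (e.g. Gaussian) proposal the Metropolis acceptance step is

\alpha(\theta' \mid \theta) = \min\left(1, \frac{p(x \mid \theta')\, p(\theta')}{p(x \mid \theta)\, p(\theta)}\right),

i.e., accept \theta' with probability \alpha and otherwise keep the current \theta; HMC replaces the random-walk proposal with one obtained by simulating Hamiltonian dynamics with the leapfrog integrator.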

Perfect, thanks!