Gaussian process with hierarchical Gaussian process prior

I think multi_normal_cholesky will give you a speed increase. Usually low NeFF is a result of poor specification, though.

Let’s think about HMC for a second. We’re first using gradients to go to an area of high posterior mass WRT the parameters in your model. We could just be searching some tail of the distribution, that, conditioned on your model, is not able to generate samples well, because, well, there’s nothing there.

Regarding GPs - we generate a covariance matrix with respect to parameters, and we approximate distributions for those parameters. Then, we can generate predictive distributions for the for the kernel we have generated. Because you have a covariance matrix, doesn’t always imply gaussian process. You can really look at covariance between anything and define any covariance function, which could be meaningless.

Here, your covariance matrix is fixed: matrix[N,N] A; For a gaussian process, one might define a covariance function like so. This is called the Squared Exponential kernel by the Stan people, and the Radial Basis Function (RBF) by almost everyone else:

k(x, x') = \sigma^2 exp(\frac{d(x, x')}{2l^2})

So at each iteration your covariance matrix is generated WRT parameters, the magnitude sigma^2 and the length scale l. The magnitude controls deviation vertically (for a 1D) GP, and the length scale controls how much each parameter is related “along a certain axis” meaning, x1 and x2. This implementation isn’t really correct I can see in any way.

Another note, prior distributions on the hyperparameters for GPs is still an open problem. Are you sure this is a good idea? I mean, do whatever you want, but your computational issues could be indicating something. I think Andrew has some saying for that somewhere.

Also, deep GPs are “a thing”. Composite functions, essentially.

With that said, here’s a proper implementation of a gaussian process regression, which can be found in betancourt’s case studies (recommended), among other places.


functions {
  vector gp_pred_rng(vector[] x_pred,
                     vector y1, vector[] x,
                     real magnitude, real length_scale,
                     real sigma) {
    int N = rows(y1);
    int N_pred = size(x_pred);
    vector[N_pred] f2;
    {
      matrix[N, N] K = add_diag(gp_exp_quad_cov(x, magnitude, length_scale),
                                sigma);
      matrix[N, N] L_K = cholesky_decompose(K);
      vector[N] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
      vector[N] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';
      matrix[N, N_pred] k_x_x_pred = gp_exp_quad_cov(x, x_pred, magnitude, length_scale);
      f2 = (k_x_x_pred' * K_div_y1);
    }
    return f2;
  } 
}
data {
  int<lower=1> N;
  int<lower=1> D;
  vector[D] x[N];
  vector[N] y;

  int<lower=1> N_pred;
  vector[D] x_pred[N_pred];
}
transformed data {
  vector[N] mu;
  mu = rep_vector(0, N);
}
parameters {
  real<lower=0> magnitude;
  real<lower=0> length_scale;
  real<lower=0> sig;
  
  real<lower=0> sigma;
}
transformed parameters {
  matrix[N, N] L_K;
  {
    matrix[N, N] K = gp_exp_quad_cov(x, magnitude, length_scale);
    K = add_diag(K, square(sigma));
    L_K = cholesky_decompose(K);
  }
}
model {
  magnitude ~ normal(0, 3);
  length_scale ~ inv_gamma(5, 5);
  sig ~ normal(0, 1);
  
  sigma ~ normal(0, 1);
  
  y ~ multi_normal_cholesky(mu, L_K);
}
generated quantities {
  vector[N_pred] f_pred = gp_pred_rng(x_pred, y, x, magnitude, length_scale, sigma);
  vector[N_pred] y_pred;
  
  for (n in 1:N_pred) y_pred[n] = normal_rng(f_pred[n], sigma);
}

Also, here’s a shameless plug for GPs, one of my many unfinished projects:

Reading this should help with what you’re actually modeling a bit.

Hope this helps!

EDIT: on another note, I was playing with GPs with different noises the other day and HMC/Stan had a lot of trouble with identifiability and some divergences. Terrible posterior geometry.

idk bro good luck

2 Likes