Gaussian Process on HPC | Issues and speeding up

Hello, I'm working on a Bayesian calibration problem for a computer experiment. I'm using a Gaussian process as an emulator of the computer model, and I'm training and calibrating the emulator at the same time as per Higdon et al. (2004). I'm running the calibration on a High Performance Computing (HPC) facility as a shared-memory threaded parallel job, with 4 chains of 1000 iterations each. For fewer than 600 training points (N) the model works fine, with no issues (see the attached plot of computational time in minutes vs N) as judged by Rhat and n_eff > 10%. When I increased N by a further 140 points (728 in total), with 16 GB of RAM allocated per core, one chain took 22 hours to complete and none of the other chains finished their 1000 iterations; two of them were stuck at 250 iterations. I don't understand what is happening, so any help would be greatly appreciated! Specifically:

  1. As the number of training points (N) increases, I expect computational time to grow as O(N^3) and memory as O(N^2). I assumed 16 GB of RAM would suffice; could I be running out of RAM, or is something else wrong? Does memory grow with the number of iterations as well as with N?
  2. I have not added a small constant (a nugget/jitter) to the diagonal of the covariance matrix, as is often recommended, because I based my code on Chong et al. (2018), which does not include it. I'm more than happy to add it (a minimal sketch is shown right after this list), but do you think its absence could be the reason I'm facing these issues?
  3. Do you have any suggestions on how to speed up the calibration, regardless of the problem described above? I have seen a few suggestions in other topics, but I don't know whether they apply to this specific problem.
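
For point 2, adding the jitter would be a small change at the end of the model block. A minimal sketch, using 1e-8 as an arbitrary value (not taken from Chong et al. (2018)):

// add a small "nugget"/jitter to the diagonal of sigma_z so that
// cholesky_decompose does not fail when sigma_z is close to singular
for (i in 1:N) {
  sigma_z[i, i] = sigma_z[i, i] + 1e-8; // jitter; value chosen arbitrarily
}
L = cholesky_decompose(sigma_z);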

[Figure: computational time (mins) vs number of training points N]

data {
  int<lower=0> n; // number of field data
  int<lower=0> m; // number of computer simulations
  int<lower=0> p; // number of observable inputs x
  int<lower=0> q; // number of calibration parameters t
  vector[n] y; // field observations
  vector[m] eta; // output of computer simulations
  matrix[n, p] xf; // observable inputs corresponding to y
  // (xc, tc): design points corresponding to eta
  matrix[m, p] xc; 
  matrix[m, q] tc; 
}

// Need to combine y (observations) and eta (simulations) 
// into a single vector to establish statistical relation
// Chong & Menberg (2018), Higdon et al. (2004)
transformed data {
  int<lower=1> N;
  vector[n+m] z; // z = [y, eta]
  vector[n+m] mu; // mean vector

  N = n + m;
  // set mean vector to zero
  for (i in 1:N) {
    mu[i] = 0;
  }
  z = append_row(y, eta); 
}

parameters {
  // tf: calibration parameters
  // rho_eta: reparameterization of beta_eta (the correlation parameters of the simulator)
  // rho_delta: reparameterization of beta_delta (the correlation parameter of the model discrepancy)
  // lambda_eta: precision parameter for eta 
  // lambda_delta: precision parameter for bias term
  // lambda_e: precision parameter of observation error
  row_vector<lower=0,upper=1>[q] tf; 
  row_vector<lower=0,upper=1>[p+q] rho_eta;
  row_vector<lower=0,upper=1>[p] rho_delta;
  real<lower=0> lambda_eta; 
  real<lower=0> lambda_delta; 
  real<lower=0> lambda_e; 
}

transformed parameters {
  // beta_delta: correlation parameter for bias term
  // beta_eta: correlation parameters of the simulator
  row_vector[p+q] beta_eta;
  row_vector[p] beta_delta;
  beta_eta = -4.0 * log(rho_eta);
  beta_delta = -4.0 * log(rho_delta);
}

// Create GP model based on the definitions from above
model {
  // declare variables
  matrix[N, (p+q)] xt;
  matrix[N, N] sigma_eta; // simulator covariance
  matrix[n, n] sigma_delta; // bias term covariance
  matrix[N, N] sigma_z; // covariance matrix
  matrix[N, N] L; // cholesky decomposition of covariance matrix 
  row_vector[p] temp_delta;
  row_vector[p+q] temp_eta;

  // field observation (xf) and calibration variables (tf)
  // are placed together in a matrix with the
  // computer observations (xc) and calibration variables (tc)
  // xt = [[xf,tf],[xc,tc]]
  xt[1:n, 1:p] = xf; // field observations
  xt[(n+1):N, 1:p] = xc; // computer observations (assumed to be the same as xf)
  xt[1:n, (p+1):(p+q)] = rep_matrix(tf, n);
  xt[(n+1):N, (p+1):(p+q)] = tc; // computer calibration variables

  // diagonal elements of sigma_eta
  sigma_eta = diag_matrix(rep_vector((1 / lambda_eta), N));
 
  // off-diagonal elements of sigma_eta
  // for the squared covariance (alpha = 2)
  // xt[i] is row i and xt[j] is row j
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      temp_eta = xt[i] - xt[j]; // difference between rows i and j
      sigma_eta[i, j] = beta_eta .* temp_eta * temp_eta'; // sum_k beta_k * diff_k^2
      sigma_eta[i, j] = exp(-sigma_eta[i, j]) / lambda_eta;
      sigma_eta[j, i] = sigma_eta[i, j];
    }
  }

  // diagonal elements of sigma_delta
  sigma_delta = diag_matrix(rep_vector((1 / lambda_delta), n));

  // off-diagonal elements of sigma_delta
  for (i in 1:(n-1)) {
    for (j in (i+1):n) {
      temp_delta = xf[i] - xf[j];
      sigma_delta[i, j] = beta_delta .* temp_delta * temp_delta';
      sigma_delta[i, j] = exp(-sigma_delta[i, j]) / lambda_delta;
      sigma_delta[j, i] = sigma_delta[i, j];
    }   
  }

  // computation of covariance matrix sigma_z 
  sigma_z = sigma_eta;
  sigma_z[1:n, 1:n] = sigma_eta[1:n, 1:n] + sigma_delta;

  // add observation errors
  for (i in 1:n) {
    sigma_z[i, i] = sigma_z[i, i] + (1.0 / lambda_e);
  }  
  
  //print(sigma_z)
  
  // Specify hyperparameters here - based on Chong et al.(2018)
  rho_eta[1:(p+q)] ~ beta(1.0, 0.3);
  rho_delta[1:p] ~ beta(1.0, 0.3);
  lambda_eta ~ gamma(10, 10); // gamma (shape, rate)
  lambda_delta ~ gamma(10, 0.3); 
  lambda_e ~ gamma(10, 0.03); 

  // Specify priors here - these have a physical meaning in the computer software being calibrated
  tf[1] ~ weibull(2.724,0.417); 
  tf[2] ~ uniform(1.737e-05,1); 
  tf[3] ~ normal(0.472,0.140);
  tf[4] ~ gamma(6.90965275109803,19.158); 
  tf[5] ~ lognormal(-1.983,0.640);
  
  L = cholesky_decompose(sigma_z); // cholesky decomposition
  z ~ multi_normal_cholesky(mu, L);
}

Edit
One of my simulations (with 750 iterations) just finished, and I noticed that one of the chains did not mix at all, as can be seen below:

[Figure: trace plot showing one chain that did not mix]

From the attached pairs plot, I'm thinking that there could be something funny going on with beta_eta1 and beta_eta6, although I might be wrong!

Any insights on how to best tackle this? Is this related to the model definition and specifically the priors used?


The direct way of implementing a GP, building the covariance matrix in Stan code, makes the autodiff tree have O(n^2) nodes, and it's likely that you are running out of memory. At the moment you have two options to reduce the memory use: use the built-in cov_exp_quad, which has a specialized gradient and a much smaller autodiff footprint, or use a reduced-rank basis-function approximation of the GP (which is practical mainly in low dimensions).

There are some changes coming in the future to Stan that will make GPs faster, but it's difficult to predict when all the necessary pieces will be in place.


Thanks for your reply! I've tried the built-in cov_exp_quad function and tested its performance on a small dataset. I found the computational time to be longer with the built-in function than with my previous formulation. I'm not sure why, but I suspect it's because I have to call it multiple times to handle multiple length scales (please see my function definition below for more detail). If there is a more efficient way to use this function with multiple length scales, please let me know.

My problem has more than 3 dimensions, so I don't think I can use the paper suggested. However, I think I might be able to make use of the Kronecker product approach described here: https://sethrf.com/files/fast-hierarchical-GPs.pdf Since that paper is from 2017, is there a built-in Kronecker product implementation for Gaussian processes that differs from it? I couldn't find one, but I thought I would check.

Computational Time for previous implementation:

        warmup sample
chain:1 87.015 49.680
chain:2 92.540 48.463

Computational Time for built-in function:

         warmup sample
chain:1 133.363 75.504
chain:2 141.954 73.151

functions {
  real c_alpha(real lambda){
    real alpha_dem = sqrt(lambda);
    real alpha = 1 / alpha_dem;
    return alpha;
  }

  real c_rho(real beta){
    real beta_dem = sqrt(2.0 * beta);
    real rho = 1 / beta_dem;
    return rho;
  }
  
  matrix c_sigma7(matrix X, real lambda, row_vector beta, int N){
    
    matrix[N, N] sigma;
    
    sigma = cov_exp_quad(to_array_1d(X[,1]), c_alpha(lambda), c_rho(beta[1]))
    .* cov_exp_quad(to_array_1d(X[,2]), 1, c_rho(beta[2]))
    .* cov_exp_quad(to_array_1d(X[,3]), 1, c_rho(beta[3]))
    .* cov_exp_quad(to_array_1d(X[,4]), 1, c_rho(beta[4]))
    .* cov_exp_quad(to_array_1d(X[,5]), 1, c_rho(beta[5]))
    .* cov_exp_quad(to_array_1d(X[,6]), 1, c_rho(beta[6]))
    .* cov_exp_quad(to_array_1d(X[,7]), 1, c_rho(beta[7]));
    
    return(sigma);
  }
  
  matrix c_sigma2(matrix X, real lambda, row_vector beta, int N){
    
    matrix[N, N] sigma;
    
    sigma = cov_exp_quad(to_array_1d(X[,1]), c_alpha(lambda), c_rho(beta[1]))
    .* cov_exp_quad(to_array_1d(X[,2]), 1, c_rho(beta[2]));
    
    return(sigma);
  }
}

data {
  int<lower=0> n; // number of field data
  int<lower=0> m; // number of computer simulations
  int<lower=0> p; // number of observable inputs x
  int<lower=0> q; // number of calibration parameters t
  vector[n] y; // field observations
  vector[m] eta; // output of computer simulations
  matrix[n, p] xf; // observable inputs corresponding to y
  // (xc, tc): design points corresponding to eta
  matrix[m, p] xc; 
  matrix[m, q] tc; 
}

// Need to combine y (observations) and eta (simulations) 
// into a single vector to establish statistical relation
// Chong & Menberg (2018), Higdon et al. (2004)
transformed data {
  int<lower=1> N;
  vector[n+m] z; // z = [y, eta]
  vector[n+m] mu; // mean vector

  N = n + m;
  // set mean vector to zero
  for (i in 1:N) {
    mu[i] = 0;
  }
  z = append_row(y, eta); 
}

parameters {
  // tf: calibration parameters
  // rho_eta: reparameterization of beta_eta (the correlation parameters of the simulator)
  // rho_delta: reparameterization of beta_delta (the correlation parameter of the model discrepancy)
  // lambda_eta: precision parameter for eta 
  // lambda_delta: precision parameter for bias term
  // lambda_e: precision parameter of observation error
  row_vector<lower=0,upper=1>[q] tf; 
  row_vector<lower=0,upper=1>[p+q] rho_eta;
  row_vector<lower=0,upper=1>[p] rho_delta;
  real<lower=0> lambda_eta; 
  real<lower=0> lambda_delta; 
  real<lower=0> lambda_e; 
}

transformed parameters {
  // beta_delta: correlation parameter for bias term
  // beta_eta: correlation parameters of the simulator
  row_vector[p+q] beta_eta;
  row_vector[p] beta_delta;
  beta_eta = -4.0 * log(rho_eta);
  beta_delta = -4.0 * log(rho_delta);
}

// Create GP model based on the definitions from above
model {
  // declare variables
  matrix[N, (p+q)] xt;
  matrix[N, N] sigma_eta; // simulator covariance
  matrix[n, n] sigma_delta; // bias term covariance
  matrix[N, N] sigma_z; // covariance matrix
  matrix[N, N] L; // cholesky decomposition of covariance matrix 
  row_vector[p] temp_delta;
  row_vector[p+q] temp_eta;

  // field observation (xf) and calibration variables (tf)
  // are placed together in a matrix with the
  // computer observations (xc) and calibration variables (tc)
  // xt = [[xf,tf],[xc,tc]]
  xt[1:n, 1:p] = xf; // field observations
  xt[(n+1):N, 1:p] = xc; // computer observations (assumed to be the same as xf)
  xt[1:n, (p+1):(p+q)] = rep_matrix(tf, n);
  xt[(n+1):N, (p+1):(p+q)] = tc; // computer calibration variables

  // computation of covariance matrix sigma_eta
  sigma_eta = c_sigma7(xt, lambda_eta, beta_eta, N);
  
  // computation of covariance matrix sigma_delta (bias)
  sigma_delta = c_sigma2(xf, lambda_delta, beta_delta, n);

  //for (j in 1:2){
  //  print("u[", j, "] = ", sigma_delta[j]);
  //}
  
  // computation of covariance matrix sigma_z 
  sigma_z = sigma_eta;
  sigma_z[1:n, 1:n] = sigma_eta[1:n, 1:n] + sigma_delta;

  // add observation errors
  for (i in 1:n) {
    sigma_z[i, i] = sigma_z[i, i] + (1.0 / lambda_e);
  }  
  
  //print(sigma_z)
  
  // Specify hyperparameters here
  rho_eta[1:(p+q)] ~ beta(1.0, 0.3);
  rho_delta[1:p] ~ beta(1.0, 0.3);
  lambda_eta ~ gamma(10, 10); // gamma (shape, rate)
  lambda_delta ~ gamma(10, 0.3); 
  lambda_e ~ gamma(10, 0.03); 

  // Specify priors here
  tf[1] ~ weibull(2.724,0.417); 
  tf[2] ~ uniform(1.737e-05,1); 
  tf[3] ~ normal(0.472,0.140);
  tf[4] ~ gamma(6.90965275109803,19.158); 
  tf[5] ~ lognormal(-1.983,0.640);
  
  L = cholesky_decompose(sigma_z); // cholesky decomposition
  z ~ multi_normal_cholesky(mu, L);
}

Yes, it’s the multiple calls and multiple matrix pointwise multiplications that are now making the autodiff tree big and slow.

@bbbales2 can you remind us about the status of vector length scale for cov_exp_quad?

If that's not in yet (at least the documentation doesn't show it), you would get better performance by scaling the columns of X first using diag_post_multiply and then calling cov_exp_quad once.
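
Roughly something like the sketch below (untested; the sqrt(beta_eta) scaling together with alpha = inv_sqrt(lambda_eta) and rho = inv_sqrt(2) is chosen so that the single call reproduces the exp(-sum_k beta_k * d_k^2) / lambda_eta kernel of your original model, and it assumes the array-of-row_vectors signature of cov_exp_quad is available in your Stan version):

{
  // scale each input dimension by sqrt(beta_k); the squared Euclidean
  // distance between scaled rows is then sum_k beta_k * d_k^2
  matrix[N, p + q] xt_scaled = diag_post_multiply(xt, sqrt(beta_eta));
  row_vector[p + q] xt_rows[N];
  for (i in 1:N)
    xt_rows[i] = xt_scaled[i];
  // alpha^2 = 1 / lambda_eta and 2 * rho^2 = 1 recover the original kernel
  sigma_eta = cov_exp_quad(xt_rows, inv_sqrt(lambda_eta), inv_sqrt(2));
}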

It’s possible that in this approach the autodiff part will again be dominating the computation time.

No.


Non-existent, I’m afraid.

Yes do this. Something like:

diag_post_multiply(X, inv(beta))

I think inv is vectorized.

Edit: updated eq
Edit2: unupdated eq. Had it right the first time whoops


Thank you both for the help.

After looking at the definition of cov_exp_quad, I can't see a version that can be applied to a matrix. In my case, X is an [m, p] matrix, so I think I would still need to use cov_exp_quad: after applying diag_post_multiply(X, inv(beta)), the output is still a matrix, which as far as I can tell cov_exp_quad can't handle. I'm probably missing something obvious here, but how could I deal with that?

Oh ouch we never doc’d the functions that do this apparently.

I think something like this should work, might have to do some debugging:

matrix c_sigma7(matrix X, real lambda, row_vector beta, int N) {
  matrix[N, cols(X)] scaled_X = diag_post_multiply(X, inv(beta));
  matrix[N, N] Sigma;

  for(i in 1:N) {
    Sigma[i, i] = -0.5 * square(lambda);
    for(j in (i + 1):N) {
      Sigma[i, j] = -0.5 * squared_distance(scaled_X[i,], scaled_X[j,]);
      Sigma[j, i] = Sigma[i, j];
    }
  }

  return lambda * Sigma;
}

(Edit: forgot lambda)


Many thanks for clarifying! After a couple of changes (e.g. taking the exponential of the squared distance) I got it to work. I compared it against my original attempt (original post) and it's faster! I've added the results below for reference.

If it's not too much trouble, could you briefly explain why this has led to a decrease in computational cost, especially in the warmup, even though it relies on a for loop?

In addition, I’m using get_elapsed_time to compare the computational time. Is there something similar for memory use?

New Function:

matrix c_sigma(matrix X, real lambda, row_vector beta, int N, int p, int q) {
  // scale each column of X by the corresponding element of square(beta),
  // then build the covariance with one squared_distance call per pair of rows
  matrix[N, (p+q)] scaled_X = diag_post_multiply(X, square(beta));
  matrix[N, N] Sigma;
  real lambda_inv = 1.0 / lambda;

  for(i in 1:N) {
    Sigma[i, i] = lambda_inv;
    for(j in (i + 1):N) {
      Sigma[i, j] = exp(-squared_distance(scaled_X[i,], scaled_X[j,])) * lambda_inv;
      Sigma[j, i] = Sigma[i, j];
    }
  }

  return Sigma;
}
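
For reference, a sketch of how this single function can then be called in the model block (the q = 0 argument in the second call is an assumption to make p + q match the number of columns of xf; it is not shown in the script above):

sigma_eta = c_sigma(xt, lambda_eta, beta_eta, N, p, q);       // simulator covariance
sigma_delta = c_sigma(xf, lambda_delta, beta_delta, n, p, 0); // bias covariance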

Original Script:

         warmup sample
chain:1 121.344 67.152
chain:2 128.246 68.578

New Script (using diag_post_multiply):

        warmup sample
chain:1 72.105 41.812
chain:2 70.929 40.432

Hahaha, I guess this is why debugging code is important.

In terms of memory, the way to think about how automatic differentiation works in Stan is that an extra little piece of information is saved for every mathematical operation. That means every time you add two scalars together (a + b), a record of that operation is saved. For bigger operations (like the squared_distance operator), we've coded things up so that this information is small, but it's still something.

If you're building an NxN matrix, that's O(N^2) of these things. If you build 7 such O(N^2) matrices, that'll be 7x the cost. In this rewrite we put the 7x bit inside squared_distance, which uses less memory than breaking it into its individual operations (sum(square(x_scaled[i, ] - x_scaled[j, ]))). So that's the trick.
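
As a rough illustration, inside the i, j loop of the function above:

// both lines compute the same value, but the first records essentially one
// node on the autodiff stack (plus the data needed for its gradient), while
// the second records one node per elementwise subtraction, square and sum
real d1 = squared_distance(scaled_X[i], scaled_X[j]);
real d2 = sum(square(scaled_X[i] - scaled_X[j]));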

Anyway this back of the envelope stuff is all very inexact. @rok_cesnovar is working on some tools to debug this.


Hmm, it’s there in C++ and I think it’s in the language too, just not doc’d. Your comments on the docs (https://github.com/stan-dev/docs/pull/272) are good, I just haven’t gotten around to following up yet.

I thought about this but I’m just hesitant to tell people about functions that aren’t doc’d yet. Too confusing. It will hopefully soon be fixed.

Although inexact, it was useful to read - thank you for taking the time to explain this to me!

Thanks for letting me know!


Oh yeah hey, follow up: I talked to someone and realized my comment "Non-existent, I'm afraid" sounds really dismissive of the work you did. I didn't mean it that way, my apologies (at some point @avehtari had asked me to look at a multiple length scale thing and I never did it; that's what I was really thinking about). We appreciate your contributions.
