# Gaussian Process on HPC | Issues and speeding up

Hello, I’m working on a Bayesian calibration problem of a computer experiment. To do so, I’m using a Gaussian Process as an emulator to a computer model and I’m training and calibrating the emulator at the same time as per Higdon et al. (2004). I’m running the calibration on a High Performance Computing (HPC) facility as a shared memory threaded parallel job. I run 4 chains of 1000 iterations. For training points (N) less than 600, the model works fine with no issues (see image attached of Comp. Time (mins) vs N) as judged by R and nEff > 10%. When I tried to further increase N by 140 points (total of 728), and with an allocated RAM per core of 16 GB, it took 22 hours for one chain to complete and none of the other chains finished their 1000 iterations – two of them were stuck at 250 iterations. I don’t understand what is happening, so any help would be greatly appreciated! Specifically:

1. As training points (N) increase, I expect an increase in computational time of N^3 and memory requirement of N^2. I assumed that 16GB of RAM would suffice - could I be running out of RAM or is there something else that’s wrong? Does memory grow with iterations as well as N?
2. A small constant was not added to the diagonal of the covariance matrix as it’s often recommended since I’ve based my code from a publication by Chong et al. (2018) that didn’t include this component. I’m more than happy to add it but do you think that this could be the reason I’m facing these issues?
3. Do you have any suggestions on how to speed up the calibration? This would be regardless of the problem I’ve described above. I have seen a couple of suggestions on some different topics although I don’t know whether they would apply on this specific problem.

data {
int<lower=0> n; // number of field data
int<lower=0> m; // number of computer simulation
int<lower=0> p; // number of observable inputs x
int<lower=0> q; // number of calibration parameters t
vector[n] y; // field observations
vector[m] eta; // output of computer simulations
matrix[n, p] xf; // observable inputs corresponding to y
// (xc, tc): design points corresponding to eta
matrix[m, p] xc;
matrix[m, q] tc;
}

// Need to combine y (observations) and eta (simulations)
// into a single vector to establish statistical relation
// Chong & Menberg (2018), Hidgon et al. (2004)
transformed data {
int<lower=1> N;
vector[n+m] z; // z = [y, eta]
vector[n+m] mu; // mean vector

N = n + m;
// set mean vector to zero
for (i in 1:N) {
mu[i] = 0;
}
z = append_row(y, eta);
}

parameters {
// tf: calibration parameters
// rho_eta: reparameterization of beta_eta (the correlation parameters of the simulator)
// rho_delta: reparameterization of beta_delta (the correlation parameter of the model discrepancy)
// lambda_eta: precision parameter for eta
// lambda_delta: precision parameter for bias term
// lambda_e: precision parameter of observation error
row_vector<lower=0,upper=1>[q] tf;
row_vector<lower=0,upper=1>[p+q] rho_eta;
row_vector<lower=0,upper=1>[p] rho_delta;
real<lower=0> lambda_eta;
real<lower=0> lambda_delta;
real<lower=0> lambda_e;
}

transformed parameters {
// beta_delta: correlation parameter for bias term
// beta_eta: correlation parameter of observation error
row_vector[p+q] beta_eta;
row_vector[p] beta_delta;
beta_eta = -4.0 * log(rho_eta);
beta_delta = -4.0 * log(rho_delta);
}

// Create GP model based on the definitions from above
model {
// declare variables
matrix[N, (p+q)] xt;
matrix[N, N] sigma_eta; // simulator covariance
matrix[n, n] sigma_delta; // bias term covariance
matrix[N, N] sigma_z; // covariance matrix
matrix[N, N] L; // cholesky decomposition of covariance matrix
row_vector[p] temp_delta;
row_vector[p+q] temp_eta;

// field observation (xf) and calibration variables (tf)
// are placed together in a matrix with the
// computer observation (xc) and calibration variables (xc)
// xt = [[xf,tf],[xc,tc]]
xt[1:n, 1:p] = xf; // field observations
xt[(n+1):N, 1:p] = xc; // computer observations (assume to be the same as xf)
xt[1:n, (p+1):(p+q)] = rep_matrix(tf, n);
xt[(n+1):N, (p+1):(p+q)] = tc; // computer calibration variables

// diagonal elements of sigma_eta
sigma_eta = diag_matrix(rep_vector((1 / lambda_eta), N));

// off-diagonal elements of sigma_eta
// for the squared covariance (alpha = 2)
// xt[i] is row i and xt[j] is row j
for (i in 1:(N-1)) {
for (j in (i+1):N) {
temp_eta = xt[i] - xt[j]; # Subtract row i from row j
sigma_eta[i, j] = beta_eta .* temp_eta * temp_eta'; #
sigma_eta[i, j] = exp(-sigma_eta[i, j]) / lambda_eta;
sigma_eta[j, i] = sigma_eta[i, j];
}
}

// diagonal elements of sigma_delta
sigma_delta = diag_matrix(rep_vector((1 / lambda_delta), n));

// off-diagonal elements of sigma_delta
for (i in 1:(n-1)) {
for (j in (i+1):n) {
temp_delta = xf[i] - xf[j];
sigma_delta[i, j] = beta_delta .* temp_delta * temp_delta';
sigma_delta[i, j] = exp(-sigma_delta[i, j]) / lambda_delta;
sigma_delta[j, i] = sigma_delta[i, j];
}
}

// computation of covariance matrix sigma_z
sigma_z = sigma_eta;
sigma_z[1:n, 1:n] = sigma_eta[1:n, 1:n] + sigma_delta;

for (i in 1:n) {
sigma_z[i, i] = sigma_z[i, i] + (1.0 / lambda_e);
}

//print(sigma_z)

// Specify hyperparameters here - based on Chong et al.(2018)
rho_eta[1:(p+q)] ~ beta(1.0, 0.3);
rho_delta[1:p] ~ beta(1.0, 0.3);
lambda_eta ~ gamma(10, 10); // gamma (shape, rate)
lambda_delta ~ gamma(10, 0.3);
lambda_e ~ gamma(10, 0.03);

// Specify priors here - these have a physical meaning in the computer software being calibrated
tf[1] ~ weibull(2.724,0.417);
tf[2] ~ uniform(1.737e-05,1);
tf[3] ~ normal(0.472,0.140);
tf[4] ~ gamma(6.90965275109803,19.158);
tf[5] ~ lognormal(-1.983,0.640);

L = cholesky_decompose(sigma_z); // cholesky decomposition
z ~ multi_normal_cholesky(mu, L);
}


Edit
One of my simulations (with 750 iters) just finished and I noticed that one of the chains did not mix at all as can be seen below:

From the pairs plot attached, I’m thinking that there could be something funny going one with beta_eta1 and beta_eta6 although I might be wrong!

Any insights on how to best tackle this? Is this related to the model definition and specifically the priors used?

1 Like

The direct way of implementing GP with covariance matrix built in Stan code, makes the autodiff tree to have n^2 nodes and it’s likely that you are running out of memory. At the moment you have two options to reduce the memory usage

There are some changes coming in the future to Stan that will make GPs faster, but it’s difficult to predict when all the necessary pieces needed are there.

5 Likes

Thanks for your reply! I’ve tried implementing the built in cov_exp_quad function and tested its performance on small dataset. I found the computational time to be longer when built-in function was used compared to my previous formulation. I’m not sure why that’s the case but I’m suspecting that it is due to having to use it multiple times to define multiple length-scales (please see my function definition below for more detail). If there is a more efficient way to use this function when defining multiple length-scales, please let me know.

My problem has more than 3D so I don’t think I would be able to use the paper suggested. However, I think I might be able to make use of the Kronecker Product approach described here: https://sethrf.com/files/fast-hierarchical-GPs.pdf Since the paper is from 2017, is there a built-in Kronecker product implementation of Gaussian Processes that differs from that paper? I couldn’t find one but I just thought I would check.

Computational Time for previous implementation:

        warmup sample
chain:1 87.015 49.680
chain:2 92.540 48.463


Computational Time for built-in function:

         warmup sample
chain:1 133.363 75.504
chain:2 141.954 73.151

functions {
real c_alpha(real lambda){
real alpha_dem = sqrt(lambda);
real alpha = 1 / alpha_dem;
return alpha;
}

real c_rho(real beta){
real beta_dem = sqrt(2.0 * beta);
real rho = 1 / beta_dem;
return rho;
}

matrix c_sigma7(matrix X, real lambda, row_vector beta, int N){

matrix[N, N] sigma;

return(sigma);
}

matrix c_sigma2(matrix X, real lambda, row_vector beta, int N){

matrix[N, N] sigma;

return(sigma);
}
}

data {
int<lower=0> n; // number of field data
int<lower=0> m; // number of computer simulation
int<lower=0> p; // number of observable inputs x
int<lower=0> q; // number of calibration parameters t
vector[n] y; // field observations
vector[m] eta; // output of computer simulations
matrix[n, p] xf; // observable inputs corresponding to y
// (xc, tc): design points corresponding to eta
matrix[m, p] xc;
matrix[m, q] tc;
}

// Need to combine y (observations) and eta (simulations)
// into a single vector to establish statistical relation
// Chong & Menberg (2018), Hidgon et al. (2004)
transformed data {
int<lower=1> N;
vector[n+m] z; // z = [y, eta]
vector[n+m] mu; // mean vector

N = n + m;
// set mean vector to zero
for (i in 1:N) {
mu[i] = 0;
}
z = append_row(y, eta);
}

parameters {
// tf: calibration parameters
// rho_eta: reparameterization of beta_eta (the correlation parameters of the simulator)
// rho_delta: reparameterization of beta_delta (the correlation parameter of the model discrepancy)
// lambda_eta: precision parameter for eta
// lambda_delta: precision parameter for bias term
// lambda_e: precision parameter of observation error
row_vector<lower=0,upper=1>[q] tf;
row_vector<lower=0,upper=1>[p+q] rho_eta;
row_vector<lower=0,upper=1>[p] rho_delta;
real<lower=0> lambda_eta;
real<lower=0> lambda_delta;
real<lower=0> lambda_e;
}

transformed parameters {
// beta_delta: correlation parameter for bias term
// beta_eta: correlation parameter of observation error
row_vector[p+q] beta_eta;
row_vector[p] beta_delta;
beta_eta = -4.0 * log(rho_eta);
beta_delta = -4.0 * log(rho_delta);
}

// Create GP model based on the definitions from above
model {
// declare variables
matrix[N, (p+q)] xt;
matrix[N, N] sigma_eta; // simulator covariance
matrix[n, n] sigma_delta; // bias term covariance
matrix[N, N] sigma_z; // covariance matrix
matrix[N, N] L; // cholesky decomposition of covariance matrix
row_vector[p] temp_delta;
row_vector[p+q] temp_eta;

// field observation (xf) and calibration variables (tf)
// are placed together in a matrix with the
// computer observation (xc) and calibration variables (xc)
// xt = [[xf,tf],[xc,tc]]
xt[1:n, 1:p] = xf; // field observations
xt[(n+1):N, 1:p] = xc; // computer observations (assume to be the same as xf)
xt[1:n, (p+1):(p+q)] = rep_matrix(tf, n);
xt[(n+1):N, (p+1):(p+q)] = tc; // computer calibration variables

// computeation of covariance matrix sigma_eta
sigma_eta = c_sigma7(xt, lambda_eta, beta_eta, N);

// computeation of covariance matrix sigma_delta (bias)
sigma_delta = c_sigma2(xf, lambda_delta, beta_delta, n);

//for (j in 1:2){
//  print("u[", j, "] = ", sigma_delta[j]);
//}

// computation of covariance matrix sigma_z
sigma_z = sigma_eta;
sigma_z[1:n, 1:n] = sigma_eta[1:n, 1:n] + sigma_delta;

for (i in 1:n) {
sigma_z[i, i] = sigma_z[i, i] + (1.0 / lambda_e);
}

//print(sigma_z)

// Specify hyperparameters here
rho_eta[1:(p+q)] ~ beta(1.0, 0.3);
rho_delta[1:p] ~ beta(1.0, 0.3);
lambda_eta ~ gamma(10, 10); // gamma (shape, rate)
lambda_delta ~ gamma(10, 0.3);
lambda_e ~ gamma(10, 0.03);

// Specify priors here
tf[1] ~ weibull(2.724,0.417);
tf[2] ~ uniform(1.737e-05,1);
tf[3] ~ normal(0.472,0.140);
tf[4] ~ gamma(6.90965275109803,19.158);
tf[5] ~ lognormal(-1.983,0.640);

L = cholesky_decompose(sigma_z); // cholesky decomposition
z ~ multi_normal_cholesky(mu, L);
}


Yes, it’s the multiple calls and multiple matrix pointwise multiplications that are now making the autodiff tree big and slow.

@bbbales2 can you remind us about the status of vector length scale for cov_exp_quad?

If that’s not yet in (at least document doesn’t show it), you would get better performance by scaling columns of X first using diag_post_multiply and then calling cov_exp_quad once.

It’s possible that in this approach the autodiff part will again be dominating the computation time.

No.

1 Like

Non-existent, I’m afraid.

Yes do this. Something like:

diag_post_multiply(X, inv(beta))


I think inv is vectorized.

Edit: updated eq
Edit2: unupdated eq. Had it right the first time whoops

2 Likes

Thank you both for the help.

After looking at the definition of the cov_exp_quad I can’t see a version that may be applied to a matrix. In my case, X is a [m, p] matrix and therefore I think I would need to still apply cov_exp_quad. After applying diag_post_multiply(X, inv(beta)) on X, the output would still be a matrix which as far as I can tell, cov_exp_quad can’t handle. I’m probably missing something obvious here, but how could I deal with that?

Oh ouch we never doc’d the functions that do this apparently.

I think something like this should work, might have to do some debugging:

matrix c_sigma7(matrix X, real lambda, row_vector beta, int N) {
matrix[N, N] scaled_X = diag_post_multiply(X, inv(beta));
matrix[N, N] Sigma;

for(i in 1:N) {
Sigma[i, i] = -0.5 * square(lambda);
for(j in (i + 1):N) {
Sigma[i, j] = -0.5 * squared_distance(scaled_X[i,], scale_X[j,]);
Sigma[j, i] = Sigma[i, j];
}
}

return lambda * Sigma;
}


(Edit: forgot lambda)

2 Likes

Many thanks for clarifying! After a couple of changes (e.g. taking the exponential of the squared distance) I got it work. I compared it against my original attempt (original post) and it’s faster! I added the results below for reference.

If it’s not too much trouble, could you briefly explain to me why this has led to a decrease in computational cost, especially in the warmup, even though it’s relying on a for loop?

In addition, I’m using get_elapsed_time to compare the computational time. Is there something similar for memory use?

New Function:

  matrix c_sigma(matrix X, real lambda, row_vector beta, int N, int p, int q) {
matrix[N, (p+q)] scaled_X = diag_post_multiply(X, square(beta));
matrix[N, N] Sigma;
real lambda_inv = 1.0 / lambda;

for(i in 1:N) {
Sigma[i, i] = lambda_inv;
for(j in (i + 1):N) {
Sigma[i, j] = exp(-squared_distance(scaled_X[i,], scaled_X[j,])) * lambda_inv;
Sigma[j, i] = Sigma[i, j];
}
}

return Sigma;
}


Original Script:

         warmup sample
chain:1 121.344 67.152
chain:2 128.246 68.578


New Script (using diag_post_multiply):

        warmup sample
chain:1 72.105 41.812
chain:2 70.929 40.432

2 Likes

Hahaha, I guess this is why debugging code is important.

In terms of memory, the way to think about how automatic differentiation works in Stan, there is an extra little piece of information saved for every mathematical operation. That means every time you add two scalars together (a + b), a record of that operation is saved. For bigger operations, (like the squared_distance operator), we’ve coded this up so that information is small, but it’s still something.

If you’re building an NxN matrix, that’s at O(N^2) of these things. If you build 7 O(N^2) matrices, that’ll be 7x the cost. In this rewrite we put the 7x bit inside squared_distance, which will use less memory than if you broke it into its individual operations (sum(square(x_scaled[i, ] - x_scaled[j, ]))). So that’s the trick.

Anyway this back of the envelope stuff is all very inexact. @rok_cesnovar is working on some tools to debug this.

2 Likes

Hmm, it’s there in C++ and I think it’s in the language too, just not doc’d. Your comments on the docs (https://github.com/stan-dev/docs/pull/272) are good, I just haven’t gotten around to following up yet.

I thought about this but I’m just hesitant to tell people about functions that aren’t doc’d yet. Too confusing. It will hopefully soon be fixed.

Although inexact, it was useful to read - thank you for taking the time to explain this to me!

Thanks for letting me know!

1 Like

Oh yeah hey, follow up, I talked to someone and realized my comments about “Non-existent, I’m afraid” sounds really dismissive of the work you did. I didn’t mean it that way, my apologies (at some point @avehtari had asked me to look at a multiple length scale thing and I never did it – I was really thinking about that). We appreciate your contributions.

3 Likes

Hello, an update and a couple of question, if I may!

Drawing from the Flaxman paper, I implemented a kronecker structure within my model. I have also included one more predictor, a nugget and boundary avoiding priors (invgamma) which allow the model to run fast and converge without any problems (pair plots seem fine as well to my inexperienced eyes).

The problem I’m facing is that the model is overfitting the training data, resulting in poor out of sample predictions. The troubling part for me is that the posteriors of the lengthscales for some predictors are very small (where I wouldn’t expect them to be based on the prior predictive checks I run and my understanding of the simulator being emulated), while the nugget posterior mean is also very small. I tried regularising by using stronger priors (based on my understanding of the simulator being emulated, I expect monotonic relationships for most predictors) and it makes little to no difference. In an extreme example (the code is below), the posterior mean of the nugget was 0.02, when the probability of the prior being <= 0.02 was 10E-218.

Questions:

• Is my model somehow ignoring my priors? . Probably not the case but I’m lost as to why the posterior concentrates on values that highly improbable according to the prior
• Am I justified to add a fixed nugget instead of a parameter with a prior? This reduces the overfitting issue but I’m unsure whether that’s appropriate.

Notes:

• Model run on rstan 2.18.2 which is the only version available at my Universities HPC system
• There are 8 predictors scaled to be within [0, 1].
• I can’t increase the number of functional outputs (xc) but I can increase the number of simulation runs (i.e. the combinations of sampled simulator model inputs captured by tc) although this made no difference so far.
functions {
// return (A \otimes B) v where:
// A is n1 x n1, B = n2 x n2, V = n2 x n1 = reshape(v,n2,n1)
matrix kron_mvprod(matrix A, matrix B, matrix V) {
return transpose(A * transpose(B * V));
}
// A is a length n1 vector, B is a length n2 vector.
// Treating them as diagonal matrices, this calculates:
// v = (A \otimes B + sigma2)ˆ{-1}
// and returns the n1 x n2 matrix V = reshape(v,n1,n2)
matrix calculate_eigenvalues(vector A, vector B, int n1, int n2, real sigma2) {
matrix[n1,n2] e;
for(i in 1:n1) {
for(j in 1:n2) {
e[i,j] = (A[i] * B[j] + sigma2);
}
}
return(e);
}

matrix c_sigma(matrix X, real lambda, row_vector rho, int N, int p, int q) {
matrix[N, (p+q)] scaled_X = diag_post_multiply(X, inv(rho));
matrix[N, N] Sigma;
real lambda_inv = 1.0 / lambda;

for(i in 1:N) {
Sigma[i, i] = lambda_inv;
for(j in (i + 1):N) {
Sigma[i, j] = exp(-0.5*squared_distance(scaled_X[i,], scaled_X[j,])) * lambda_inv;
Sigma[j, i] = Sigma[i, j];
}
}

return Sigma;
}
}

// Try to perform reparametrisation of priors
data {
int<lower=0> m2; // number of functional output points xc
int<lower=0> m1; // number of computer simulation permutations tc
int<lower=0> p; // number of observable inputs x
int<lower=0> q; // number of calibration parameters t
matrix[m2, m1] eta_m; // output of computer simulations
matrix[m2, p] xc;
matrix[m1, q] tc;
}

parameters {

row_vector<lower=0>[q] rho_eta_tc;
row_vector<lower=0>[p] rho_eta_xc;
real<lower=0> lambda_eta;
//real<lower=0> lambda_sim;
real<lower=0> nugget;
}

// Create GP model based on the definitions from above
model {
// declare variables
matrix[m1, m1] sigma_eta_tc; // simulator covariance for calibration parameters
matrix[m2, m2] sigma_eta_xc; // simulator covariance for weather parameters
matrix[m1, m1] Q1;
matrix[m2, m2] Q2;
vector[m1] L1;
vector[m2] L2;
matrix[m1,m2] eigenvalues;

sigma_eta_tc = c_sigma(tc, lambda_eta, rho_eta_tc, m1, 0, q);
sigma_eta_xc = c_sigma(xc, 1, rho_eta_xc, m2, p, 0);

// Specify hyperparameters here
rho_eta_xc[1:(p)] ~ inv_gamma(15.0, 15.0);
rho_eta_tc[1:(q)] ~ inv_gamma(5.0, 5.0);
lambda_eta ~ gamma(10, 10); // gamma (shape, rate)
//lambda_sim ~ gamma(10,0.001); // inverse of nugget component
nugget ~ inv_gamma(1.0, 10.0);

Q1 = eigenvectors_sym(sigma_eta_tc); // tc eigenvector
Q2 = eigenvectors_sym(sigma_eta_xc); // xc eigenvector
L1 = eigenvalues_sym(sigma_eta_tc); // tc eigenvalues
L2 = eigenvalues_sym(sigma_eta_xc); // xc eigenvalues
eigenvalues = calculate_eigenvalues(L1,L2,m1,m2,nugget);

target += ( -0.5 * sum(eta_m .* kron_mvprod(Q1,Q2,
kron_mvprod(transpose(Q1),transpose(Q2),eta_m) ./ transpose(eigenvalues))) //Added transpose to eigenvalues myself
- .5 * sum(log(eigenvalues)));
}



The nugget is added in these situations to make up for what should be a positive definite matrix numerically ending up with negative or zero eigenvalues.

So it should just be a really small constant thing and not a parameter (like 1e-10, or whatever the smallest thing is so that the mathematically positive definite covariance matrix is actually positive definite when you do a decomposition).

If the model works without the nugget too, just do that.

If this still seems to overfit too much, then I guess where are the predictions vs. the training samples? Is anything going outside the training set?

How many dimenions is the emulator on input/output?

1 Like

Because I can’t train the emulator on all of the simulator inputs since there is too many of them, I selected the subset which the model output of interest is most sensitive too. Therefore, I do believe there is some noise in the training data that can’t be fully explained by the predictors alone (given the exact same values for the eight predictors, the output could be slightly different). Hence, I thought that treating it as a noisy GP is appropriate even if it’s an emulation problem. The noise is what I called a “nugget” above.

No I already checked and ensured that the validation data points are all within the training data point range. When trying to reproduce the training period, the trained model performs extremely well (e.g. R^2 > 0.96 and NMBE < 0.6 \% but then fails to reproduce the unseen validation period. I do think that the posterior values for the lengthscale for some of the predictors are too small, leading to high frequency relationships that physically seem counterintuitive - this is where I think the model tries to make up for the noise by overfitting.

For each output, the are 8 inputs (p = 3 and q = 5 in the stan script above).

Oh okay you’d want to estimate that then.

I guess that would mean that between data points the emulator reverts to its prior mean and doesn’t interpolate well and leads to bad predictions.

Can you make marginal predictions for your 8 inputs?

So for any single input, hold the other 7 at some value (maybe their average), and then make a plot of what happens on that input (and put data there too).

I’ll give this a try, thanks! Is this to identify whether indeed some of the lengthscales are inappropriately estimated?

With regards to the posterior of the noise term (and some of the lengthscales), isn’t it surprising to get most of the mass to lie at as such a low probability space according to the prior (probability of 10E-218 for the noise term, 10E-30 for some of the lengthscales)? I would think by using those priors, I’m telling the model that there is essentially no chance of the lengthscale/noise being that small, yet somehow the posterior still gathers there.

If the length scales are indeed getting too small then I expect what you’ll see is that at data points the GP is accurate and between them it reverts to the mean (and interpolates poorly).

Yeah, it’s telling us the data really wants to do something else. Those second numbers seem like densities not masses though (they’d be masses if you integrated over some volume), but I don’t think you need to compute anything exactly here (if the posterior mean is a few prior standard deviations away from the prior mean – that’s worth raising an eyebrow at, not that there are any hard and fast rules on N-standard deviations here).

(Edit: to -> too)

It might be good to then use monotonic functions. Monotonic functions are less likely to overfit.

You can also have a lower limit for a nugget if you have suitabel prior information about the measurement accuracy.

Sorry, I don’t have time to check the model in more detail