Seeking advice on optimizing a Gaussian Process (GP) prior to improve my model's convergence time

Hi everyone,

I am working on a model that combines a Dirichlet process (DP) with a Gaussian process (GP) prior. I used a chunk of code from the Stan User’s Guide for the GP prior. However, I suspect that this part of the code is causing the model to take a very long time to converge.

Here’s a simplified version of the relevant part of my Stan code:

functions {
  matrix L_cov_exp_quad_ARD(array[] vector x,
                            real alpha,
                            vector rho,
                            real delta) {
    int N = size(x);
    matrix[N, N] K;
    real sq_alpha = square(alpha);
    for (i in 1:(N-1)) {
      K[i, i] = sq_alpha + delta;
      for (j in (i + 1):N) {
        K[i, j] = sq_alpha
                      * exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));
        K[j, i] = K[i, j];
      }
    }
    K[N, N] = sq_alpha + delta;
    return cholesky_decompose(K);
  }
}
data {
  int<lower=1> N;
  int<lower=1> D;
  array[N] vector[D] x;
  vector[N] y;
}
transformed data {
  real delta = 1e-9;
}
parameters {
  vector<lower=0>[D] rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
model {
  vector[N] f;
  {
    matrix[N, N] L_K = L_cov_exp_quad_ARD(x, alpha, rho, delta);
    f = L_K * eta;
  }

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  eta ~ std_normal();

  y ~ normal(f, sigma);
}



Issue:
The sampling process is extremely slow, and the model takes a very long time to converge.
Request:
Can anyone suggest how I might optimize this part of the code to improve convergence speed? Are there specific techniques or modifications that can make the GP prior more computationally efficient in Stan?

Thank you for your help!

Hi, I suppose that f and y are evaluated at the same input locations? If so, you can use marginalization to obtain the posterior over the parameters (excluding f). That is, if we have a GP:

P(f) = \mathcal{N}(f|0,K)
P(y|f) = \mathcal{N}(y|f,\sigma^2I)

We obtain the marginal as:

P(y) = \mathcal{N}(y|0,K+\sigma^2I)

This removes estimation of f from the sampling process. To estimate the posterior predictions of f we have:

P(f|y) = \mathcal{N}(f|K(K+\sigma^2 I)^{-1}y, K - K(K+\sigma^2 I)^{-1}K)
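
A minimal sketch of the marginalized version in Stan, keeping your data, transformed data, and parameters blocks (eta can be dropped, since it is no longer needed) and the same squared-exponential ARD kernel; untested, so treat it as a starting point rather than a drop-in replacement:

model {
  matrix[N, N] L_K;
  {
    // build K + sigma^2 I directly, so the noise variance sits on the diagonal
    matrix[N, N] K;
    real sq_alpha = square(alpha);
    for (i in 1:N) {
      K[i, i] = sq_alpha + square(sigma);
      for (j in (i + 1):N) {
        K[i, j] = sq_alpha * exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));
        K[j, i] = K[i, j];
      }
    }
    L_K = cholesky_decompose(K);
  }

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();

  // f has been integrated out analytically
  y ~ multi_normal_cholesky(rep_vector(0, N), L_K);
}

Because f is integrated out, the sampler only has to explore rho, alpha, and sigma, which usually mixes much faster than the latent-variable version.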

One of the things you can do is pre-store the dot_self terms dot_self(x[i] - x[j]) in a matrix in the transformed data block.

transformed data {
  vector[N] delta_vec = rep_vector(delta, N);
  matrix[N, N] dots;
  for (i in 1:N) {
    for (j in 1:N) {
      dots[i, j] = dot_self(x[i] - x[j]);
    }
  }
}

This does redundant work, but it does not involve gradients and is only done once on data load.

Then you can define the expensive kernel computation in one line (this assumes a single scalar length-scale rho; with the ARD vector rho you need per-dimension matrices, as in the sketch below):

K = add_diag(sq_alpha * exp((-0.5 / square(rho)) * dots), delta_vec);

It can be made even more efficient if you keep dots lower triangular, use a triangular multiply, then add transpose and fix diagonal. Probably not worth it.
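
If you want to keep the ARD structure (a separate length-scale per input dimension) while still moving the heavy precomputation to transformed data, one option is to store one squared-difference matrix per dimension. A rough, untested sketch; sq_alpha and delta are as in the code above, and sq_diff, Q, and L_K are just my names:

transformed data {
  vector[N] delta_vec = rep_vector(delta, N);
  // one matrix of squared differences per input dimension, so each
  // length-scale rho[d] can still be applied in the model block
  array[D] matrix[N, N] sq_diff;
  for (d in 1:D) {
    for (i in 1:N) {
      for (j in 1:N) {
        sq_diff[d][i, j] = square(x[i][d] - x[j][d]);
      }
    }
  }
}

and then in the model block:

{
  matrix[N, N] Q = rep_matrix(0, N, N);
  matrix[N, N] L_K;
  real sq_alpha = square(alpha);
  for (d in 1:D) {
    Q += sq_diff[d] / square(rho[d]);
  }
  L_K = cholesky_decompose(add_diag(sq_alpha * exp(-0.5 * Q), delta_vec));
  f = L_K * eta;
}

The loop over D adds a few dense matrix additions per leapfrog step, but the O(N^2 D) squared differences are only computed once.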

The real cost is going to be the Cholesky decompose, though, and there’s really no way around that in general code.

In some cases, as @Garren_Hermanus pointed out, you can do the analytic marginalization. With large N in low dimensions, you can also use other Fourier-transform-based tricks. But in general, GPs are expensive to compute.

You might also want to relax the precision of the answer you need. I don't know your ESS target, but an effective sample size of around 100 is usually enough for most inference.


Thank you very much for your suggestion, Dr. Carpenter. I implemented your recommended changes, which helped speed up my method to a considerable extent (up to 8 times faster when N = 500). However, when the number of observations is large, it is still expensive.

Could you please provide further advice or additional suggestions to help optimize my method? I have added the whole code in another comment. If you could have a look, I would really appreciate it.

Thank you so much for the suggestion. In the actual code, which I am providing here, I could not find a way to derive the closed form of the posterior distribution.

functions {
  // Mean function mu(x): reg[i, j] = x[i]' * beta[j]
  matrix L_mean_vec(array[] vector x, array[] vector beta) {
    int N = size(x);
    int K = size(beta);
    matrix[N, K] reg;
    for (i in 1:N) {
      for (j in 1:K) {
        reg[i, j] = dot_product(x[i], beta[j]);
      }
    }
    return reg;
  }
}
data {
  int<lower=0> K;                // number of clusters
  int<lower=0> N;                // number of observations
  int<lower=0> P;                // dimension of covariates
  array[N] real y;               // response
  array[N] vector[P] x;          // covariates
  vector[N] z;                   // treatment allocation

  // Hyperparameters
  real<lower=0> alpha_sigma;
  real<lower=0> beta_sigma;
  real<lower=0> sigma_beta;
  real<lower=0> a_0;
  real<lower=0> a_1;
}
transformed data {
  real d_mu = 1e-9;
  real d_delta = 1e-9;
  vector[N] delta_mu = rep_vector(d_mu, N);
  matrix[N, N] dots;
  matrix[N, N] K_mu;
  matrix[N, N] L_K;              // Cholesky factor of the GP covariance
  for (i in 1:N) {
    for (j in 1:N) {
      dots[i, j] = dot_self(x[i] - x[j]);
    }
  }
  K_mu = add_diag(exp(-0.5 * dots), delta_mu);
  L_K = cholesky_decompose(K_mu);
}
parameters {
  array[K] vector[N] mu;
  array[K] vector[N] delta;

  vector<lower=0, upper=1>[K - 1] v;  // stick-breaking components
  real<lower=0> alpha;                // DP concentration: DP(alpha, base)
  real<lower=0> sigma2;               // mixture normal variance
  array[K] vector[P] beta;            // mu(X) mean: X' * beta
}
transformed parameters {
  simplex[K] eta;
  vector<lower=0, upper=1>[K - 1] cumprod_one_minus_v;

  cumprod_one_minus_v = exp(cumulative_sum(log1m(v)));
  eta[1] = v[1];
  eta[2:(K - 1)] = v[2:(K - 1)] .* cumprod_one_minus_v[1:(K - 2)];
  eta[K] = cumprod_one_minus_v[K - 1];
}
model {
  array[K] real ps;
  matrix[N, K] mean_mat = L_mean_vec(x, beta);  // computed once, not per cluster

  // Prior specification
  alpha ~ gamma(a_0, a_1);
  sigma2 ~ inv_gamma(alpha_sigma, beta_sigma);
  for (i in 1:K) {
    beta[i] ~ multi_normal(rep_vector(0, P),
                           diag_matrix(rep_vector(sigma_beta, P)));
  }
  v ~ beta(1, alpha);

  // GP priors for mu(X) and delta(X)
  for (i in 1:K) {
    mu[i] ~ multi_normal_cholesky(mean_mat[, i], L_K);
    delta[i] ~ multi_normal_cholesky(rep_vector(0, N), L_K);
  }

  // Mixture likelihood
  for (i in 1:N) {
    for (k in 1:K) {
      ps[k] = log(eta[k])
              + normal_lpdf(y[i] | mu[k][i] + z[i] * delta[k][i], sigma2);
    }
    target += log_sum_exp(ps);
  }
}
generated quantities {
  real ll = 0;
  array[K] real ps_;

  for (i in 1:N) {
    for (k in 1:K) {
      ps_[k] = log(eta[k])
               + normal_lpdf(y[i] | mu[k][i] + z[i] * delta[k][i], sigma2);
    }
    ll += log_sum_exp(ps_);
  }
}

You do not need the full closed-form posterior. The natural extension of the marginalization, when we have additional parameters, is as follows.

Suppose we have the joint density:

P(y,f,\theta) = P(y|f,\theta)P(f|\theta)P(\theta)

Here \theta represents the parameters of the GP and may also include a noise parameter for the data. Marginalizing out f, we obtain the posterior:

P(\theta|y) \propto P(y|\theta)P(\theta)

where P(y|\theta) = \int P(y|f,\theta)P(f|\theta)\,df = \mathcal{N}(y|0,K+\sigma^2 I)

The posterior predictions are then given by the conditional P(f|y,\theta), the same density as above, but with K now a function of \theta.
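
For the plain GP-regression case from the first post (data x, y and parameters alpha, rho, sigma; this does not address the mixture structure of your full model), a rough sketch of how P(f|y,\theta) could be evaluated per draw in generated quantities. Names like f_post, L_noise, and V are just illustrative and the code is untested:

generated quantities {
  vector[N] f_post;
  {
    matrix[N, N] K;
    matrix[N, N] L_noise;
    matrix[N, N] V;
    matrix[N, N] f_cov;
    vector[N] f_mean;
    real sq_alpha = square(alpha);
    // rebuild the kernel matrix from the sampled hyperparameters
    for (i in 1:N) {
      K[i, i] = sq_alpha;
      for (j in (i + 1):N) {
        K[i, j] = sq_alpha * exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));
        K[j, i] = K[i, j];
      }
    }
    // P(f | y, theta) = N(f | K (K + sigma^2 I)^{-1} y,
    //                        K - K (K + sigma^2 I)^{-1} K)
    L_noise = cholesky_decompose(add_diag(K, square(sigma)));
    V = mdivide_left_tri_low(L_noise, K);
    f_mean = V' * mdivide_left_tri_low(L_noise, y);
    f_cov = K - V' * V;
    f_post = multi_normal_rng(f_mean, add_diag(f_cov, 1e-9));
  }
}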