Could you please help to find out why my GP model got slower

Dear community,

Sorry for writing so much. Maybe this is not really a GP question but rather one about my poor Stan coding ability. I am playing with GPs following this great case study from @betanalpha https://betanalpha.github.io/assets/case_studies/gaussian_processes.html, the Stan manual section 10.3 Fitting a Gaussian process | Stan User's Guide, and @avehtari's HSGP paper [2004.11408] Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming.

My exercise is to estimate a 1D function f(x_{i}), i = 1, 2, ..., N_{f}, which is convolved (i.e., *) with a vector event_{j}, j = 1, 2, ..., N_{obs}, so that the observed data are y_{obs} = event * f + Gaussian noise with standard deviation \sigma. The size N_{f} of f is just 100, so it is not computationally expensive to model it with an exponentiated quadratic covariance function. The vector event_{j} has the same size as y_{obs_{j}}, with N_{obs} around 2500. The values of event_{j} are binary, either 0 or 1, and the positions of the 1s follow a known pattern (event[j] = 1 for j = k and event[j] = 0 otherwise, with the indices k predefined). The idea is that I simulated y_{obs} as y_{obs} = event * f + noise with known f, event and \sigma, and now I'd like to model f with a GP and estimate it from y_{obs} and event. Since I didn't find a convolution function that can be used inside Stan, I just built a Toeplitz matrix from the vector event and an all-zero vector with the same length as f. This was done outside Stan using the Toeplitz function from the "pracma" package with the following R code; the resulting matrix evt is then passed to the Stan model as data.

evt = t(Toeplitz(event, c(event[1], matrix(0, nrow = 1, ncol = length(f)-1))))
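To make explicit what this matrix does, here is a toy example with hypothetical sizes N_{f} = 2 and N_{obs} = 4 (writing e_{j} for event_{j}): each column of evt' is a shifted, zero-padded copy of event, so the product evt' * f reproduces the discrete convolution event * f truncated to the first N_{obs} samples,

\text{evt}' f =
\begin{pmatrix}
e_1 & 0 \\
e_2 & e_1 \\
e_3 & e_2 \\
e_4 & e_3
\end{pmatrix}
\begin{pmatrix} f_1 \\ f_2 \end{pmatrix}
=
\begin{pmatrix}
e_1 f_1 \\
e_2 f_1 + e_1 f_2 \\
e_3 f_1 + e_2 f_2 \\
e_4 f_1 + e_3 f_2
\end{pmatrix},

which is (event * f)_{j} for j = 1, ..., 4.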

Then I can just multiply evt and f in the transformed parameters block instead of performing a convolution. The model below (model #1) ran with good diagnostics and recovered the shape of f well; it took around 100 s on my laptop. I can also mimic the HSGP paper's code (just as practice, to understand the method) and approximate the covariance function of f with m around 30 basis functions to speed things up, although speed is not a big issue given that f has only 100 elements.


model #1

data{
    int<lower=1> N_obs;
    int<lower=1> N_f;
    int<lower=1> N_noise;
    vector[N_obs] y_obs;
    matrix[N_f,N_obs] evt;
    real x_f[N_f];
}

transformed data {
  real delta = 1e-10;
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma_obs;
  vector[N_f] eta;
}

transformed parameters {
  vector[N_f] f;
  vector[N_obs] pred_obs; 
  {
    matrix[N_f, N_f] L_K;
    matrix[N_f, N_f] K = cov_exp_quad(x_f, alpha, rho);
    for (n in 1:N_f)
      K[n, n] = K[n, n] + delta;
    L_K = cholesky_decompose(K);
    f = L_K * eta ;
    pred_obs = evt' * f;
  }
}

model {
  // alternative rho priors: inv_gamma(9, 173) [10,50], inv_gamma(4.6, 55.2) [5,50],
  // inv_gamma(4.6, 22) [2,20], inv_gamma(7.3, 75) [5,30], inv_gamma(4.6, 33.1) [3,30]
  rho ~ inv_gamma(4.6, 33.1);
  alpha ~ normal(0, 2);
  sigma_obs ~ std_normal();
  eta ~ std_normal();
  y_obs ~ normal(pred_obs, sigma_obs); 
}

However, I then tried to take event itself as input data and create the Toeplitz matrix evt inside the Stan code, as shown in model #2, and the computation time became almost 10 times longer while giving the same estimates as model #1. The Toeplitz function was translated from the function of the same name in the "pracma" package, so its input and output should be exactly what I computed outside Stan with evt = t(Toeplitz(event, c(event[1], matrix(0, nrow = 1, ncol = length(f)-1)))). It also doesn't involve any parameters, only predefined values.

My question is: do you think the slowdown is because the multiplication pred_obs = generate_evt(event, N_f, N_obs) * f calls the Toeplitz function so many times? I mean that every draw of f calls this function once, even though the function itself doesn't involve any parameters to be estimated. Is there a trick to compute evt only once? I guess I should not put it in the transformed parameters block, but then again the spectral density functions of the different covariance functions are also called in transformed parameters in the birthday model basis_functions_approach_to_GP/CaseII_Birthday_Approximate-GP.stan at master · gabriuma/basis_functions_approach_to_GP · GitHub.

The reason I'd like to build this Toeplitz matrix inside Stan is that, in a next step, I want to simulate data with a new event vector in which event[k] = alpha[k] instead of event[k] = 1, and then estimate both f and the alpha[k] as parameters.
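Roughly, I imagine something like the sketch below, reusing the generate_evt function from model #2. The names n_events, event_idx and amp are placeholders (event_idx would be passed as data, giving the positions of the events, and amp plays the role of the alpha[k] above, renamed to avoid clashing with the GP's alpha):

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma_obs;
  vector[N_f] eta;
  vector[n_events] amp;              // unknown event amplitudes (instead of event[k] = 1)
}

transformed parameters {
  vector[N_f] f;
  vector[N_obs] pred_obs;
  {
    vector[N_obs] event_amp = rep_vector(0, N_obs);
    matrix[N_f, N_f] K = cov_exp_quad(x_f, alpha, rho);
    for (n in 1:N_f)
      K[n, n] = K[n, n] + delta;
    f = cholesky_decompose(K) * eta;
    for (k in 1:n_events)
      event_amp[event_idx[k]] = amp[k];
    // event_amp now depends on parameters, so here the Toeplitz matrix really
    // does have to be rebuilt at every log-density evaluation
    pred_obs = generate_evt(event_amp, N_f, N_obs) * f;
  }
}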


model #2

functions {
  // Toeplitz matrix with first column a and first row b
  // (translated from pracma::Toeplitz)
  matrix Toeplitz(vector a, vector b) {
    int n = rows(a);
    int m = rows(b);
    matrix[n, m] T;
    T[1, ] = b';
    T[, 1] = a;
    for (i in 2:n) {
      T[i, 2:m] = T[i - 1, 1:(m - 1)];
    }
    return T;
  }

  // build the [N2, N1] convolution matrix from the event vector
  matrix generate_evt(vector event, int N1, int N2) {
    matrix[N2, N1] evt = Toeplitz(event, append_row(event[1], rep_vector(0, N1 - 1)));
    return evt;
  }
}

data{
    int<lower=1> N_obs;
    int<lower=1> N_f;
    int<lower=1> N_noise;
    vector[N_obs] y_obs;
    vector[N_obs] event;
    real x_f[N_f];
}

transformed data {
  real delta = 1e-10;
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma_obs;
  vector[N_f] eta;
  vector[N_noise] cof_noise;
}

transformed parameters {
  vector[N_f] f;
  vector[N_obs] pred_obs; 
  {
    matrix[N_f, N_f] L_K;
    matrix[N_f, N_f] K = cov_exp_quad(x_f, alpha, rho);

    // diagonal elements
    for (n in 1:N_f)
      K[n, n] = K[n, n] + delta;
    L_K = cholesky_decompose(K);
    f = L_K * eta;
    pred_obs = generate_evt(event, N_f, N_obs) * f;
  }
}

model {
  // alternative rho priors: inv_gamma(9, 173) [10,50], inv_gamma(4.6, 55.2) [5,50],
  // inv_gamma(4.6, 22) [2,20], inv_gamma(7.3, 75) [5,30], inv_gamma(4.6, 33.1) [3,30]
  rho ~ inv_gamma(4.6, 33.1);
  alpha ~ normal(0, 2);
  sigma_obs ~ std_normal();
  eta ~ std_normal();
  cof_noise ~ normal(0, 10);
  y_obs ~ normal(pred_obs, sigma_obs); 
}

Thanks a lot,
zcai


Indeed it was due to my poor Stan programming ability. I found the guidance here, 8.6 Program block: transformed parameters | Stan Reference Manual, which says

Any variable that is defined wholly in terms of data or transformed data should be declared and defined in the transformed data block. Defining such quantities in the transformed parameters block is legal, but much less efficient than defining them as transformed data.

So I moved generate_evt(event, N_f, N_obs) from transformed parameters to transformed data, and the speed is now the same as when I passed this constant matrix in as input data.
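Concretely, the only change is that the matrix is now built in transformed data, using the same generate_evt function as in model #2:

transformed data {
  real delta = 1e-10;
  // event, N_f and N_obs are all data, so this matrix is computed once
  // before sampling instead of at every log-density evaluation
  matrix[N_obs, N_f] evt = generate_evt(event, N_f, N_obs);
}

and in transformed parameters the multiplication is again just pred_obs = evt * f; (no transpose needed, since evt is already built as [N_obs, N_f]).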


Glad you got it sorted! Side note, we (Stan) should figure out how to do

diagonal(A) += delta;

to replace

    for (n in 1:N_f)
      K[n, n] = K[n, n] + delta;

We do allow

A += diag_matrix(rep_vector(delta, N));

though I think it will be a little slower than the first piece of code.


Thanks, Steve. The Robust Gaussian Process Modeling case study adds to the diagonal elements k(x_{i}, x_{i}) as follows,

  matrix[N_obs, N_obs] cov =   cov_exp_quad(x_obs, alpha, rho)
                             + diag_matrix(rep_vector(square(sigma), N_obs));

While I copied the following from the Stan User's Guide, 10.3 Fitting a Gaussian process | Stan User's Guide:

    // diagonal elements
    for (n1 in 1:N1)
      K[n1, n1] = K[n1, n1] + delta;

I would use add_diag

K = add_diag(K, delta);

We should update that to use add_diag.


Thanks a lot, @stevebronder and @avehtari. Just for your reference, I tested both options with the same data and seed using one chain.

Using K = add_diag(K, delta); I got

Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.

Chain 1: Elapsed Time: 258.85 seconds (Warm-up)
Chain 1: 297.712 seconds (Sampling)
Chain 1: 556.562 seconds (Total)

Using for (n in 1:N_f) K[n, n] = K[n, n] + delta; I got

Chain 1: Gradient evaluation took 0.002 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 20 seconds.

Chain 1: Elapsed Time: 285.46 seconds (Warm-up)
Chain 1: 301.017 seconds (Sampling)
Chain 1: 586.477 seconds (Total)


Hi @avehtari, may I ask roughly how long the computation for the birthday data takes? basis_functions_approach_to_GP/CaseII_Birthday_Approximate-GP.stan at master · gabriuma/basis_functions_approach_to_GP · GitHub. In the paper I only find the speed per iteration.

Another question, if I may: do you think optimizing() would work for such a model with a huge number of parameters? I wonder if I can use it just to estimate the mean value of each parameter. Can I roughly say that the more parameters there are and the noisier the data, the less accurate the estimates from optimizing() will be?

Optimizing will try to find the MAP parameter estimate, which is not the same as the posterior mean.

Probably. The first problem is that the optimization problem is not convex, so you can't be sure whether the local optimum you find is the global one. And I would think that as the number of parameters grows, it becomes harder and harder to find the global optimum (i.e. the actual MAP parameters).


One thing to keep in mind with convolution models is that if the convolution is too diffuse then the latent functional behavior can be extremely degenerate – in other words there will be many functional behaviors compatible with any given observation, and the sampler will have to spend an inordinate amount of time exploring those configurations.

When investigating why a model is slow it helps to consider not only the total time (either for the entire model or for parts of the model, using the new profiling tools) but also the sampler behavior, in particular the total number of leapfrog steps and the distribution of the number of leapfrog steps per iteration. If there is a consistent, reasonably small (say fewer than 25) number of leapfrog steps per iteration, then the slowdown is probably due to the cost of a single gradient evaluation; but if the numerical trajectories are much longer, then the number of gradient evaluations needed per iteration is probably more relevant. See for example some of the diagnostic investigations in Identity Crisis.
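For example, with recent Stan versions the two main computations in model #1 could be wrapped in profile statements, and the per-iteration n_leapfrog__ column of the sampler output compared across runs; a rough sketch:

transformed parameters {
  vector[N_f] f;
  vector[N_obs] pred_obs;
  {
    matrix[N_f, N_f] K;
    matrix[N_f, N_f] L_K;
    profile("covariance") {
      // time spent building and factoring the covariance matrix
      K = cov_exp_quad(x_f, alpha, rho);
      for (n in 1:N_f)
        K[n, n] = K[n, n] + delta;
      L_K = cholesky_decompose(K);
    }
    profile("convolution") {
      // time spent in the non-centered draw and the matrix product
      f = L_K * eta;
      pred_obs = evt' * f;
    }
  }
}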


Thanks a lot, Michael, for both the reply and the nice case studies. I am not sure I get what you mean by "too diffuse". If diffuse refers to how densely the 1s are distributed in the vector event_{j}, j = 1, 2, ..., N_{obs}, I agree that the latent function could capture many functional behaviours. In my simulation and in the real applications, the events are dense enough that the distance between neighbours is shorter than the length of the latent function f. If the 1s were sparse, we could just extract the echoes and average them to estimate f; the challenge is that they overlap copies of f along the x-axis during the convolution.

My simulation results also support your point to some extent. Even if I place the 1s closer together and randomize the spacing to keep the latent function from capturing periodic behaviour, the fitted GP will still capture some functional behaviour when I give the model an event sequence different from the one used to generate the observations. I understand a GP will capture something regardless, but I was expecting the posterior of the marginal standard deviation (e.g. \alpha in your GP case study) of the exponentiated quadratic kernel to reflect the uncertainty in the estimated latent function, so that using the wrong event sequence would end up with a larger \alpha. However, that is not what happens: \alpha is larger with the wrong sequence, but not as large as I expected. My guess so far is that sigma_obs in the data likelihood "absorbed" the uncertainty. To address this I tried putting the square of sigma_obs on the GP's diagonal instead of a nugget. This time the posterior of \alpha indeed reflected the uncertainty of f much better, but since sigma_obs is added to the diagonal of the Gram matrix, the function estimated by the GP is no longer smooth.
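What I tried was roughly the following change inside transformed parameters (just a sketch of the modification described above):

    // observation noise on the GP diagonal instead of the tiny jitter delta
    matrix[N_f, N_f] K = add_diag(cov_exp_quad(x_f, alpha, rho), square(sigma_obs));
    matrix[N_f, N_f] L_K = cholesky_decompose(K);
    f = L_K * eta;   // draws of f now include the noise level, so they are no longer smooth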

When the configuration of the latent phenomenology (here the Gaussian process) and the measurement model (here \sigma_{\text{obs}}) are fit jointly, they too can be degenerate. Indeed these are often the most subtle yet damaging degeneracies. This is especially true when the model is known to be wrong and the likelihood function contorts to capture all of the least-worst model configurations.


In this case study, it takes from seconds to minutes depending on the model (and it has faster code than the paper).

Optimizing can work if the number of parameters is much smaller than the number of observations. In the birthday case study, optimizing gives quite good results for the simplest model, but not for the others.


Thanks, @avehtari, I hadn't noticed this case study; I will go through it.