Pre-computation of distance matrix in Gaussian Process?

I’ve been working on-and-off on a project about inferring a latent time series. Since there are different “sensors” at work (binomial likelihood), I model potential group-specific effects as random intercepts. A really naive approach would be to fit using lme4::glmer with the formula (1|group) + (1|measurement_time).

Since the latent time series is presumed to be smooth, I want instead to use a Gaussian Process prior to allow sharing of information across time. There are several R libraries that can fit this type of model, typically using the Laplace Approximation to the log marginal posterior.

Since the number of measurement times is relatively large, I am working with a Gaussian Predictive Process. Here are a few relevant papers (for now, I am not using the Finley correction):

What confuses me about all the relevant Stan examples I have seen is the computation of the kernel. Every time cov_exp_quad is called, we end up rebuilding the distance matrix between all points. Is this not inefficient? Why not feed the matrix of squared distances D in through the data block and then build the kernel covariance matrix as K = alpha^2 * exp(-0.5 * D / rho^2)?
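
For concreteness, here is a rough sketch of what I have in mind (the names are placeholders; D would hold the squared distances between all measurement times, computed once outside Stan):

data {
  int n_points;
  matrix[n_points, n_points] D; // precomputed squared distances, passed in as data
}
parameters {
  real<lower=0> alpha;
  real<lower=0> rho;
}
transformed parameters {
  // only the kernel values are recomputed; the distances never change
  matrix[n_points, n_points] K
    = add_diag(square(alpha) * exp(-0.5 * D / square(rho)), 1e-4);
}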

For reference, my model below (using the standard approach) seems to fit very slowly. Perhaps this is actually a more difficult model than I assumed, or perhaps there are some inefficiencies in the code? I have 1127 observations measuring the process on 324 different dates. There are also 55 “groups” and I am using 48 predictive “knots”.

parameters {
  vector[n_groups] bias;
  vector[n_knots] knot_tilde;
    
  real mu;
  real<lower=0> lambda;
  real<lower=0> tau;
  real<lower=0> rho;
}
transformed parameters {
  matrix[n_knots, n_knots] C_knot = add_diag(cov_exp_quad(z_knot, tau, rho), 1e-4);
  matrix[n_obs, n_knots] C_predict = cov_exp_quad(z_predict, z_knot, tau, rho);
  matrix[n_knots, n_knots] L = cholesky_decompose(C_knot);
  vector[n_knots] knot = L * knot_tilde;
}
model {
  y ~ binomial_logit(n, mu + mdivide_right_spd(C_predict, C_knot) * knot + bias[group_index]);
  bias ~ normal(0, lambda);
  knot_tilde ~ normal(0, 1);
  rho ~ inv_gamma(5, 5);
  tau ~ std_normal();
  lambda ~ std_normal();
}

Any thoughts besides reducing the size of mdivide_right_spd(C_predict, C_knot) * knot by making it mdivide_right_spd(C_predict, C_knot)[date_index] * knot?

I’ve had the same thought, but every time I benchmark things (e.g. here), it’s very hard to beat cov_exp_quad, which I can only presume has some internal voodoo optimizations.

I did recently get something that was faster, but not dramatically so:

	matrix flat_cov_exp_quad(
		real eta
		, real rho
		, real jitter
		, int n_x
		, int square_n_x
		, int n_udx
		, vector udx
		, int[] i_uk_k_ut
		, int[] i_uk_k_lt
		, int[] i_k_diag
		, int[] i_k_ut
		, int[] i_k_lt
	){
		real sq_eta = square( eta ) ;
		// kernel values for the unique squared distances only
		vector[n_udx] uk = sq_eta * exp( udx / (-2*square(rho)) ) ;
		// assemble the full covariance as a flat vector, then reshape
		vector[square_n_x] k_flat ;
		k_flat[i_k_diag] = rep_vector(sq_eta + jitter,n_x) ;
		k_flat[i_k_ut] = uk[i_uk_k_ut] ;
		k_flat[i_k_lt] = uk[i_uk_k_lt] ;
		return to_matrix(k_flat,n_x,n_x) ;
	}

Where the R code for computing the inputs is:

dx = matrix(NA,n,n)
for(i in 1:n){
	for(j in 1:n){
		dx[i,j] = (x[i]-x[j])^2
	}
}

dx_ut_flat = as.numeric(dx[upper.tri(dx)])
udx = unique(dx_ut_flat)
dx_lt_flat = as.numeric(dx[lower.tri(dx)])

i_uk_k_ut = match(dx_ut_flat,udx)
i_uk_k_lt = match(dx_lt_flat,udx)
i_k_diag = which(as.numeric(diag(1,n,n))==1)
i_k_ut = which(as.numeric(upper.tri(matrix(NA,n,n)))==1)
i_k_lt = which(as.numeric(lower.tri(matrix(NA,n,n)))==1)
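
And for reference, in the model it gets called roughly like this (eta and rho are parameters, everything else comes in through the data block; k_x and L_k are just placeholder names):

	transformed parameters {
		matrix[n_x,n_x] k_x = flat_cov_exp_quad(
			eta, rho, 1e-4, n_x, square_n_x, n_udx, udx
			, i_uk_k_ut, i_uk_k_lt, i_k_diag, i_k_ut, i_k_lt
		) ;
		matrix[n_x,n_x] L_k = cholesky_decompose(k_x) ;
	}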

Usually rebuilding the covariance matrix (O(N^2) work) takes much less time than computing its Cholesky factorization (O(N^3)), so it really doesn’t matter that it’s rebuilt each time.

It seems to be missing the data block?

For time series, you could consider a Hilbert space basis function approximation of the GP. The basis functions can be constructed in the transformed data block, and during sampling only the spectral density values in the approximation need to be recomputed when the lengthscale changes. See the code examples at GitHub - gabriuma/basis_functions_approach_to_GP, the Birthdays case study, and the Gaussian process demonstration with Stan.
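
To give a flavour, a minimal 1-D sketch for the exponentiated quadratic kernel looks roughly like this (the names are placeholders; M and the boundary factor are choices you have to make, see the references above for guidance):

data {
  int N;       // number of input locations
  int M;       // number of basis functions
  real x[N];   // input locations (e.g. measurement times)
}
transformed data {
  vector[N] xv = to_vector(x);      // assumes x has been centred at zero
  real L_b = 5.0 * max(fabs(xv));   // boundary factor; must comfortably exceed the input range
  matrix[N, M] PHI;                 // basis functions, built once
  for (m in 1:M)
    PHI[, m] = sin(m * pi() / (2 * L_b) * (xv + L_b)) / sqrt(L_b);
}
parameters {
  real<lower=0> tau;                // marginal SD
  real<lower=0> rho;                // lengthscale
  vector[M] beta;                   // standard normal basis weights
}
transformed parameters {
  // only these M spectral density terms depend on (tau, rho) during sampling
  vector[M] sqrt_spd;
  vector[N] f;                      // approximate GP values at the inputs
  for (m in 1:M)
    sqrt_spd[m] = tau * sqrt(sqrt(2 * pi()) * rho)
                  * exp(-0.25 * square(rho * m * pi() / (2 * L_b)));
  f = PHI * (sqrt_spd .* beta);
}
model {
  beta ~ std_normal();
  // ... likelihood using f in place of the latent GP ...
}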

As the covariance matrix has many elements, there is a big difference between building it in the Stan language, where autodiff gets NxN nodes in the expression tree, and building it in C++ with matrix derivatives provided for autodiff, so that the whole matrix is just one node in the expression tree. There is some development going on in Stan core that might someday make GPs with covariance matrices defined in the Stan language faster.


Thanks for the suggestions!

I’ve fine-tuned things a bit to make the matrix inversion more efficient, and the code now runs in a reasonable amount of time using the default cov_exp_quad rather than @mike-lawrence’s proposal, although I wonder at what point that stops being true - I’m working on another paper with a large n and multiple covariates going into the GP. The plan there was again to use the predictive knot process, but I will probably end up going with an EM algorithm just for the sake of speed.

@avehtari, I have encountered this Hilbert space approximation before but never studied it in depth - I will give it a good read.

Posting the full code here in case there are other obvious inefficiencies. My guess is that the slow sampling is due to Stan taking many steps to explore the space (a difficult model to fit) rather than to the dataset size.

data {
  int n_obs;
  int n_groups;
  int n_knots;
  int n_dates;

  int n[n_obs];
  int y[n_obs];
    
  int index_date[n_obs];
  int index_group[n_obs];

  real z_knot[n_knots];
  real z_predict[n_dates];
}
parameters {
  vector[n_groups] bias;
  vector[n_knots] knot_tilde;
    
  real mu;
  real<lower=0> lambda;
  real<lower=0> tau;
  real<lower=0> rho;
}
transformed parameters {
  matrix[n_knots, n_knots] C_knot = add_diag(cov_exp_quad(z_knot, tau, rho), 1e-4);
  matrix[n_dates, n_knots] C_predict = cov_exp_quad(z_predict, z_knot, tau, rho);
  matrix[n_knots, n_knots] L = cholesky_decompose(C_knot);
  vector[n_knots] knot = L * knot_tilde;
  vector[n_dates] interp_knot = C_predict * mdivide_left_spd(C_knot,  knot);
}
model {
  y ~ binomial_logit(n, mu + interp_knot[index_date] + bias[index_group]);
  bias ~ normal(0, lambda);
  knot_tilde ~ normal(0, 1);
  rho ~ inv_gamma(5, 5);
  tau ~ std_normal();
  lambda ~ std_normal();
}

Thoughts on using mgcv::gam or mgcv::bam instead of being fully Bayesian in Stan? For large datasets mgcv seems to fit very quickly and has performed well for me. Obviously this is not as robust as a good MCMC run, and it doesn’t handle separate lengthscales for “automatic relevance determination”, but I am thinking mgcv could at least give a decent estimate or provide a proposal distribution for SIR (sampling importance resampling).
