How to make GP with non-Gaussian likelihood more efficient?

Hello all,

I’m using rstan to make predictions for count data while tuning hyperparameters.

Data can be downloaded here
gp.pois.data.2d.R (14.5 KB)

The Stan file can be run with:

library(rstan)

data <- read_rdump("gp.pois.data.2d.R")
pred_data <- list(N_observed = data$N, D = 2, y_observed = data$y,
                  N_predict = data$N_predict, x_predict = data$x_predict,
                  observed_idx = data$sample_idx)

fit <- stan(file = 'poisson.stan',
            data = pred_data, seed = 5838298, chains = 1)

There are 26 two-dimensional observed inputs x with outputs y, and I want to estimate the posterior predictive distribution at 1296 two-dimensional inputs x. However, after running the Stan code overnight it is still stuck at 0% of the iterations. Does anyone know how to make the estimation faster? Are there any techniques that I could possibly use? Many thanks in advance!

data {
  int<lower=1> N_predict;
  int<lower=1> D;
  vector[D] x_predict[N_predict];

  int<lower=1> N_observed;
  int<lower=1, upper=N_predict> observed_idx[N_observed];
  int y_observed[N_observed];
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  vector[N_predict] f_tilde;
}

transformed parameters {
  vector[N_predict] log_f_predict;
  {
    matrix[N_predict, N_predict] cov;
    matrix[N_predict, N_predict] L_cov;
    // covariance over all N_predict inputs, with a small jitter for numerical stability
    cov = cov_exp_quad(x_predict, alpha, rho)
          + diag_matrix(rep_vector(1e-10, N_predict));
    L_cov = cholesky_decompose(cov);
    // non-centered parameterization of the latent log-rate
    log_f_predict = L_cov * f_tilde;
  }
}

model {
  f_tilde ~ normal(0, 1);
  rho ~ inv_gamma(6.8589, 103.582);
  alpha ~ normal(0, 10);
  y_observed ~ poisson_log(log_f_predict[observed_idx]);
}

generated quantities {
  vector[N_predict] f_predict = exp(log_f_predict);
  vector[N_predict] y_predict;
  for (n in 1:N_predict)
    y_predict[n] = poisson_log_rng(log_f_predict[n]);
}

It’s been a while since I last looked at how to do interpolation with GPs in Stan, but as I recall you model based on the dimensionality of the observed data (so cov should be a 26x26 matrix), then in the generated quantities section you construct a larger prediction covariance using the desired interpolation points.

Thank you for your reply! I followed the "Predictive Inference in non-Gaussian GPs" section in https://mc-stan.org/docs/2_19/stan-users-guide/fit-gp-section.html. I think for predictive inference the cov matrix in transformed parameters has to be 1296 x 1296?

Ah, but the following "Analytical Form of Joint Predictive Inference" section shows a faster variant where you compute the smaller observed covariance in the model block and construct the larger prediction covariance in the generated quantities section.

Yes, I’m aware of that. But I thought it only applies to a Gaussian likelihood and not to a Poisson likelihood, since the posterior predictive distribution does not have a closed form when the likelihood is Poisson?


Hm, you may be right that the analytic interpolation trick is only possible with Gaussian outcomes. This is now beyond my expertise; @avehtari, have any thoughts?


I would use brms’s approximate GP (gpa) approach. It’s detailed here, and can be used for multidimensional GPs as well. It’s already in brms, so you can just do brm(y ~ gp(x1, x2, k = 50), family = …). It is magical. It turns a 2 hour model into a 20 second model, and an intractably large GP into something that takes 2-3 hours. The gist of it is that it decomposes the RBF kernel function into eigenfunctions and eigenvalues. The larger the number of eigenfunctions used (k), the more accurate it is. I’ve found that even with 50 it tends to do very well. It’s far faster, nearly as accurate, converges better, and prediction is simpler. brms, again, will handle the prediction side for you.
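For reference, a minimal, self-contained sketch of that kind of call on simulated data (the data frame d, the Poisson rates, the choice of k, and the sampler settings below are illustrative only, not taken from this thread):

library(brms)

# simulated 2D count data, for illustration only
set.seed(1)
d <- data.frame(x1 = runif(100, -2, 2), x2 = runif(100, -2, 2))
d$y <- rpois(100, exp(1 + sin(d$x1) * cos(d$x2)))

fit_gpa <- brm(
  y ~ gp(x1, x2, k = 10),   # approximate GP; k basis functions per dimension
  data = d,
  family = poisson(),
  chains = 4, cores = 4
)

# brms handles prediction at new inputs
newd <- expand.grid(x1 = seq(-2, 2, length.out = 10),
                    x2 = seq(-2, 2, length.out = 10))
pred <- posterior_predict(fit_gpa, newdata = newd)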


Many thanks for your suggestions! I’ll try this tutorial then.

Correct.

Stan will soon have the possibility to use the Laplace method to integrate over the latent values, which is usually very accurate with a Poisson likelihood and faster than sampling the latent values.

brms has two different GP approximations: s(x1, bs="gp") and gp(x1, k=20). The first is based on the spline basis function representation of a Matérn covariance function GP from the mgcv package, and the latter is based on the basis function approximation via Laplace eigenfunctions (based on Solin & Särkkä, Hilbert space methods for reduced-rank Gaussian process regression; further implementation details will be described in a paper under preparation).

s(x1, bs="gp") works for 1D and 2D

gp(x1, k=20) also works for 3D and sometimes up to 4D, provided the approximated function is quite smooth. The number of basis functions is k^D.

The current exact GP in brms, gp(x1), does not use the more efficient built-in covariance functions, but for small data it can still be useful for more than 4 dimensions.

Given “given 1296 2-dimensional inputs x” I would use one of the basis function approaches.
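For concreteness, the two specifications above would look roughly like this in brms (reusing the illustrative data frame d from the earlier sketch; the choice of smooth and of k are placeholders, not recommendations):

library(brms)

# spline basis representation of a Matérn GP via mgcv (1D or 2D inputs)
fit_spline <- brm(y ~ s(x1, x2, bs = "gp"), data = d, family = poisson())

# Laplace-eigenfunction basis function approximation (k^D basis functions in total)
fit_basis  <- brm(y ~ gp(x1, x2, k = 20), data = d, family = poisson())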

@Stephen_Martin can you describe a bit more what kind of data and models you have had when using gp(x1, x2, k=20) in brms? We are interested in learning about success and failure stories before finishing the paper.


@avehtari I am currently using additive univariate kernels (so, in brms terms: gp(x1) + gp(x2)) [there’s a reason for this]. I’ve been using generated data, and tested on a couple of others. Typically it’s std. normal predictors with N between 300 and 1000. I am currently applying it to latent variable models (e.g., in CFA/SEM). Specifically, I have a latent location variable for J indicators, and a corresponding latent scale variable for the J indicators’ residual variances. I have the latent location variable predicting the latent scale variable via an approximate GP.
This lets us have non-linear functions of reliability over latent space, ala IRT test information.

@Stephen_Martin have you noticed any strange or unexpected behavior?

@avehtari Nope. It may help that my input space is always similar, because inputs are always N(0,1), so my limits are just between -3*(5/2) and 3*(5/2).
Occasionally there are convergence problems (in terms of Rhats), but certainly less often than when using an exact GP.
What sort of behavior should I be on the lookout for?


I’m looking for any behaviour which would bother you, but I’m happy if you haven’t seen any problems.

Hi @avehtari, thank you for your reply! I have played with the basis function approximation and it works well! However, I want to further increase the number of 2D data points to 2500 or 3000. I ran the basis function approximation GP and in three days it only finished 2000 iterations, and I may need more. Do you think there are any tricks to further speed up the basis function approximation GP when the amount of data is large? Or what do people usually do with GPs when the number of data points is large? Many thanks in advance!

How many basis functions? Fewer basis functions will make it faster. What does your basis function Stan code look like? You could combine the basis functions with poisson_log_glm for at least a 4x speedup. You might also benefit from @bbbales2’s new adaptation and low-rank mass matrix code, but that requires installing a special version of CmdStan.


You can try out the new adaptation stuff with default cmdstan and rstan via: New adaptive warmup proposal (looking for feedback)!

Though the implementations of all that stuff are rather slow there :/. @xxy, if you try it, try it with a small model first, maybe 200~ params or less?

Thank you for your help! I used 40 basis functions and a boundary factor of 1.5, as I’m only sure that my normalised length scale is greater than 1*2/35 ≈ 0.057. I attached my Stan code here. gp.stan (2.6 KB) Do you have any examples which combine poisson_log_glm and basis functions?

Thank you for your help! I’ll have a look at cmdstan first:)

Are you sure you need 40?

Ah, this is not optimal and can’t be directly switched to poisson_log_glm. You would need to create the full design matrix instead.

You could try generating one with brms make_stancode, using tensor splines instead of a GP, for 2D and that many observations. Although the GP basis function approach you are using can be slightly better as a prior, mgcv creates the full design matrix for you and then it’s trivial to use poisson_log_glm. The GP basis function approach could also produce that full design matrix, but that is not what the code you are using does.
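As a rough, hedged sketch of that route (the formula, the data frame d, and the idea of hand-editing the generated code are illustrative, not a worked solution):

library(brms)

# inspect the Stan code and data brms/mgcv would generate for a 2D
# tensor-product spline with a Poisson likelihood; the spline design
# matrices appear in the generated data
scode <- make_stancode(y ~ t2(x1, x2), data = d, family = poisson())
sdata <- make_standata(y ~ t2(x1, x2), data = d, family = poisson())

# the generated model block could then be edited by hand to use
# target += poisson_log_glm_lpmf(y | X, alpha, beta) with that design matrix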


Remember there’s a problem with poisson_log_glm right now. If you use it, use the target += poisson_log_glm_lpmf syntax, not the ~ poisson_log_glm syntax (poisson_log_glm sampling statement fails to fit · Issue #1498 · stan-dev/math · GitHub).
