How to make GP with non-Gaussian likelihood more efficient?

Thank you for your reply! I followed the “Predictive Inference in non-Gaussian GPs” section in https://mc-stan.org/docs/2_19/stan-users-guide/fit-gp-section.html. I think for predictive inference, the covariance matrix in the transformed parameters block ends up being 1296 × 1296?

Ah, but the following “Analytical Form of Joint Predictive Inference” section shows a faster variant where you build only the smaller observed-data covariance in the model block and construct the larger prediction covariance in the generated quantities block.

Yes, I’m aware of that. But I thought it only applies to a Gaussian likelihood and not to a Poisson likelihood, since the posterior predictive distribution has no closed form when the likelihood is Poisson?


Hm, you may be right that the analytic interpolation trick is only possible with Gaussian outcomes. This is now beyond my expertise; @avehtari, do you have any thoughts?


I would use brms’s approximate GP (gpa) approach. It’s detailed here, and can be used for multidimensional GPs as well. It’s already in brms, so you can just do brm(y ~ gp(x1, x2, k = 50), family = …). It is magical. It turns a 2-hour model into a 20-second model, and an intractably large GP into something that takes 2-3 hours. The gist of it is that it decomposes the RBF kernel function into eigenfunctions and eigenvalues. The larger the number of basis functions used (k), the more accurate it is. I’ve found that even with 50, it tends to do very well. It’s far faster, nearly as accurate, converges better, and prediction is simpler. brms, again, will handle the prediction side for you.


Many thanks for your suggestions! I’ll try this tutorial then.

Correct.

Stan will soon have the possibility to use the Laplace method to integrate over the latent values, which is usually very accurate with a Poisson likelihood and faster than sampling the latent values.

brms has two different GP approximations: s(x1, bs="gp") and gp(x1, k=20). The first is based on the spline basis function representation of a Matérn covariance function GP from the mgcv package, and the latter is based on the basis function approximation via Laplace eigenfunctions (based on Solin & Särkkä, Hilbert space methods for reduced-rank Gaussian process regression; further implementation details will be described in a paper under preparation).
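For reference, in 1D the approximation from that paper is roughly (on the domain $[-L, L]$ with $M$ basis functions; this is just a summary of the reference, not brms internals)

$$
f(x) \approx \sum_{m=1}^{M} \sqrt{S_\theta\!\left(\sqrt{\lambda_m}\right)}\,\phi_m(x)\,\beta_m,
\qquad \beta_m \sim \mathrm{normal}(0, 1),
$$

where $\phi_m(x) = \sqrt{1/L}\,\sin\!\left(\sqrt{\lambda_m}\,(x+L)\right)$, $\lambda_m = \left(m\pi/(2L)\right)^2$, and $S_\theta$ is the spectral density of the chosen covariance function (for the exponentiated quadratic, $S_\theta(\omega) = \alpha^2 \ell \sqrt{2\pi}\,\exp(-\ell^2\omega^2/2)$). The hyperparameters enter only through the scaling of the basis function weights, so the basis functions themselves can be precomputed.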

s(x1, bs="gp") works for 1D and 2D

gp(x1, k=20) also works for 3D and sometimes up to 4D, given that the approximated function is quite smooth. The total number of basis functions is k^D.

The current exact GP in brms, gp(x1), does not use the more efficient built-in covariance functions, but for small data it can still be useful for more than 4 dimensions.

Given the “1296 2-dimensional inputs x” in your case, I would use one of the basis function approaches.

@Stephen_Martin, can you describe a bit more what kind of data and models you have had when using gp(x1, x2, k=20) in brms? We are interested in learning about success and failure stories before finishing the paper.


@avehtari I am currently using additive univariate kernels (so, in brms terms: gp(x1) + gp(x2)) [there’s a reason for this]. I’ve been using generated data, and tested on a couple of others. Typically, it’s std. normal predictors with N between 300 and 1000. I am currently applying it to latent variable models (e.g., in CFA/SEM). Specifically, I have a latent location variable for J indicators, and a corresponding latent scale variable for the J indicators’ residual variances. I have the latent location variable predicting the latent scale variable via an approximate GP.
This lets us have non-linear functions of reliability over latent space, à la IRT test information.

@Stephen_Martin have you noticed any strange or unexpected behavior?

@avehtari Nope. It may help that my input space is always similar, because inputs are always N(0,1), so my limits are just between -3*(5/2) and 3*(5/2).
Occasionally there are convergence problems (in terms of Rhats), but certainly less often than when using an exact GP.
What sort of behavior should I be on the lookout for?


I’m looking for any behaviour which would bother you, but I’m happy if you haven’t seen any problems.

Hi @avehtari, thank you for your reply! I have played with the basis function approximation and it works well! However, I want to further increase the number of 2D data points to 2500 or 3000. I ran the basis function approximation GP, and in three days it only finished 2000 iterations; I may need more. Do you think there are any tricks to further speed up the basis function approximation GP when the number of data points is large? Or what do people usually do with GPs when the number of data points is large? Many thanks in advance!

How many basis functions? Fewer basis functions will make it faster. What does your basis function model’s Stan code look like? You could combine the basis functions with poisson_log_glm for at least a 4x speedup. You might also benefit from @bbbales2’s new adaptation and low-rank mass matrix code, but that requires installing a special version of CmdStan.
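To make the “basis functions + poisson_log_glm” idea concrete, here is a minimal sketch, assuming a 1D input, an exponentiated quadratic kernel, and the Hilbert-space basis functions from the Solin & Särkkä paper mentioned above; the names (M, L, PHI, spd_se, …) are illustrative, not taken from your code:

```stan
functions {
  // spectral density of the exponentiated quadratic kernel (1D)
  real spd_se(real alpha, real rho, real w) {
    return square(alpha) * sqrt(2 * pi()) * rho * exp(-0.5 * square(rho * w));
  }
}
data {
  int<lower=1> N;
  int<lower=1> M;           // number of basis functions
  real<lower=0> L;          // boundary, e.g. 1.5 * max(|x|)
  vector[N] x;
  int<lower=0> y[N];        // Poisson counts
}
transformed data {
  matrix[N, M] PHI;         // basis functions evaluated once, up front
  for (m in 1:M)
    PHI[1:N, m] = sqrt(1.0 / L) * sin(m * pi() / (2 * L) * (x + L));
}
parameters {
  real intercept;
  real<lower=0> alpha;      // marginal standard deviation
  real<lower=0> rho;        // length scale
  vector[M] beta;           // standardized basis function weights
}
transformed parameters {
  // weights scaled by the square root of the spectral density
  vector[M] b;
  for (m in 1:M)
    b[m] = sqrt(spd_se(alpha, rho, m * pi() / (2 * L))) * beta[m];
}
model {
  intercept ~ normal(0, 1);
  alpha ~ normal(0, 1);
  rho ~ normal(0, 1);
  beta ~ normal(0, 1);
  // GLM likelihood over the fixed design matrix PHI
  target += poisson_log_glm_lpmf(y | PHI, intercept, b);
}
```

The point is that PHI is computed once in transformed data, so each leapfrog step only has to do the matrix-vector product inside the GLM likelihood.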


You can try out the new adaptation stuff with default cmdstan and rstan via: New adaptive warmup proposal (looking for feedback)!

Though the implementations of all that stuff are rather slow there :/. @xxy, if you try it, try it with a small model first, maybe ~200 params or fewer?

Thank you for your help! I used 40 basis functions and a boundary factor of 1.5, as I’m only sure my normalised length scale is greater than 1*2/35=0.05. I attached my Stan code here: gp.stan (2.6 KB). Do you have any examples which combine poisson_log_glm and basis functions?

Thank you for your help! I’ll have a look at CmdStan first :)

Are you sure you need 40?

Ah, this is not optimal and can’t be directly switched to poisson_log_glm. You would need to create the full design matrix instead.

You could try to generate one with brms’ make_stancode and use tensor splines instead of a GP for 2D and that many observations. Although the GP basis function approach you use can be slightly better as a prior, mgcv creates the full design matrix for you, and then it’s trivial to use poisson_log_glm. The GP basis function approach could also produce such a full design matrix, but that’s just not done in the code you are using.


Remember there’s a problem with poisson_log_glm right now. If you use it, use the target += poisson_log_glm_lpmf syntax, not the ~ poisson_log_glm syntax (https://github.com/stan-dev/math/issues/1498).
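For reference, the two forms look roughly like this (y, X, alpha, and b are placeholders):

```stan
// affected by the bug linked above:
// y ~ poisson_log_glm(X, alpha, b);
// use this form instead:
target += poisson_log_glm_lpmf(y | X, alpha, b);
```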

  1. How do I select the number of basis functions? I selected it by looking at Figure 6 (https://github.com/gabriuma/basis_functions_approach_to_GP/blob/master/Paper/manuscript/manuscript_06.pdf): I only knew the normalised length scale is greater than 0.05 and took c=1.5, so I used 40 basis functions. I also tried 30 basis functions, and there is not much difference in terms of performance.

  2. I adapted the code to be ‘optimal’ and it does run faster than my previous code. Thanks! However, when I change poisson_log to poisson_log_glm, it becomes slower. Is that because of the additional term in my design matrix, which changes over iterations? gp1.stan (4.1 KB)

  3. I found the spline method s( , bs=‘gp’) is faster, but if I understand correctly, it does not estimate the hyperparameters of the kernel and I need to fix them as inputs, right? Is it possible to use this method while still estimating the hyperparameters?

  4. You mentioned New adaptive warmup proposal (looking for feedback)! However, I think it only helps with the warm-up period, doesn’t it?

Sorry for my late reply; I just tried these options. Thank you very much for your help!


append_col and append_row can be slow, as they create a new temporary variable at every leapfrog step. It might be faster to instead create a full-size design matrix and a full-size coefficient vector in transformed parameters and assign the constant and variable terms to the desired places. Then the matrix and coefficient vector are not created again each iteration; instead, just their values are assigned. @Bob_Carpenter might know better what the best thing to do would be here.
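A rough sketch of that suggestion, with illustrative names (PHI, z, diag_spd, gamma, etc. are placeholders, not from gp1.stan); only the affected blocks are shown:

```stan
transformed parameters {
  // full-size design matrix and coefficient vector, filled in place
  matrix[N, M + 1] X;
  vector[M + 1] b;
  X[1:N, 1:M] = PHI;            // constant basis-function columns (precomputed)
  X[1:N, M + 1] = z;            // the column that depends on parameters
  b[1:M] = diag_spd .* beta;    // scaled basis-function weights
  b[M + 1] = gamma;             // coefficient for the extra column
}
model {
  // instead of calling append_col(PHI, z) every leapfrog step
  target += poisson_log_glm_lpmf(y | X, intercept, b);
}
```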

It fixes the length scale but has the magnitude as a parameter, which makes the posterior easier.

Eventually it will include adaptive mass matrix selection, which improves speed also after the warmup.
