Gaussian Process regression with excess of zeros

Hello everyone,

I am trying to do a Guassian Process Regression on a big dataset with only one input/output. This data has an excess of zeros. What I need from this regression are the fitted values and the predicted values.
When I do the GP regression using other methods like “gausspr” (in R package Kernlab) I get a “good” (not all zero) approximation of the model fitted values to the real values. However, when I try to do the GP regression in STAN, the fitted values are always zero or really close to zero (when the real maximum value of the data is around 50). The predicted values do not see so far off to the real data as the fitted values do.

I added a function in the code to compute the fitted values modifying a little bit the code in the manual. It pretty much do the same operations that are made in the models block, but this time I use them for the generated quantities block. I dont know if this is a correct method of doing it. Is there maybe a more straightforward way? I attach my model code and some data.

Fitandpred= "
functions {

matrix gp_fitted (vector y1, real[] x1,real alpha,
real rho, real sigma, real delta) {
int N1 = rows(y1);
matrix[N1,N1] k;
matrix[N1,N1] lk;
{k = cov_exp_quad(x1, alpha, rho);
for (n1 in 1:N1)
k[n1, n1] = k[n1, n1] + square(sigma);
lk = cholesky_decompose(k);
}
return lk;
}

vector gp_pred_rng(real[] x2, vector y1, real[] x1,
real alpha, real rho, real sigma, real delta) {
int N1 = rows(y1);
int N2 = size(x2);
vector[N2] f2;
{
  matrix[N1, N1] L_K;
  vector[N1] K_div_y1;
  matrix[N1, N2] k_x1_x2;
  matrix[N1, N2] v_pred;
  vector[N2] f2_mu;
  matrix[N2, N2] cov_f2;
  matrix[N2, N2] diag_delta;
  matrix[N1, N1] K;
  K = cov_exp_quad(x1, alpha, rho);
  for (n in 1:N1)
  K[n, n] = K[n,n] + square(sigma);
  L_K = cholesky_decompose(K);
  K_div_y1 = mdivide_left_tri_low(L_K, y1);
  K_div_y1 = mdivide_right_tri_low(K_div_y1',L_K)';
  k_x1_x2 = cov_exp_quad(x1, x2, alpha, rho);
  f2_mu = (k_x1_x2' * K_div_y1);
  v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
  cov_f2 = cov_exp_quad(x2, alpha, rho) - v_pred' * v_pred;
  diag_delta = diag_matrix(rep_vector(delta,N2));
  f2 = multi_normal_rng(f2_mu, cov_f2 + diag_delta);
}
return f2; }
}

data {
int<lower=1> N1;
real x1[N1];
vector[N1] y1;
int<lower=1> N2;
real x2[N2];
}
transformed data {
vector[N1] mu = rep_vector(0, N1);
real delta = 1e-9;
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
} model {
matrix[N1, N1] L_K;
{
  matrix[N1, N1] K = cov_exp_quad(x1, alpha, rho);
  real sq_sigma = square(sigma);
  // diagonal elements
  for (n1 in 1:N1)
  K[n1, n1] = K[n1, n1] + sq_sigma;
  L_K = cholesky_decompose(K);
}
rho ~ inv_gamma(8, 30);
alpha ~ normal(0, 1);
sigma ~ normal(0, 10);
y1 ~ multi_normal_cholesky(mu, L_K);
}
generated quantities {
vector[N2] f2;
vector[N2] y2;
matrix[N1,N1] lk;
vector[N1] y_tilde;

f2 = gp_pred_rng(x2, y1, x1, alpha, rho, sigma, delta);
for (n2 in 1:N2)
y2[n2] = normal_rng(f2[n2], sigma);

lk = gp_fitted(y1, x1, alpha, rho, sigma, delta);
y_tilde = multi_normal_cholesky_rng(rep_vector(0, N1), lk);
}
"

To get the fitted values I use:

yfit<-get_posterior_mean(fit, pars ="y_tilde")

The results I get can be seen on the attached plot. I use the mean of the STAN chains in the plot. All the STAN fitted values (green) are zero or very close to zero while the other GP regression method seems to get some of the values right.

Does someone have some insight to fix this?

Thanks

gp.data.R (5.6 KB)

My first thought is to try and adjust the hyperparameters priors but apart from trial and error I do not know other option.

Is the data possibly count data? If so, you’ll want to model it explicitly as such, possibly including zero inflation as necessary.

1 Like

It could be count data yes. In some later point I want to try modelling it like it. However what bothers me right now is that with the other GP functions available in R I don’t have to specify a new distribution (count) and I still get some good approximations for the fitted values. There is something that I am doing wrong in the calculation of the fitted values or the model works differently from the other GP models in R, and I tend to think it is the former.

In Stan you’ll want to model all of the structure in the data that you know. A zero-inflated Poisson with the rate given by a Gaussian process is straightforward to build and fit in Stan.

Overall the reason you’re seeing differences is that your Stan program utilizes principled priors where as canned implementations often have no regularization at all. Another is that Stan goes to great lengths to explore all of the GP configurations consistent with the data whereas many canned implementations simply optimize the hyperparameters which typically overfits. See, for example, the series of case studies beginning with https://betanalpha.github.io/assets/case_studies/gp_part1/part1.html.

2 Likes

I am sorry but could you give me an example of how to model a zero inflated poisson model? I am lost there.

It may go against most of the data modelling practices but I am actually trying to overfit this model.

The manual has an example of both zero inflation and hurdle models for count data with extra zeros.