Hello everyone,
I am trying to do a Guassian Process Regression on a big dataset with only one input/output. This data has an excess of zeros. What I need from this regression are the fitted values and the predicted values.
When I do the GP regression using other methods like “gausspr” (in R package Kernlab) I get a “good” (not all zero) approximation of the model fitted values to the real values. However, when I try to do the GP regression in STAN, the fitted values are always zero or really close to zero (when the real maximum value of the data is around 50). The predicted values do not see so far off to the real data as the fitted values do.
I added a function in the code to compute the fitted values modifying a little bit the code in the manual. It pretty much do the same operations that are made in the models block, but this time I use them for the generated quantities block. I dont know if this is a correct method of doing it. Is there maybe a more straightforward way? I attach my model code and some data.
Fitandpred= "
functions {
matrix gp_fitted (vector y1, real[] x1,real alpha,
real rho, real sigma, real delta) {
int N1 = rows(y1);
matrix[N1,N1] k;
matrix[N1,N1] lk;
{k = cov_exp_quad(x1, alpha, rho);
for (n1 in 1:N1)
k[n1, n1] = k[n1, n1] + square(sigma);
lk = cholesky_decompose(k);
}
return lk;
}
vector gp_pred_rng(real[] x2, vector y1, real[] x1,
real alpha, real rho, real sigma, real delta) {
int N1 = rows(y1);
int N2 = size(x2);
vector[N2] f2;
{
matrix[N1, N1] L_K;
vector[N1] K_div_y1;
matrix[N1, N2] k_x1_x2;
matrix[N1, N2] v_pred;
vector[N2] f2_mu;
matrix[N2, N2] cov_f2;
matrix[N2, N2] diag_delta;
matrix[N1, N1] K;
K = cov_exp_quad(x1, alpha, rho);
for (n in 1:N1)
K[n, n] = K[n,n] + square(sigma);
L_K = cholesky_decompose(K);
K_div_y1 = mdivide_left_tri_low(L_K, y1);
K_div_y1 = mdivide_right_tri_low(K_div_y1',L_K)';
k_x1_x2 = cov_exp_quad(x1, x2, alpha, rho);
f2_mu = (k_x1_x2' * K_div_y1);
v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
cov_f2 = cov_exp_quad(x2, alpha, rho) - v_pred' * v_pred;
diag_delta = diag_matrix(rep_vector(delta,N2));
f2 = multi_normal_rng(f2_mu, cov_f2 + diag_delta);
}
return f2; }
}
data {
int<lower=1> N1;
real x1[N1];
vector[N1] y1;
int<lower=1> N2;
real x2[N2];
}
transformed data {
vector[N1] mu = rep_vector(0, N1);
real delta = 1e-9;
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
} model {
matrix[N1, N1] L_K;
{
matrix[N1, N1] K = cov_exp_quad(x1, alpha, rho);
real sq_sigma = square(sigma);
// diagonal elements
for (n1 in 1:N1)
K[n1, n1] = K[n1, n1] + sq_sigma;
L_K = cholesky_decompose(K);
}
rho ~ inv_gamma(8, 30);
alpha ~ normal(0, 1);
sigma ~ normal(0, 10);
y1 ~ multi_normal_cholesky(mu, L_K);
}
generated quantities {
vector[N2] f2;
vector[N2] y2;
matrix[N1,N1] lk;
vector[N1] y_tilde;
f2 = gp_pred_rng(x2, y1, x1, alpha, rho, sigma, delta);
for (n2 in 1:N2)
y2[n2] = normal_rng(f2[n2], sigma);
lk = gp_fitted(y1, x1, alpha, rho, sigma, delta);
y_tilde = multi_normal_cholesky_rng(rep_vector(0, N1), lk);
}
"
To get the fitted values I use:
yfit<-get_posterior_mean(fit, pars ="y_tilde")
The results I get can be seen on the attached plot. I use the mean of the STAN chains in the plot. All the STAN fitted values (green) are zero or very close to zero while the other GP regression method seems to get some of the values right.
Does someone have some insight to fix this?
Thanks
gp.data.R (5.6 KB)
My first thought is to try and adjust the hyperparameters priors but apart from trial and error I do not know other option.