Hilbert space approximate GP prior: why use de-meaned basis functions (PHI)?

Hi All,

I have been reading Aki’s GP demonstration with Stan (Motorcycle data) and the corresponding paper on Hilbert space approximate GPs. In the Motorcycle demonstration, Aki de-means the columns of the PHI matrix (basis functions).

Compared to non-centered basis functions, the prior with centered basis functions seems to have lower variance toward the center of the domain. As far as I can tell, non-centered basis functions are more representative of the exact GP here (constant variance). My question is: why are the columns of PHI de-meaned? Is this best practice or a quirk of this case study?

I have included Stan and R code below to draw realisations from the approximate GP prior under both cases.

gb_question.stan:

functions {
  // Spectral densities.
  vector diagSPD_EQ(real alpha, real rho, real L, int M) {
    vector[M] j = linspaced_vector(M, 1, M);
    vector[M] lambda = (j * pi() / (2*L))^2;
    vector[M] s = (alpha^2) * sqrt(2*pi()) * rho * exp(-0.5 * (rho^2) * lambda);
    return sqrt(s);
  }
  // Basis functions [N x M]
  matrix PHI(int N, int M, real L, vector x, int demean) {
    vector[M] j = linspaced_vector(M, 1, M);
    vector[M] sqrt_lambda = j * pi() / (2 * L);
    matrix[N, M] x_L = rep_matrix(x + L, M);
    matrix[N, M] phi = sin(diag_post_multiply(x_L, sqrt_lambda)) / sqrt(L);
    if (demean)
      for (m in 1:M) phi[, m] = phi[, m] - mean(phi[, m]);
    return phi;
    }
}

data {
  int<lower=1> N;      // number of observations
  vector[N] x;         // univariate covariate
        
  real<lower=0> c_f;   // boundary factor
  int<lower=1> M_f;    // number of basis functions
  real<lower=0> lengthscale_f; // lengthscale of f
  real<lower=0> sigma_f;       // scale of f
  int<lower = 0, upper = 1> center_phi; // De-mean basis functions?
}

transformed data {
  // Normalize data
  real xmean = mean(x);
  real xsd = sd(x);
  vector[N] xn = (x - xmean)/xsd;
  // Boundary value
  real L_f = c_f * max(xn);
  // Basis functions for f
  matrix[N,M_f] PHI_f = PHI(N, M_f, L_f, xn, center_phi);
  // spectral densities
  vector[M_f] diagSPD_f = diagSPD_EQ(sigma_f, lengthscale_f, L_f, M_f);
}

generated quantities {
  // Random draws from f
  vector[N] f =  PHI_f * (diagSPD_f .* to_vector(normal_rng(0, rep_vector(1, M_f))));
}

R code calling Stan:

fileq <- "./output/gb_question.stan"
# Compile model
modq <- cmdstan_model(stan_file = fileq, include_paths = ".")

# Input data
dat <- list(x = seq(0,1,length.out=100), 
            N = 100,
            c_f = 1.5, 
            M_f = 40,
            sigma_f = 1,
            lengthscale_f = 1, 
            center_phi = 1 # bool for centering basis functions
            )

# Centered phi
s_c <- modq$sample(data=dat, fixed_param=TRUE, 
                   iter_warmup = 0, iter_sampling=1000)

# Non-centered phi
dat$center_phi <- 0
s_nc <- modq$sample(data=dat, fixed_param=TRUE, 
                   iter_warmup = 0, iter_sampling=1000)

R code to visualise 90% density intervals.

# Plotting helper
f_q <- function(s, x=xn, id="x") {
  f <- s$draws('f')
  fmat <- matrix(0, nrow=dim(f)[1], ncol=dim(f)[3])
  for (i in 1:dim(f)[1]) fmat[i, ] <- as.numeric(f[i, 1, ])
  u <- apply(fmat, 2, quantile, prob=0.95)
  l <- apply(fmat, 2, quantile, prob=0.05)
  data.frame(x = x, u=u, l=l, phi=id)
}

# Plotting
xn <- as.numeric(scale(dat$x))
dq <- rbind(f_q(s_c, id="centered bf"), f_q(s_nc, id="non-centered bf"))
dq %>% 
  ggplot(aes(x=x, ymin=l, ymax=u, lty=phi, col=phi, fill=phi)) + 
  geom_ribbon(lwd=0.8, alpha = 0.3) + 
  theme_classic() + 
  labs(title = "90% Coverage Interval", y = "f") + 
  theme(legend.text = element_text(size=12), 
        legend.position = 'top', 
        legend.title = element_blank()
        )

Any advice greatly appreciated. I’m guessing I have missed something simple.

I don’t seem to be able to delete or edit this post (not enough street cred), so I’m replying with what seems to be the answer:

The basis functions are made to have zero means to prevent the otherwise strong correlation between the GP intercept (if used) and the first basis function.

This is explained in the Birthdays workflow document of Gelman, Vehtari, Simpson, et al.

2 Likes

Thanks for responding back and I prefer that you responded as you did with the solution. I went ahead and marked your answer as the solution so it shows up in your original post.

3 Likes

I still dont get why it is OK to do the zero meaning, as it changes the model (and your results as you see) so that it is no longer an approximation to the GP, right?

2 Likes

I have to say, I’ve also found this ad-hoc solution fishier than removing the intercept term (in the birthday case study).

Thanks, @marty, for the clear demonstration of the issue.

You correctly found my reasoning, but as demonstrated by your code, I made a mistake of ignoring the effect in the variance. The issue is complicated, as I did empirical centering within the box. The reduction of the posterior correlation was not originally discussed by Solin & Särkkä, but I’ll ask Solin next week if he has thought about it. I assume there is a way to reduce the correlation while keeping the variance correct, but I don’t have time to look at this before next week.

I removed the solution mark, as we don’t have yet solution what would be the correct way to reduce the posterior correlation.

Removing the intercept term removes just one correlating variable, but half of the basis functions are still correlating, and thus removing the intercept term is not a complete solution. In birthday example removing the intercept term is “ad hoc” itself, and possible because the data has been empirically centered. In both motorcycle and birthday example, when GP is used for a scale parameter, such centering is more difficult (especially in the birthday where the global scale fort hta component is unknown as there are many components). Thus having zero mean basis functions with correct prior behavior would be beneficial.

Although the centering of the basis functions did change the prior, in bot birthday and motorcycle examples, likelihood is that informative that the change in the prior is unlikely to be detectable. But of course I need to fix these examples to be correct.

1 Like

Sure, I just meant that mathematically it’s relatively clear what you do to the model if you remove the intercept term, while it’s not clear what happens if you modify the basis functions, especially as you are adding something which does not play well with the original derivation of the basis function expansion (something about eigenfunctions of the Laplace operator with zero dirichlet boundary conditions). I suspected something as demonstrated above would happen.

1 Like

Thank you very much for the clarification.

I have now removed the centering from the Motorcycle and Birthdays case studies. Thanks again @Marty for reporting the error with great clarity.

I guess I’ll ask my new soon starting postdoc @Niko to look at more carefully possibilities of improving the posterior sampling speed when using basis functions :-)

2 Likes

Perfect, I already have a few questions that are waiting for to be answered (by myself I think).

2 Likes