Hilbert space approximate GP prior: why use de-meaned basis functions (PHI)?

Hi All,

I have been reading Aki’s GP demonstration with Stan (Motorcycle data) and the corresponding paper on Hilbert space approximate GPs. In the Motorcycle demonstration, Aki de-means the columns of the PHI matrix (basis functions).

Compared to non-centered basis functions, the prior with centered basis functions seems to have lower variance toward the center of the domain. As far as I can tell, non-centered basis functions are more representative of the exact GP here (constant variance). My question is: why are the columns of PHI de-meaned? Is this best practice or a quirk of this case study?

I have included Stan and R code below to draw realisations from the approximate GP prior under both cases.

gb_question.stan:

functions {
  // Square root of the spectral density of the exponentiated-quadratic
  // kernel, evaluated at the square roots of the eigenvalues lambda_j.
  vector diagSPD_EQ(real alpha, real rho, real L, int M) {
    vector[M] j = linspaced_vector(M, 1, M);
    // square() is vectorized; ^ is not defined elementwise for vectors
    vector[M] lambda = square(j * pi() / (2 * L));
    vector[M] s = (alpha^2) * sqrt(2 * pi()) * rho * exp(-0.5 * (rho^2) * lambda);
    return sqrt(s);
  }
  // Basis functions [N x M]
  matrix PHI(int N, int M, real L, vector x, int demean) {
    vector[M] j = linspaced_vector(M, 1, M);
    vector[M] sqrt_lambda = j * pi() / (2 * L);
    matrix[N, M] x_L = rep_matrix(x + L, M);
    matrix[N, M] phi = sin(diag_post_multiply(x_L, sqrt_lambda)) / sqrt(L);
    if (demean)
      for (m in 1:M) phi[, m] = phi[, m] - mean(phi[, m]);
    return phi;
  }
}

data {
  int<lower=1> N;      // number of observations
  vector[N] x;         // univariate covariate
        
  real<lower=0> c_f;   // boundary factor
  int<lower=1> M_f;    // number of basis functions
  real<lower=0> lengthscale_f; // lengthscale of f
  real<lower=0> sigma_f;       // scale of f
  int<lower = 0, upper = 1> center_phi; // De-mean basis functions?
}

transformed data {
  // Normalize data
  real xmean = mean(x);
  real xsd = sd(x);
  vector[N] xn = (x - xmean)/xsd;
  // Boundary value
  real L_f = c_f * max(xn);
  // Basis functions for f
  matrix[N,M_f] PHI_f = PHI(N, M_f, L_f, xn, center_phi);
  // spectral densities
  vector[M_f] diagSPD_f = diagSPD_EQ(sigma_f, lengthscale_f, L_f, M_f);
}

generated quantities {
  // Draw from the approximate GP prior:
  // f = PHI_f * (diagSPD_f .* z) with z ~ N(0, I_M),
  // i.e. f ~ N(0, PHI_f * diag(diagSPD_f^2) * PHI_f').
  vector[N] f = PHI_f * (diagSPD_f .* to_vector(normal_rng(0, rep_vector(1, M_f))));
}

R code calling Stan:

library(cmdstanr)

fileq <- "./output/gb_question.stan"
# Compile model
modq <- cmdstan_model(stan_file = fileq, include_paths = ".")

# Input data
dat <- list(x = seq(0,1,length.out=100), 
            N = 100,
            c_f = 1.5, 
            M_f = 40,
            sigma_f = 1,
            lengthscale_f = 1, 
            center_phi = 1 # bool for centering basis functions
            )

# Centered phi
s_c <- modq$sample(data=dat, fixed_param=TRUE, 
                   iter_warmup = 0, iter_sampling=1000)

# Non-centered phi
dat$center_phi <- 0
s_nc <- modq$sample(data=dat, fixed_param=TRUE, 
                    iter_warmup = 0, iter_sampling=1000)

R code to visualise pointwise 90% intervals:

library(posterior)
library(dplyr)
library(ggplot2)

# Plotting helper: 90% pointwise interval of f across all draws
f_q <- function(s, x = xn, id = "x") {
  fmat <- as_draws_matrix(s$draws("f"))  # [draws x N], all chains
  data.frame(x = x,
             u = apply(fmat, 2, quantile, probs = 0.95),
             l = apply(fmat, 2, quantile, probs = 0.05),
             phi = id)
}

# Plotting
xn <- as.numeric(scale(dat$x))
dq <- rbind(f_q(s_c, id="centered bf"), f_q(s_nc, id="non-centered bf"))
dq %>% 
  ggplot(aes(x=x, ymin=l, ymax=u, lty=phi, col=phi, fill=phi)) + 
  geom_ribbon(lwd=0.8, alpha = 0.3) + 
  theme_classic() + 
  labs(title = "90% intervals of the approximate GP prior", y = "f") + 
  theme(legend.text = element_text(size=12), 
        legend.position = 'top', 
        legend.title = element_blank()
        )
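
Since f = PHI_f * (diagSPD_f .* z) with z ~ N(0, I), the pointwise prior variance is Var[f(x_n)] = sum_m SPD_m * phi_m(x_n)^2, so the two priors can also be compared analytically, without sampling. A minimal R sketch of that check (the helper name and defaults are mine, mirroring the Stan code above):

# Analytic pointwise prior sd: Var[f(x)] = sum_m S(sqrt(lambda_m)) * phi_m(x)^2
prior_sd <- function(x, M, L, alpha = 1, rho = 1, demean = FALSE) {
  sqrt_lambda <- (1:M) * pi / (2 * L)
  phi <- sqrt(1 / L) * sin(outer(x + L, sqrt_lambda))  # N x M basis matrix
  if (demean) phi <- sweep(phi, 2, colMeans(phi))      # de-mean columns
  spd <- alpha^2 * sqrt(2 * pi) * rho * exp(-0.5 * rho^2 * sqrt_lambda^2)
  sqrt(as.vector(phi^2 %*% spd))
}

xn <- as.numeric(scale(seq(0, 1, length.out = 100)))
L_f <- 1.5 * max(xn)
plot(xn, prior_sd(xn, 40, L_f), type = "l", ylim = c(0, 1.2),
     xlab = "x (standardised)", ylab = "prior sd of f")
lines(xn, prior_sd(xn, 40, L_f, demean = TRUE), lty = 2)
legend("bottom", c("non-centered bf", "centered bf"), lty = 1:2, bty = "n")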

Any advice greatly appreciated. I’m guessing I have missed something simple.

I don’t seem to be able to delete or edit this post (not enough street cred), so I’m replying with what seems to be the answer:

The basis functions are made to have zero means to prevent the otherwise strong correlation between the GP intercept (if used) and the first basis function.

This is explained in the Birthdays workflow document of Gelman, Vehtari, Simpson, et al.
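
To see why that correlation arises, here is a quick R sketch (grid and constants as in the question): over the observed domain, the first basis function is a slowly varying half-sine that stays positive, so its column is nearly confounded with a constant intercept column.

# The first basis function over the standardised grid: its mean is
# large relative to its sd, i.e. it is close to a constant column
# and hence nearly collinear with an intercept.
xn <- as.numeric(scale(seq(0, 1, length.out = 100)))
L_f <- 1.5 * max(xn)
phi1 <- sqrt(1 / L_f) * sin(pi * (xn + L_f) / (2 * L_f))
c(mean = mean(phi1), sd = sd(phi1))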


Thanks for following up, and I’m glad you replied with the solution. I went ahead and marked your answer as the solution so that it shows up in your original post.


I still don’t get why it is OK to do the zero-meaning, as it changes the model (and your results, as you can see) so that it is no longer an approximation to the GP, right?


I have to say, I’ve also found this ad-hoc solution fishier than removing the intercept term (in the birthday case study).

Thanks, @marty, for the clear demonstration of the issue.

You correctly identified my reasoning, but as your code demonstrates, I made the mistake of ignoring the effect on the variance. The issue is complicated, as I did the empirical centering within the box. Reducing the posterior correlation was not originally discussed by Solin & Särkkä, but I’ll ask Solin next week whether he has thought about it. I assume there is a way to reduce the correlation while keeping the variance correct, but I don’t have time to look at this before next week.

I removed the solution mark, as we don’t yet have a solution for what the correct way to reduce the posterior correlation would be.

Removing the intercept term removes just one correlating variable, but half of the basis functions still correlate with it, so removing the intercept term is not a complete solution. In the birthday example, removing the intercept term is itself “ad hoc”, and is possible only because the data have been empirically centered. In both the motorcycle and birthday examples, when a GP is used for a scale parameter, such centering is more difficult (especially in the birthday example, where the global scale for that component is unknown, as there are many components). Thus, having zero-mean basis functions with the correct prior behavior would be beneficial.
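
Which half correlates can be read off the column means: over a symmetric grid, the odd-indexed basis functions are even functions of x with non-zero means, while the even-indexed ones are odd functions that are already essentially zero-mean. A minimal R sketch (grid and constants are illustrative):

# Column means of the raw basis functions over a symmetric grid:
# odd-indexed columns (even functions of x) have non-zero means and
# correlate with an intercept; even-indexed columns are ~zero-mean.
x <- seq(-1, 1, length.out = 200)
L <- 1.5 * max(x)
phi <- sqrt(1 / L) * sin(outer(x + L, (1:8) * pi / (2 * L)))
round(colMeans(phi), 3)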

Although centering the basis functions did change the prior, in both the birthday and motorcycle examples the likelihood is so informative that the change in the prior is unlikely to be detectable. But of course I need to fix these examples to be correct.


Sure, I just meant that mathematically it’s relatively clear what you do to the model if you remove the intercept term, while it’s not clear what happens if you modify the basis functions, especially as you are adding something that does not play well with the original derivation of the basis function expansion (the basis functions are eigenfunctions of the Laplace operator with zero Dirichlet boundary conditions, which the de-meaned versions no longer satisfy). I suspected something like what is demonstrated above would happen.


Thank you very much for the clarification.

I have now removed the centering from the Motorcycle and Birthdays case studies. Thanks again @Marty for reporting the error with great clarity.

I guess I’ll ask my new postdoc @Niko, who is starting soon, to look more carefully at possibilities for improving posterior sampling speed when using basis functions :-)


Perfect, I already have a few questions that are waiting to be answered (by myself, I think).


Hello @Niko,

Do you have a solution for this already? I am currently dealing with a similar issue.

Would be great to hear from you!
Best regards,
Judith

Oh yes, I think we had a solution. I can probably have a look tomorrow.


@JudithBouman2412 I’m working from memory here, but I think what follows should be correct:

In the end, we didn’t follow through on the above problems, in part because other things took priority, and in part because the potential solution didn’t necessarily seem to be “better” than the problem.

  • First, you should absolutely not just demean the basis functions.
  • Whether or not you can or should remove the intercept term will depend on the problem you are trying to model. “Removing” the intercept implicitly sets a point-mass prior at zero; whether or not that’s appropriate… depends🤷.
  • If you want to keep the intercept term but have issues with sampling due to posterior dependencies, you can in principle get rid of those dependencies. However, if I remember correctly, this is where things get a bit hairy, and I wouldn’t necessarily recommend doing it.

I can try to dig up my experiments to see whether they are relevant to your use case (and whether I remember correctly), but I think some additional information about your model and/or issues would be needed.

Thanks so much for your reply Niko!

This might explain why I am having issues sampling from my model, as I am currently demeaning the basis functions.

However, I also have to consider some possible issues with my prior and a smoothing algorithm I am using. So I will dig into it a bit more and might ask you again with more detail once I have nailed down the exact issue.

Thanks again for being so helpful!

@JudithBouman2412

Okay, so I’ve found some (potentially pseudo-)code which I thought would recover the correct sampling behavior if you use demeaned basis functions. If you demean the basis functions, you have to change the prior on the intercept so that it takes into account what you removed from the basis functions.

If you start with intercept ~ normal(0, 1), then after demeaning the basis functions you would have to modify the intercept prior to read

// (pseudo-code: rowwise_mean() stands for the length-M vector of
// column means of raw_PHI_f, i.e. the mean over the N rows of each column)
intercept ~ normal(
    dot_product(
        rowwise_mean(raw_PHI_f), scaled_basis_function_weights
    ),
    1
);

where raw_PHI_f are the non-demeaned basis functions and scaled_basis_function_weights are the coefficients with which you multiply your (demeaned) basis functions.
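
The identity behind this shift: if the column means mu are subtracted from raw_PHI_f, then raw_PHI_f * w = demeaned_PHI_f * w + dot(mu, w), so moving dot(mu, w) into the intercept reproduces the original prior draws exactly. A small R check (names are mine, mirroring the pseudo-code above):

# Demeaned basis functions plus an intercept shifted by
# dot(colMeans(raw_PHI), w) reproduce raw_PHI %*% w exactly.
set.seed(1)
N <- 100; M <- 40
x <- as.numeric(scale(seq(0, 1, length.out = N)))
L <- 1.5 * max(x)
raw_PHI <- sqrt(1 / L) * sin(outer(x + L, (1:M) * pi / (2 * L)))
demeaned_PHI <- sweep(raw_PHI, 2, colMeans(raw_PHI))
w <- rnorm(M)  # stands in for scaled_basis_function_weights
shift <- drop(colMeans(raw_PHI) %*% w)
max(abs(raw_PHI %*% w - (demeaned_PHI %*% w + shift)))  # ~1e-15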

However, AFAIK, doing the above will introduce some nasty dependencies into the prior. Whether these dependencies carry over to the posterior and frustrate sampling will depend on your model+data.

It could be that doing the above fixes your issues, but it might not do so, and/or it might introduce additional ones.


Edit: The above being said, I can still imagine several other issues that could lead to slow sampling. I might be able to help, or at least tell you what you might have to change, but any advice has to depend on the posterior.
