Identifiability of Simple Gaussian Process

Hi All,

Context

I have been strugling to fit a simple GP model and have some questions about
identifiability. In my actual use-case, I wish to model a time-varying regression
coefficient as a GP:

y_t \sim N(\mu_t, \sigma)

\mu_t = \beta_t \cdot X_t

\beta_t \sim GP(m,\ K(t,t'))

Model
In the simplified model here, there is no covariate X, i.e. we almost
observe \beta_t directly, with a small amount of noise:

y_t \sim N(\beta_t,\ \sigma)

Here is the true (simulated) \beta (black) and the observation y (blue):

The modelled GP is an exponentiated quadratic, and I am using a Hilbert space / basis function
approximation with 20 basis functions (the results do not seem sensitive to this choice for the ranges I have tried).

My priors for \alpha (gp_alpha) and \rho (gp_rho) are gamma distributed, such that 98% of the prior probability is between 0.005 and 0.8 for \rho, and between 0.05 and 0.8 for \alpha. Here are some realisations from the prior on \beta.

I understand that the posterior is degenerate for values of \rho outside the range of distances observed in the data.

Issues

  • Every sample exceeds the default maximum treedepth.
  • I needed very strong priors on both \alpha and \rho just to avoid divergences. For example, if my prior on \alpha is slightly relaxed to have 98% probability between 0.05 and 1.0, instead of 0.05 and 0.8, I get divergences.

Here is the pairs plot for gp_alpha - gp_rho:

And here is the pairs plot using the slightly wider prior on \alpha:

Questions

  • Is there something obviously wrong with the way I have parameterised the model?
  • Does the pairs plot suggest that there is a large range of \alpha-\rho combinations that are consistent with the data?
  • Am I exceeding maximum tree-depth because the posterior is long and narrow over the weakly-identified \alpha-\rho region, and step size has been calibrated to resolve the narrow dimension, not the long dimension?
  • Is there any function of \alpha and \rho that is known to be more strongly identified? I.e. a GP analogue to the sum of regression coefficients for highly conditionally correlated regressors?

Any advice is appreciated.

Code

# Generate data
set.seed(9)
N <- 520
t <- (1:N) / N
beta <- 0.5 + 0.25 * sin(2*pi * t)
y <- rnorm(N, mean=beta, sd=0.02)

data <- list(N = N, t = t, y = y)
# GP priors and parameters

# put 98% of probability mass between 0.005 and 0.8 for gp_rho
opt_rho <- gamma_solve(c(0.005, 0.8))$par
# put 98% of probability mass between 0.005 and 0.8 for gp_alpha
opt_alpha <- gamma_solve(c(0.05, 0.8))$par

data$gp_rho_prior <- opt_rho # c(1.3, 6.6)
data$gp_alpha_prior <- opt_alpha # c(3.3, 11.2)
data$M_f <- 20
data$c_f <- 1.5

Stan program

// gp_prior_1_beta.stan
functions {
  vector diagSPD_EQ(real alpha, real rho, real L, int M) {
    return sqrt((alpha^2) * sqrt(2*pi()) * rho * exp(-0.5*(rho*pi()/2/L)^2 
    * linspaced_vector(M, 1, M)^2));
  }

  matrix PHI(int M, real L, vector x) {
    return sin(diag_post_multiply(rep_matrix(pi()/(2*L) * (x+L), M), 
    linspaced_vector(M, 1, M)))/sqrt(L);
    }
}
data {
  int<lower=1> N;
  vector[N] y;
  vector[N] t;
  vector[2] gp_alpha_prior;
  vector[2] gp_rho_prior;
  real<lower=0> c_f;   // factor c to determine the boundary value L
  int<lower=1> M_f;    // number of basis functions
}
transformed data {
  real L_f = c_f * max(t);
  // Basis functions for f
  matrix[N, M_f] PHI_f = PHI(M_f, L_f, t);
}
parameters {
  real<lower=0> sigma;
  real<lower=0> gp_alpha;
  real<lower=0> gp_rho;
  real gp_mu;
  vector[M_f] beta_f; // the basis functions coefficients
}
transformed parameters {
  // spectral densities
  vector[M_f] diagSPD_f = diagSPD_EQ(gp_alpha, gp_rho, L_f, M_f);
  // f ~ gp_mu + GP(0, K)
  vector[N] f = gp_mu + PHI_f * (diagSPD_f .* beta_f);
} 
model {
  // Mean model
  y ~ normal(f, sigma);
  // Priors
  beta_f ~ normal(0, 1);
  gp_rho ~ gamma(gp_rho_prior[1], gp_rho_prior[2]);
  gp_alpha ~ gamma(gp_alpha_prior[1], gp_alpha_prior[2]);
  gp_mu ~ normal(0.1, 0.3);
  sigma ~ normal(0, 0.1);
}

R code to run model

s1b <- cmdstanr::cmdstan_model(stan_file = "gp_prior_1_beta.stan")
fit1b <- s1b$sample(data=data, iter_warmup = 500, iter_sampling=1000, 
    chains = 4, parallel_chains = 4, max_treedepth = 10, 
    adapt_delta = 0.95, seed=1)

R gamma dist helper

gamma_solve <- function(bounds, prob=c(0.01, 0.01), plotit=TRUE, init=NULL) {
  lb <- bounds[1]
  ub <- bounds[2]
  f <- function(theta) {
    a <- exp(theta[1]) # shape
    b <- exp(theta[2]) # rate
    sum((c(pgamma(lb, a, b), 1 - pgamma(ub, a, b)) - prob)^2)
  }
  if (is.null(init)) init <- c(log(4), log(8))
  opt <- optim(par=init, fn=f)
  opt$par <- exp(opt$par)
  opt$prob <- pgamma(bounds, opt$par[1], opt$par[2])
  if (plotit) {
    curve(dgamma(x, opt$par[1], opt$par[2]), from=0, to=1, n=1000, lwd=2, 
          ylab="density", main=paste0(c("Shape: ", "Rate: "), 
                                      round(opt$par, 2), collapse = "\t"))
  }
  opt
}
3 Likes

pinging people who probably know more about this: @mike-lawrence, @avehtari

1 Like

Those priors don’t look like very strong.

When c_f ← 1.5, the first basis function is close to constant and there is strong correlation between gp_mu and beta_f[1] which can explain large treedepths. You could try dropping gp_mu, especially if y is centered to have mean zero.

Yes. You should plot also do pair plots with sigma, as in this case gp_alpha^2 + sigma^2 is approximately the estimated total variance.

More likely because gp_mu and beta_f[1] are highly correlating and you’re using diagonal mass matrix. You could try with dense mass matrix (low rank + diagonal would be more efficient, but not yet available)

I’m not aware, but there can be. More common is to integrate beta_f out analytically (if the model is normal-normal)

You could make the current model faster with normal_id_glm

You could use more informative prior for sigma, e.g. something which would make the values really close to 0 less likely if you know something about the minimum measurement accuracy

1 Like

I think you should try adjusting the “centeredness” of the basis function coefficients, depending on how well identified each coefficient is.

If this is to cryptic a recommendation, you can wait a few days and I might have an example with an explanation.

1 Like

Thank you very much for the detailed response. This is all very helpful.

I may have abstracted too far away from my actual problem with my simplified example, so I’ll just clarify on these two points.

Those priors don’t look like very strong.

The priors are quite strong with respect to my actual use case; I don’t know how smoothly the relationship between y and X (i.e. \beta) varies, or the total range of \beta over the sample. Here my priors look quite similar to my synthetic \beta.

You could try dropping gp_mu, especially if y is centered to have mean zero.

Yes, I could do that with this simulated data, because there is no covariate X, but I don’t I think I can more generally, as centering the variables won’t center their dependence (\beta), which is the thing I’m modeling as the GP.

More likely because gp_mu and beta_f[1] are highly correlating and you’re using diagonal mass matrix. You could try with dense mass matrix…

This is a great insight. I think I could have spent a year in a room thinking about this and still not have joined these dots.

Going to mark this closed as there is plenty here for me to work on and test. Thanks again!

Thanks Niko,

I think I understand this, but I’m still quite early on the Dunning-Kruger curve, so I could be completely misunderstanding. The reason I didn’t centre the basis functions was due to the ‘pinching’ effect it seems to have on the prior (I asked about this in another post), but this is probably a sensible compromise to make if it significantly helps the identifiability. I did not consider centering differents bfs individually… If you do find you have an example with an explanation it would be very much appreciated!

Hey,

yes, centering the basis function coefficients will lead to funnel(s) in the prior distribution. However, from the looks of your simulated data, it looks like the posterior has inverted funnels for the first basis function coefficients, as they seem very well-identified (disregarding the influence from any intercept term).

Okay, so just a short illustration without any practical actionable recommendation:

For what follows, the “non-centered” parametrization is the one where the model parameters corresponding to the basis function coefficients have a standard normal prior, independent from the GP hyperparameters and the “centered” parametrization would be the one were you parametrize the model directly in terms of the basis function function coefficients with the appropriate prior.

In the two attached figures you’ll see pairplots for 20 basis functions (rowwise) of gp_unit_sigma vs. gp_basis_function_coefficient (first three columns) and gp_unit_lengthscale vs. the same (last three columns). The first column in each group shows the non-centered parametrization, the third column the centered parametrization and the middle column a hybrid parametrization found via optimization.

One figure shows samples from “the” prior and the other shows sample from a posterior. For samples from the prior, the non-centered parametrization is clearly superior to the centered parametrization, which has funnels all over the place. For samples from the posterior on the other hand, for the first basis function coefficients, the centered parametrization appears to be better than the non-centered one and vice versa for the last basis function coefficients. For all basis function coefficients, the hybrid parametrization appears at least as good as the best of the other two.

Prior:

Posterior:

I think adjusting the parametrization should lower the treedepth, but I’m not sure, as I haven’t implemented that part yet.

1 Like

Hi Niko,

Thank you for the follow-up! Sincerely appreciated. That’s a striking difference between the two parameterisations, and also the prior/posterior.

I don’t think I quite understand the centered / non-centered definitions though, I think I interpreted incorrectly before. I was thinking of ‘centered’ as de-meaned bfs, not a change in parameterisation.

the “non-centered” parametrization is the one where the model parameters corresponding to the basis function coefficients have a standard normal prior, independent from the GP hyperparameters

OK, so that sounds like the “default” implementation right? Like how I have it in the code above?

the “centered” parametrization would be the one were you parametrize the model directly in terms of the basis function function coefficients with the appropriate prior.

This confuses me. Would you keep the original spectral densities? I guess the ‘direct’ coefficient priors have to be normal for the prior to still qualify as a GP? Are there any details regarding implementation you can share, even just a few lines that can clarify the centered implementation.

Thanks again!

Oh! I see now where and why I have lost you! I didn’t make the connection to that previous post (of yours) Hilbert space approximate GP prior: why use de-meaned basis functions (PHI)?

I see why this has been confusing 😅

Anyways, when I talked about the centeredness of the parameterization I wasn’t talking about demeaning the basis functions. “Centered” / “Non-centered” parameterizations are the common but very unhelpful names for two specific ways of parametrizing hierarchical models, see eg 24.7 Reparameterization | Stan User’s Guide But of course in your last post everybody meant demeaned / non-demeaned basis functions when saying centered / non-centered.

I wanted to write a something up anyways, but this may take a few days in which I figure out the details myself. In the meantime the above link may be interesting, if not necessarily helpful for your current problem.

Edit:

Yes.

Yes, it’s only a change of variables.

Yes, they should still be normal after reparametrization (but I don’t know wheter this is always necessary for GP discretizations).

In an ideal world, the so called centered parametrization (for a simpler model) would be

parameters {
   real<lower=0> sigma;
   real x;
}
model {
   sigma ~ lognormal(0,1);
   x ~ normal(0, sigma);
}

and the non-centered one would be

parameters {
   real<lower=0> sigma;
   real<offset=0, multiplier=sigma> x;
}
model {
   sigma ~ lognormal(0,1);
   x ~ normal(0, sigma);
}

which internally would look something like this

parameters {
   real<lower=0> sigma;
   real xi;
}
transformed parameters {
  real x = sigma * xi;
}
model {
   sigma ~ lognormal(0,1);
   xi ~ normal(0, 1);
}

But I think there is some bug in the implementation of the offset/multiplier implementation where it sometimes gets stuck in weird places for numerical reasons.

2 Likes

Hi,
Thanks again for your helpful reply!

Oh! I see now where and why I have lost you!

Yep, that’s what tripped me up at first!

“Centered” / “Non-centered” parameterizations are the common but very unhelpful names for two specific ways of parametrizing hierarchical models…

I understand centered vs non-centered parameterizations in the standard models, I just don’t don’t quite understand them in the context of the GP basis function coefficients. Do you mean that for the first x basis functions, rather than having e.g.
f_nc = gp_mu + PHI_f * (diagSPD_f .* beta_f)
beta_f ~ normal(0, 1),
you would have something like:
f_c ~ normal(gp_mu, PHI_f * diagSPD_f ) ?

(I know I have ignored dimensions and slicing etc in the code above, and this wouldn’t actually run as-is. If it’s too hand-wavy to catch what I’m saying I can tighten it up.)

Ideally I would want to have something like this:

data {
  //...
  vector<lower=0, upper=1>[gp_no_basis_functions] gp_centeredness;
}
parameters {
   real<lower=0> gp_sigma;
   real<lower=0> gp_lengthscale;
   vector<multiplier=pow(diagSPD_f,1-gp_centeredness)>[gp_no_basis_functions] gp_basis_function_coefficients;
}
transformed parameters {
   vector[no_measurements] f = gp_mu + PHI_f * gp_basis_function_coefficients; // Or the faster alternative proposed by Aki
}
model {
   //...
   gp_basis_function_coefficients ~ normal(0, diagSPD_f);
}

But I tried this and the sampler gets stuck, I think due the bug in the implementation.

1 Like

Ahh ok. That’s helpful, thanks!

Hope it helps. If there are further questions or maybe a nice example, don’t hesitate to share them.