Non-identifiability for Gaussian process with random intercepts

I am considering a latent Gaussian process f that is observed through k different units, each with its own intercept b_i. The model is

\begin{aligned}
f &\sim \mathsf{GP}\left(K\right) \\
b_i &\sim \mathsf{Normal}\left(0, 1\right) \\
y_{ij} &\sim \mathsf{Normal}\left(b_i + f_j, \kappa_i^2\right),
\end{aligned}

where K is the known covariance matrix, \kappa_i^2 is the observation variance for unit i, and y is the matrix of observations. The model is motivated by spectroscopic measurements: the shared GP is the unknown spectrum we’re after, and the biases capture how “bright” the sample is.

Naively fitting this model fails because the model is not (or only weakly) identifiable. E.g., we can subtract 1 from the Gaussian process and add 1 to the intercepts without changing the likelihood (and the priors aren’t strong enough to break the degeneracy). In practice, the elements of f are anti-correlated with the elements of b in the posterior, while f and b each show positive correlation internally.

Intuitively, I’d like a Gaussian process that has zero sample mean rather than zero population mean. That would mean the degeneracy disappears and sampling could (hopefully) proceed without issues. I’ve tried a few different approaches but haven’t yet found a nice solution.

  1. Simply subtract the sample mean from the GP realization, i.e., proceed as before except modify the observation model to y_{ij}\sim\mathsf{Normal}\left(b_i+f_j-\bar f,\kappa_i^2\right). This removes the correlation between f and b but of course leaves the elements of f correlated internally.
  2. Integrate out b analytically. In practice, though, the observation model is not normal, so this would require some manual work that I’d prefer to avoid.
  3. Use a non-centered parameterization f=Lz, where L is the Cholesky factor of K and z is white noise. Mean-subtracting the columns of L gives a zero-sample-mean Gaussian process (effectively a non-centered version of 1.; a minimal Stan sketch is included after this list). As far as I can tell this might just be a special case of @avehtari’s de-meaned basis functions (discussed a bit here: Hilbert space approximate GP prior: why use de-meaned basis functions (PHI)?). This does a little better but still has correlations (because we fundamentally have too many parameters).
  4. Call the de-meaned Cholesky factor \tilde L. It is rank deficient (because of the de-meaning), so one of its singular values is zero. We can generate white noise z, scale it by the singular values, and rotate back into the original space of \tilde L. This is nice in theory because one of the parameters is completely decoupled from the rest of the model and we are left with the “right” number of parameters. In practice it’s pretty slow because we need to evaluate the Cholesky decomposition and the SVD inside the model (I don’t know K ahead of time).
  5. Use more informative priors. Unfortunately, there can be both strong variability in the intercepts (different samples can have very different brightnesses) and strong variability in the GP (the spectrum varies).
  6. Pin the intercept of one sample to zero. This improves the fit somewhat, but if there are many samples, their covariance can “overwhelm” the information imposed by pinning just one of the intercepts.
  7. Pin one of the values of the GP to zero. This similarly improves the fit slightly but results in odd behavior: the fit, understandably, has zero variance at the pinned value, and the rest of it varies like a dog wagging its tail, which is not something we’d expect a spectrum to do. The tail wagging also retains much of the correlation between f and b.
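
For concreteness, here is a minimal Stan sketch of option 3 (the variable names are placeholders, and K is treated as known here even though in my real problem it depends on hyperparameters):

```stan
data {
  int<lower=1> n;         // number of points at which the GP is evaluated
  matrix[n, n] K;         // GP covariance, assumed known for this sketch
}
transformed data {
  matrix[n, n] L = cholesky_decompose(K);
  // subtract the column means of L so that f = L_demeaned * z always has zero sample mean
  row_vector[n] col_means = rep_row_vector(1.0 / n, n) * L;
  matrix[n, n] L_demeaned = L - rep_matrix(col_means, n);
}
parameters {
  vector[n] z;            // white noise for the non-centered parameterization
}
transformed parameters {
  vector[n] f = L_demeaned * z;   // zero-sample-mean GP realization
}
model {
  z ~ std_normal();
  // ... observation model using b[i] + f[j] goes here
}
```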

Any ideas you might have about how to fit this model would be much appreciated! It feels like there should be a simple reparameterization that eliminates one of the variables, but I’m struggling to figure it out.


@avehtari

Is there any strong reason why it’s preferable to enforce a zero sample mean on the GP rather than on the intercepts?

In principle, enforcing a zero sample mean for the intercepts would also work. Are you thinking of something like b_k=-\sum_{i=1}^{k-1}b_i? My hunch is that we might observe behavior similar to 6. I’ll give that a go and report the results here (and I should share the models for reproducibility).
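
In Stan, I imagine that hard constraint looking something like this minimal sketch (b_free is a placeholder name):

```stan
data {
  int<lower=2> k;   // number of units
}
parameters {
  vector[k - 1] b_free;
}
transformed parameters {
  // hard sum-to-zero constraint: the last intercept is minus the sum of the others
  vector[k] b = append_row(b_free, -sum(b_free));
}
model {
  b_free ~ normal(0, 1);
  // ... observation model using b[i] + f[j] goes here
}
```

One thing to keep in mind: with independent normal(0, 1) priors on b_free, the implied prior on the last intercept has variance k - 1, so the intercepts are no longer exchangeable a priori.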

Yep, that’s the “hard constraint” way. The “soft constraint” way is to just add something like target += normal_lpdf(sum(b) | 0, 0.001) to the model block.

Sorry, I cannot go through your entire list of options right now (I may try to do it later, but… December), but I have some experience trying to fit random slopes plus a Gaussian process, and, not surprisingly, there were identifiability issues there too. I am not aware of a real example outside of toy textbook problems where \mu and f are identifiable in y \sim \mathcal{N}(\mu, K), and I would be happy to see one.

I think what was happening in my case was that the slopes were capturing most of the trend, the remaining nonlinear pattern was small compared to the noise, and the GP ended up just modeling that. Intuitively, I think GPs are almost too flexible, so they will compete with everything but the simplest function, and there will be this kind of non-identifiability. Our solution was to simply fix the mean of the data and let the GP model everything else.

I think your assumption is that only the baseline changes between samples, so they would simply be shifted with respect to the actual spectra, but an alternative is that the data are stretched. In the former case, an option is to fix the mean for each sample; in the latter, you could estimate a single intercept for all spectra and let the GP pick up the rest.

As the b_i have a normal prior and y has a normal distribution, you could include the effect of b_i in the covariance matrix K and analytically integrate over f and b, to get stable inference for the covariance function parameters and the observation variance.
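
Spelling that marginalization out for the model as written at the top of the thread: the observations are jointly Gaussian with zero mean and covariance

\operatorname{Cov}\left(y_{ij}, y_{i'j'}\right) = K_{jj'} + \mathbb{1}\left[i = i'\right] + \mathbb{1}\left[i = i'\right]\mathbb{1}\left[j = j'\right]\kappa_i^2,

i.e., the shared GP contributes K_{jj'} between every pair of observations, the unit-level intercept (with its unit prior variance) adds 1 within a unit, and the observation noise adds \kappa_i^2 on the diagonal.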

Thank you for all the suggestions! I’ve coded them up here to compare the different methods (very rough, single chain, not checking for divergences or Rhat, etc.). They are all able to infer the GP and intercepts (up to a random offset that isn’t material).

In terms of runtime and reducing posterior correlation, placing a soft zero-mean constraint on the intercepts seems to work well, as suggested by @jsocolar.

I like @avehtari’s suggestion of integrating out the biases and the GP. If I understand correctly, the data y would not only be correlated within each unit but also between units (due to the marginalized GP f). I have on the order of 30 units and 100 observations per unit, which would give a 3000 by 3000 covariance matrix. Probably doable, but big-ish.
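
To make that concrete for myself, here is a rough sketch of the marginalized model in Stan, assuming the Gaussian observation model from my original post and treating K as known (in my real problem K depends on hyperparameters and the observation model isn’t exactly normal, so this is only an idealization):

```stan
data {
  int<lower=1> k;              // number of units (~30)
  int<lower=1> n;              // observations per unit (~100)
  matrix[n, n] K;              // GP covariance at the observation grid
  matrix[k, n] y;              // observations, one row per unit
}
parameters {
  vector<lower=0>[k] kappa;    // per-unit observation noise scales
}
model {
  // marginal covariance of the stacked observations: block (i, j) is
  // K + 1[i == j] * (ones(n, n) + kappa[i]^2 * I)
  matrix[k * n, k * n] Sigma;
  for (i in 1:k) {
    for (j in 1:k) {
      matrix[n, n] blk = K;
      if (i == j) {
        blk += rep_matrix(1.0, n, n)
               + diag_matrix(rep_vector(square(kappa[i]), n));
      }
      Sigma[((i - 1) * n + 1):(i * n), ((j - 1) * n + 1):(j * n)] = blk;
    }
  }
  kappa ~ normal(0, 1);        // placeholder prior
  // to_vector(y') stacks the observations unit by unit, matching the block order above
  to_vector(y') ~ multi_normal(rep_vector(0, k * n), Sigma);
}
```

Because Sigma depends on the parameters, every gradient evaluation involves factorizing the full 3000 by 3000 matrix, which is where the cost would come from.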


3000 x 3000 is not that big for GPs in general. For example, the cover of BDA3 is based on a GP with an 8000 x 8000 covariance matrix, and that was ten years ago (since then computation speed has increased a lot). However, Stan’s autodiff design is not good for user-defined covariance functions, and that can make this infeasible in Stan. There are plans to make GPs with user-defined covariance functions in Stan faster, but since that requires big changes it will take a long time.