I am considering a latent Gaussian process f which is observed through k different units with different intercepts b. The model is

f\sim\mathsf{MultivariateNormal}\left(0,K\right),\qquad y_{ij}\sim\mathsf{Normal}\left(b_i+f_j,\kappa_i^2\right),

where K is the known covariance matrix, \kappa_i^2 is the observation variance for unit i, and y is the matrix of observations. The model is motivated by spectroscopic measurements: the shared GP is the unknown spectrum we’re after, and the biases capture how “bright” each sample is.
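For concreteness, here is a minimal numpy sketch of the generative model as I’m thinking about it (the squared-exponential kernel, grid, and scales below are placeholders for illustration, not my actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes: n grid points (e.g., wavelengths) and k units (samples).
n, k = 50, 5
x = np.linspace(0, 1, n)

# Placeholder squared-exponential kernel with a small jitter; in my problem K comes from elsewhere.
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2) + 1e-9 * np.eye(n)

f = rng.multivariate_normal(np.zeros(n), K)      # shared latent spectrum
b = rng.normal(0.0, 1.0, size=k)                 # per-unit intercepts ("brightness")
kappa = np.full(k, 0.1)                          # per-unit observation noise scales
y = b[:, None] + f[None, :] + kappa[:, None] * rng.normal(size=(k, n))
```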
Naively fitting this model fails because it is not (or only weakly) identifiable. E.g., we can subtract 1 from the Gaussian process and add 1 to the intercepts without changing the likelihood (and the priors aren’t strong enough to break the degeneracy). In practice, the elements of f and b end up anti-correlated with one another, while the elements within f (and within b) are positively correlated among themselves.
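Concretely, the likelihood depends on b and f only through the sums b_i+f_j, and

\left(b_i+c\right)+\left(f_j-c\right)=b_i+f_j\quad\text{for any }c,

so the posterior is flat along the direction (b,f)\mapsto(b+c,f-c) apart from whatever the priors contribute.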
Intuitively, I’d like a Gaussian process that has zero sample mean rather than zero population mean. That would remove the degeneracy, and sampling could (hopefully) proceed without issues. I’ve tried a few different approaches but haven’t yet found a nice solution:
1. Simply subtract the sample mean from the GP realization, i.e., proceed as before except modify the observation model to y_{ij}\sim\mathsf{Normal}\left(b_i+f_j-\bar f,\kappa_i^2\right). This removes the correlation between f and b but of course leaves the elements of f correlated with one another.
2. Integrate out b analytically. In practice, however, the observation model is not normal, so this would require some manual work that I’d prefer to avoid.
3. Use a non-centered parameterization f=Lz, where L is the Cholesky decomposition of K and z is white noise. Mean-subtracting the columns of L gives a zero-sample-mean Gaussian process (effectively a non-centered version of 1.). As far as I can tell, this might just be a special case of @avehtari’s de-meaned basis functions (discussed a bit here: Hilbert space approximate GP prior: why use de-meaned basis functions (PHI)?). This does a little better but still has correlations because we fundamentally have too many parameters; see the first sketch after this list.
4. Let’s call the de-meaned Cholesky factor \tilde L. It is rank deficient (because we de-meaned), so one of its singular values is zero. We can now generate white noise z, scale by the singular values, and rotate back to the original space of \tilde L; see the second sketch after this list. This is nice in theory because one of the parameters is completely decoupled from the rest of the model and we are left with the “right” number of parameters. In practice, it’s pretty slow because we need to evaluate both the Cholesky decomposition and the SVD (I don’t know K ahead of time, so they can’t be precomputed).
5. Use more informative priors. Unfortunately, there can be strong variability both in the intercepts (different samples can have very different brightnesses) and in the GP (the spectrum varies).
6. Pin the intercept of one sample to zero. This improves the fitting somewhat, but if there are many samples, their covariance can “overwhelm” the information imposed by pinning a single intercept.
7. Pin one of the values of the GP to zero. This similarly improves things slightly but results in odd behavior: the fit, understandably, has zero variance at the pinned value, and the rest of it varies like a dog wagging its tail, which is not something we’d expect a spectrum to do. The tail wagging also retains much of the correlation between f and b.
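For reference, here is a small numpy sketch of what I mean in 1. and 3. (again with a placeholder kernel; nothing here is specific to my actual data):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50
x = np.linspace(0, 1, n)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2) + 1e-9 * np.eye(n)

L = np.linalg.cholesky(K)
L_tilde = L - L.mean(axis=0, keepdims=True)   # subtract the mean of each column of L

z = rng.normal(size=n)        # white noise: these are the sampled parameters
f = L @ z                     # ordinary non-centered draw, zero *population* mean
f_tilde = L_tilde @ z         # approach 3: zero *sample* mean by construction
# Approach 1 is the centered analogue: draw f ~ MVN(0, K) and use f - f.mean().

print(np.allclose(f_tilde.mean(), 0.0))      # True
print(np.allclose(f_tilde, f - f.mean()))    # True: the two parameterizations agree
```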
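And one way to implement 4., written here so that the zero singular value is dropped explicitly and only n - 1 white-noise parameters remain (equivalently, one could keep n parameters with one of them completely decoupled):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50
x = np.linspace(0, 1, n)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2) + 1e-9 * np.eye(n)

L = np.linalg.cholesky(K)
L_tilde = L - L.mean(axis=0, keepdims=True)   # de-meaned Cholesky factor, rank n - 1

# The smallest singular value is (numerically) zero because each column of L_tilde sums to zero.
U, s, _ = np.linalg.svd(L_tilde)
U, s = U[:, : n - 1], s[: n - 1]              # drop the null direction

z = rng.normal(size=n - 1)                    # only n - 1 white-noise parameters
f = U @ (s * z)                               # zero-sample-mean draw with covariance L_tilde @ L_tilde.T

print(np.allclose(f.mean(), 0.0))             # True
```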
Any ideas you might have about how to fit this model would be much appreciated! It feels like there should be a simple reparameterization that eliminates one of the variables, but I’m struggling to figure it out.