Gaussian process not recovering alpha parameter in cov_exp_quad

I’m trying to write a latent variable Gaussian process regression in which the GP varies over time, not over my exogenous variable values.

I’ve attached the code. I’m first simulating Ys, by taking the median of fake data generated in sim_data.R and using single_y_gp_sim.stan, then attempting to recover the parameters I set using single_y_gp.stan. I’m getting the length-scale parameter, rho, alright, but for some reason, no matter what I set alpha too, my alpha posterior distributions are too low.

What am I doing wrong? Is it taking the median of the simulated data?

sim_data.R (1.3 KB)
single_y_gp_sim.stan (944 Bytes)
single_y_gp.stan (1023 Bytes)

Thanks for your help!

Could you write out the GP covariance function you’re trying to get at?

I had a look at this, but I had trouble figuring out the code. Might just be my brain being rusty though.

Is there an easier version where the x is one dimensional as well :D?

Hi,

Please find attached. The problem persists in the simpler version, so at least we have that possibility eliminated. I’m also not recovering alpha if I use a single simulated y from single_xy_gp_sim, instead of taking the median.

single_xy_gp_sim.stan (700 Bytes)
single_xy_gp.stan (793 Bytes)
single_xy_gp.R (791 Bytes)

Hmm, so is the goal to:

  1. generate a time series (eta) from an AR(1) model
  2. add a zero mean GP on top
  3. then try to recover both the parameters of the zero mean GP as well as the eta time series?

Do I have that right?

So something like:

y ~ multi_normal(eta, K); // sigma added in to diagonal of K here

That integrates out all the latent variables of the GP. You could also do:

transformed parameters {
  vector[N] u;
  L = cholesky_decompose(K);
  u = L * z;
}
model {
  z ~ normal(0, 1);
  ... // Other priors
  y ~ normal(eta + u, sigma);
}

If you wanted to get the latent values of the GP. Is something like this what you were looking for?

It would probably be super hard to separate eta and u here if so.

If this isn’t what you’re looking for then I gotta get some more explanation on f = L_K * eta .* x; :D. That’s the thing I don’t think I’ve run across before.

I’m trying to make something like a Kalman Filter, in which betas from a linear regression on y vary smoothly over time. The eta I generated was meant to be a reasonable approximation of betas that vary smoothly. So:
y ~ multi_normal(eta .* x, K);
where x is an exogenous variable, and eta is the vector[N] equivalent of a constant beta in a linear regression.

Does that make sense?

I’m gonna think some more, but see if there is anything helpful in this thread: AR/Kalman-like models where the k-th estimate of a variable depends only on obs 1..k-1

The eta I generated was meant to be a reasonable approximation of betas that vary smoothly.

Hmm, okay, the words seem reasonable at least. (edit: that sounds sarcastic but what I meant was I understood what you said :P) The thing I don’t like about this is the GP sitting on top of everything.

Why not just y ~ normal(eta .* x, sigma)? Where eta is a smooth thing (we can put a zero mean GP prior on it)?

If I understand your formulation, that looks something like:

model {
  matrix[N,N] K = cov_exp_quad(time, alpha, rho);
...
  eta ~ multi_normal(0, K);
  y ~ normal(eta .* x, sigma);
}

I’ll try to simulate and recover with it written up that way.

I want to get to a point where y, and eta are matrixes and x is a three-dimensional array. So each y is associated with a matrix of x, and each column of x has the same eta column associated with it for all y columns. So, going this route, each column of eta is eta[,d] ~ multi_normal(0, K[d]), and each row of y is y ~ multi_normal(eta[n] * x[n], Sigma). That’s how I have the full model written at the moment, anyway. But I was trying to build up from simpler models before testing that one, and got stuck on alpha being consistently too low.

Yeah, seems reasonable. Give it a go and see what happens.

Ok, so that time I got it!

I wrote a function to generate the fake eta values within stan, instead of that AR(1) process I had before, and the values generated didn’t look at all like what I was getting in the first few tries. So the alpha I was trying to recover was unrealistic.

I attached the code I used. Thanks for the help! Hopefully I can scale up from here.

single_xy_gp_sim2.stan (1.0 KB)
single_xy_gp2.R (827 Bytes)
single_xy_gp2.stan (802 Bytes)

Cool beans. GPs can get a little funky to fit, so if things get weird, ask.

From the Stan prior recommendations page (https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations#priors-for-gaussian-processes): “In practice when we fit Gaussian processes we often either set the length-scale parameter to a fixed value corresponding to some desired level of smoothing, or we give it a strong prior. The identification issues are real”

So don’t get too caught up in the hyperparameters.

I always see @mike-lawrence and @Andre_Pfeuffer around here in threads about GPs and ARs and stuff. There have been some interesting threads around. If you’re bored you might search their names and see if you find anything interesting.

We have much better recommendations than that! Paper and case study coming, but long story short use an inverse gamma on the length scale where you’ve chosen the parameters so the left hand tail does not give very much mass below the median spacing of your observations [or some representative number]. This fixes all of your problems. Paper forthcoming (hopefully) with @rtrangucci @betanalpha and @avehtari (See also https://arxiv.org/abs/1503.00256)

2 Likes

For the moment you can read about our current recommendations (that will be expanded in the paper and the case study) in the latest version of the manual.

1 Like

Then the only recommendations on that page for GPs then that are still valid are these two

See also recommendations in Section 18.3 in Stan reference manual
For Matern fields, then the joint penalised complexity prior is available for the parameters (variance, range) parameters

Should we just delete the rest?

Probably.

Interesting. I thought the Betancourt Trangucci workshop paper from last year was really useful. I’ll never NOT put a prior on my length scales now.

I was talking about priors on the length-scale (alpha) with a lab-mate which led me to an interesting perspective, I wanted to get the thoughts of @anon75146577 and @betanalpha. Basically for the squared-exponential kernel the eigen-functions are modified Hermite polynomials, so we can think about GP regression as linear regression with an infinite number of Hermite polynomial terms. Different values of the length-scale will lead to different bases of those modified Hermite terms, but no matter what the length-scale is, you’ll have a complete basis for representing all L2 functions.

From that perspective the length-scale will be un-identifiable because they’ll all lead to a basis that’s expressive enough to represent your L2 function. This is why I think putting good priors on the length-scale is so helpful in practice, because you’re able to put preference on which of these bases to use in a principled way.

Thoughts?

Mercer’s theorem says that every covariance function induces a complete basis for representing L2 functions, so this isn’t unique.

Also for a squared exponential, the length scale is identified!

Where identification here is the technical term, meaning that with infinite data you get a likelihood that is non-zero at only a single point in parameter space.
For finite data, however, the hyperposterior of the squared exponential kernel manifests the same curved shape of a non-identified kernel like the Matern.