Using GP for autocorrelation in a time-series model: memory pressure, GP kernel definition

Hi there,

I was wondering if there was a link or resource that you had intended to include in your post? I am trying to see if I can get to where I can fit this model as a GP using non-centered parameterization and improving my priors, as you mentioned before I try estimating it using splines. If you had a reference in mind, I’d love to know about it. Otherwise I’ll keep sifting around.

Cheers,

Connor

Whoops, my bad. https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html.


Hey Mike,

I have been trying to find an example of this Kronecker trick applied, and the ones I can find (in @rtrangucci's GitHub repository) seem to involve responses over a 2+D grid. I just want the GP over a single (time) dimension. Do you know whether I can apply the same method in one dimension? This model:

seems close to what I want (with D = 1). If so, then I guess the Kronecker trick basically comes down to this line?

    latent_gp = L_K * y_tilde * diag_pre_multiply(alphas, L_Omega)';

It seems like the scale of K hasn’t changed. Would efficiency gains related to the separability of covariance structures only apply if we’re talking about computational limits arising from multiple outputs over some grid?

When I run the above model with my data (just modeling the latent gp, leaving out the wind predictor), I still estimate two variables, latent_gp and y_tilde, every time I take a draw from the posterior. As such, I could use some clarification on how or whether the Kronecker trick can reduce the computational load of the model I’m aiming for. Any thoughts would be greatly appreciated.

Cheers,

Connor


I am not aware of how a Kronecker covariance matrix could be derived from purely 1-dimensional input. Limiting the bandwidth of a typical covariance function could result in a banded covariance matrix which could be inverted much more quickly, but that’s a modeling choice.

functions {
  matrix approx_L(int M, real scale, vector x, real l) {
    int N = rows(x);

    real epsilon = 1 / (sqrt2() * l);
    real alpha = 1 / scale;
    real beta = (1.0 + (2.0 * epsilon / alpha)^2)^0.25;
    real delta = alpha * sqrt((beta^2 - 1.0) / 2.0);
    vector[N] Ht[M];
    matrix[N, M] Htt;
    vector[N] xp = alpha * beta * x;
    real fsq = epsilon^2 / (alpha^2 + delta^2 + epsilon^2);
    real f = sqrt(fsq);
    Ht[1] = sqrt(sqrt(alpha^2 / (alpha^2 + delta^2 + epsilon^2))) * sqrt(beta) * exp(-delta^2 * x .* x);
    Ht[2] = f * sqrt(2.0) * xp .* Ht[1];
    for(n in 3:M) {
      Ht[n] = f * sqrt(2.0 / (n - 1.0)) * xp .* Ht[n - 1] - fsq * sqrt((n - 2.0) / (n - 1.0)) * Ht[n - 2];
    }

    for(n in 1:M) {
      Htt[:, n] = Ht[n];
    }
    return  Htt;
  }
}
data {
  int<lower=1> N; // number of (time) points in series
  vector[N] currents; // vector of Alongshore current speeds
  vector[N] wind; // vector of wind speeds
  //matrix[N,N] Dmat; // Distance matrix (scaled = /3600)

  real scale;  // set to 0.2
  vector[N] Distance;
  int<lower=1> M;  // approx points

  //prediction
  int<lower=1> S; // length of wind series to simulate over
  vector[S] sim_wind; // vector of wind values to simulate over
}
parameters {
  vector[M] g; // g is the variable intercept offset
  real a; // overall model intercept
  real b; // slope describing relationship to wind predictor
  real<lower=0> sigma_currents;
  real<lower=0> eta_sq;
  real<lower=0> inv_rho_sq;
  real<lower=0> sigma_sq;
}

transformed parameters {
  real<lower=0> rho_sq = inv(inv_rho_sq);
}

model {
    vector[N] mu;
    //priors
    eta_sq ~ cauchy(0,1);
    inv_rho_sq ~ cauchy(0,1);
    sigma_sq ~ cauchy(0,1);
    a ~ normal(0, 5);
    b ~ normal(0, 1);
    sigma_currents ~ normal(0, 10); // overall model sigma
    //model
    g ~ normal(0, 1);
    mu = a + approx_L(M, scale, Distance, inv_rho_sq) * g + b * wind;
    currents ~ normal(mu, sigma_currents);
}
generated quantities{
  vector[S] sim_cur; //simulated currents output
  sim_cur = a + b*sim_wind;
}

You can try this one. I modified the model to use approx_L; the approx_L function itself is from @bbbales2.

Choose M to be something reasonable; since your frequencies are low, M doesn't need to be very high (Nyquist criterion).

Added: the mean of Distance should be zero. That is not done in my code.

The approximation will work much better if the mean of the independent variable x is zero! The approximate kernels are not translation invariant or stationary, or however you think about it. Here's a picture of the first ten eigenfunctions (scaled by the square roots of their eigenvalues). The approximation will fail if there is data out near the boundaries (< ~-1.2 or > ~1.2) because all the eigenfunctions are near zero there:


Since you are working with one-dimensional GPs you might be interested in the work that has gone into converting GPs with particular choices of kernel into Kalman filters.
The original work is by Simo Särkkä’s group, e.g.
J. Hartikainen and S. Särkkä, “Kalman filtering and smoothing solutions to temporal Gaussian process regression models,” in 2010 IEEE International Workshop on Machine Learning for Signal Processing, 2010, pp. 379–384.

but I also find this work from Oxford to be quite neat:
https://arxiv.org/abs/1510.02830
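
Here is a minimal R sketch of the simplest case of that idea (not code from those papers, just an illustration): the exponential kernel k(t, t') = sigma2 * exp(-|t - t'| / l) corresponds exactly to an Ornstein-Uhlenbeck state-space model, so a scalar Kalman filter gives the GP marginal likelihood in O(N) instead of the usual O(N^3) Cholesky. The brute-force check at the end assumes the mvtnorm package; any y will do for verifying that the two computations agree.

## O(N) log marginal likelihood of y ~ GP(0, k) + N(0, tau2) for the
## exponential (Matern 1/2) kernel k(t, t') = sigma2 * exp(-|t - t'| / l),
## via its Ornstein-Uhlenbeck state-space representation.
ou_gp_loglik <- function(t, y, sigma2, l, tau2) {
  m <- 0        # filtered state mean
  P <- sigma2   # filtered state variance (stationary prior at t[1])
  ll <- 0
  for (i in seq_along(y)) {
    if (i > 1) {
      a <- exp(-(t[i] - t[i - 1]) / l)   # discrete-time transition coefficient
      m <- a * m
      P <- a^2 * P + sigma2 * (1 - a^2)  # process noise keeps the stationary variance
    }
    S <- P + tau2                        # predictive variance of y[i]
    ll <- ll + dnorm(y[i], m, sqrt(S), log = TRUE)
    k_gain <- P / S                      # Kalman gain
    m <- m + k_gain * (y[i] - m)
    P <- (1 - k_gain) * P
  }
  ll
}

## Brute-force check against the direct O(N^3) GP marginal likelihood
set.seed(1)
times <- sort(runif(50, 0, 10))
yobs  <- rnorm(50)
K <- 1.5 * exp(-abs(outer(times, times, "-")) / 2) + diag(0.3, 50)
c(mvtnorm::dmvnorm(yobs, sigma = K, log = TRUE),
  ou_gp_loglik(times, yobs, sigma2 = 1.5, l = 2, tau2 = 0.3))  # the two should agree

The Matern 3/2 and 5/2 kernels work the same way with a two- or three-dimensional state, which is what the paper above covers.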


Here is also a blog post with Stan code.

I have only one question: can we somehow make it scale linearly rather than quadratically?


Good question.

Am I seeing this right? The SDE version is slower:

http://www.juhokokkala.fi/blog/posts/linear-time-evaluation-of-matern-32-gp-density-in-stan-using-the-sde-representation/

Inference for Stan model: Matern32Poisson_basicGP_reparametrized_model
1 chains: each with iter=(2000000); warmup=(0); thin=(1); 2000000 iterations saved.

Warmup took (30) seconds, 30 seconds total
Sampling took (2886) seconds, 48 minutes total

Inference for Stan model: Matern32Poisson_SDE_reparametrized_model
1 chains: each with iter=(2000000); warmup=(0); thin=(1); 2000000 iterations saved.

Warmup took (26) seconds, 26 seconds total
Sampling took (3629) seconds, 1.0 hours total

That is for 5 points.

For 100 points

Inference for Stan model: Matern32Poisson_basicGP_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (2458) seconds, 41 minutes total
Sampling took (1827) seconds, 30 minutes total


Inference for Stan model: Matern32Poisson_SDE_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (127) seconds, 2.1 minutes total
Sampling took (105) seconds, 1.7 minutes total


Inference for Stan model: Matern32Poisson_basicGP_reparametrized_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (1240) seconds, 21 minutes total
Sampling took (1091) seconds, 18 minutes total


Inference for Stan model: Matern32Poisson_SDE_reparametrized_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (64) seconds, 1.1 minutes total
Sampling took (56) seconds, 56 seconds total

I'm not really sure the reported properties are right: warmup is set to 0, yet warmup took about 40 minutes.

Hey Andre and Ben,

@andre.pfeuffer
@bbbales2

I know this is fairly old, but I am still trying to get this model to work for my purposes. A quick reminder: I am trying to use the approximated GP to model error due to temporal autocorrelation in a model describing the relationship between winds and ocean currents.

I've (finally) realized that I am having a problem with the "Distance" vector, x, in the approx_L function. I have been giving the model a single distance vector, centered at zero, to use for the fit. That vector represents the distance in time between sampling points, but following earlier advice it is demeaned (so that its mean is zero). What happens is that the error term is really only valid for the middle of the data set, because that is where distance is zero after demeaning; the distance to all the other data points is measured from the exact middle data point's location in time.

The posterior distribution of the M parameters 'g', which describe the variable intercept due to autocorrelation error, ends up looking a lot like zero. But if you pull the GP out and do some inference there, you see that it is near zero at the extremes and has non-zero values right around the data points nearest the center of the dataset, i.e., where the distance vector accurately describes the distances to the other points. So the GP-approximated error seems to work, but only where the distance vector accurately represents the relationship of the current data point to all of the other data points. Here is a plot showing the mean posterior value of 'gp', where

gp = approx_L(M, scale, Distance, inv_rho_sq) * g;

What it seems like I need to do is update the distance vector that I feed to approx_L() (called xt in Ben's original function) so that it is zero at the present data point, rather than treating the middle of the data set as the zero position. In other words, I think I need to adapt approx_L() to use a different distance vector depending on the data point for which I'm estimating an error term.

Does any of this make sense? Any ideas on how to make that work in the Stan code?

I appreciate your thoughts!

Here’s my model:

functions {
  matrix approx_L(int M, real scale, vector x, real l) {
    int N = rows(x); 
    real epsilon = 1 / (sqrt2() * l);
    real alpha = 1 / scale;
    real beta = (1.0 + (2.0 * epsilon / alpha)^2)^0.25;
    real delta = alpha * sqrt((beta^2 - 1.0) / 2.0);
    vector[N] Ht[M];
    matrix[N, M] Htt;
    vector[N] xp = alpha * beta * x;
    real fsq = epsilon^2 / (alpha^2 + delta^2 + epsilon^2);
    real f = sqrt(fsq);
    Ht[1] = sqrt(sqrt(alpha^2 / (alpha^2 + delta^2 + epsilon^2))) * sqrt(beta) * exp(-delta^2 * x .* x);
    Ht[2] = f * sqrt(2.0) * xp .* Ht[1];
    for(n in 3:M) {
      Ht[n] = f * sqrt(2.0 / (n - 1.0)) * xp .* Ht[n - 1] - fsq * sqrt((n - 2.0) / (n - 1.0)) * Ht[n - 2];
    }

    for(n in 1:M) {
      Htt[:, n] = Ht[n];
    }
    return  Htt;
  }
}
data {
  // model parameters
  int<lower=1> N; // number of (time) points in series
  int<lower=0> K; // number of predictors
  matrix[N, K] X; // predictor design matrix
  vector[N] response; // vector of Alongshore current speeds

  // GP approximation parameters (modeling error as an approximate GP)
  real scale;  // set to 0.2
  vector[N] Distance;
  int<lower=1> M;  // number of eigenvectors of var-covar matrix to retain to approximate the GP

  //prediction
  int<lower=1> S; // length of wind series to simulate over
  int<lower=0> sK; // number of predictors for simulation
  matrix[S, sK] SIM; // matrix of simulated values
}
parameters {
  vector[M] g; // g is the variable intercept offset
  vector[K] B; // coefficients of the model (both intercepts and slopes)
  real<lower=0> sigma_response;
  // real<lower=0> eta_sq;
  real<lower=0> inv_rho_sq;
  // real<lower=0> sigma_sq;
}

transformed parameters {
  real<lower=0> rho_sq = inv(inv_rho_sq);

  vector[N] gp; // gaussian process approximation; represents temporal autocorrelation
  gp = approx_L(M, scale, Distance, inv_rho_sq) * g;

}

model{
    vector[N] mu;
    //priors
    // eta_sq ~ cauchy(0,1);
    inv_rho_sq ~ cauchy(0,1);
    // sigma_sq ~ cauchy(0,1);
    B[1] ~ normal(0,5); // intercept prior
    B[2:K] ~ normal(0,1); // slope priors
    g ~ normal(0, 1);


    sigma_response ~ normal(0, 5); // overall model sigma; 5 is probably too big for the variance hyperparameter
    //model
    mu = X * B + gp;
    response ~ normal(mu, sigma_response);
}
generated quantities{
  vector[S] sim_cur; //simulated response output
  vector[N] mu_pred;
  vector[N] pred_cur;
  vector[N] log_lik;
  vector[N] resids;
    sim_cur = SIM * B; // simulated response across the range of params
    mu_pred = X*B + gp; //+ approx_L(M, scale, Distance, inv_rho_sq) * g;
  for(i in 1:N){
    pred_cur[i] = normal_rng(mu_pred[i], sigma_response); // posterior predictions of the model
    log_lik[i] = normal_lpdf(response[i] | mu_pred[i], sigma_response); // compute log likelihood for use in AIC
  }  
  resids = response - mu_pred;
}

Take a look at:

matrix[P, M] L = approx_L(M, scale, xp, sigma, l);
log10error = log10(max(fabs(cov_exp_quad(xp, sigma, l) - L * L')));

from:
https://github.com/bbbales2/basketball/blob/master/models/approx_bernoulli_gp_fixed_sigma.stan

(and ignore that x and xp are mixed up in the source). So approx_L needs the absolute (centered) points, not the distances.

I use the following function:

normalize <- function(x) {
  x <- as.matrix(x)
  minAttr=apply(x, 2, min)
  maxAttr=apply(x, 2, max)
  x <- sweep(x, 2, minAttr, FUN="-")
  x=sweep(x, 2,  maxAttr-minAttr, "/")
  attr(x, 'normalized:min') = minAttr
  attr(x, 'normalized:max') = maxAttr
  return (x)
}

ddate.num <- normalize(as.numeric(x$dates))

Added: you have to subtract 0.5 to make it mean-centered.
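
For example, something like this (just a sketch; hours stands in for the hourly Distance index, and its length is only illustrative):

hours <- 0:2948                          # e.g. one point per hour
x <- as.vector(normalize(hours)) - 0.5   # now spans [-0.5, 0.5] with mean ~0
range(x); mean(x)                        # quick check before passing x to approx_L()
# Remember the length scale is now in units of "fraction of the series length",
# not hours, so priors on l have to change accordingly.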

Making predictions with cov_exp_quad in generated quantities doesn't require derivatives, so it won't hurt performance. I copied the following from a k = 1, ..., N, 2-dimensional GP I wrote a year ago. It needs modification, but it clearly shows the idea:

int N1 = s1[k];
int N2 = s2[k];
matrix[N1, N2] k_x1_x2 = cov_exp_quad(to_array_1d(ddate1[pos1:(pos1 + N1 - 1)])  // N1
                           , to_array_1d(ddate2[pos2:(pos2 + N2 - 1)])  // N2
                           , 1.0, l);
matrix[N1, N1] K = cov_exp_quad(
      to_array_1d(ddate1[pos1:(pos1 + N1 - 1)]), 1.0, l);
matrix[N1, M[k]] L;
matrix[N1, 2] t;
matrix[N1, N1] L_K;
matrix[N2, 2] t_hat;
vector[N1] K_div_y1;
vector[N2] tmp;
for (n in 1:N1)
  K[n, n] = K[n,n] + 1e-14;
L_K = cholesky_decompose(K);
L = approx_L(M[k], scale, ddate1[pos1:(pos1 + N1 - 1)], l);

t = L * err[IndexStarts_knots[k]:IndexEnds_knots[k]];

for(i in 1:2) {
  K_div_y1 = mdivide_left_tri_low(L_K, t[, i]);
  K_div_y1 = mdivide_right_tri_low(K_div_y1', L_K)';
  t_hat[, i] = k_x1_x2' * K_div_y1;
}

Yeah, what @andre.pfeuffer said is right. To do this you gotta center things and then make sure the scale isn't so big. This is the annoying thing about the approximation: the GP only cares about the distances between points, but the approximation ends up caring about their absolute positions.

If I'm remembering correctly, how far the points can be from zero and still be accurately described is determined by a combination of the length scale (larger length scale, and you can be farther away) and the number of basis functions (more basis functions, and you can be farther away). And then there's that scale parameter, whose details I've forgotten but which is important :D.

I think the easiest way to figure out if you’re okay or not is to plot a few covariance matrices, one computed with the approximation and one with the real thing and compare. Try various length scales and scales and see what happens. It’s extra work, but that’s the cost of this kind of expansion :P.

So center those numbers and scale them a little.

If your xs are in [-1000, 0], then just rescale them to [-0.5, 0.5] or something (like @andre.pfeuffer's code does), and then just remember that your length scale changed units.
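
Here is a minimal R sketch of that comparison (an R port of the approx_L() posted earlier plus the exact unit-variance squared-exponential kernel it is meant to approximate; scale = 0.2, l = 0.1, and M = 40 are just placeholder values to experiment with):

## R port of the approx_L() Stan function from earlier in this thread
approx_L <- function(M, scale, x, l) {
  epsilon <- 1 / (sqrt(2) * l)
  alpha   <- 1 / scale
  beta    <- (1 + (2 * epsilon / alpha)^2)^0.25
  delta   <- alpha * sqrt((beta^2 - 1) / 2)
  xp      <- alpha * beta * x
  fsq     <- epsilon^2 / (alpha^2 + delta^2 + epsilon^2)
  f       <- sqrt(fsq)
  Ht <- matrix(0, length(x), M)
  Ht[, 1] <- sqrt(sqrt(alpha^2 / (alpha^2 + delta^2 + epsilon^2))) * sqrt(beta) *
    exp(-delta^2 * x * x)
  Ht[, 2] <- f * sqrt(2) * xp * Ht[, 1]
  for (n in 3:M) {
    Ht[, n] <- f * sqrt(2 / (n - 1)) * xp * Ht[, n - 1] -
      fsq * sqrt((n - 2) / (n - 1)) * Ht[, n - 2]
  }
  Ht
}

## Exact unit-variance squared-exponential kernel for comparison
se_kernel <- function(x, l) outer(x, x, function(a, b) exp(-(a - b)^2 / (2 * l^2)))

x <- seq(-0.5, 0.5, length.out = 200)   # centered, rescaled inputs
l <- 0.1                                # try several length scales ...
M <- 40                                 # ... and several numbers of basis functions
L <- approx_L(M, scale = 0.2, x = x, l = l)
K_approx <- L %*% t(L)
K_exact  <- se_kernel(x, l)
max(abs(K_approx - K_exact))            # log10 of this is the error check andre quoted
image(K_exact); image(K_approx)         # quick visual comparison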

Probably too late to be helpful, but a) I also would have thought an SDE would make more sense than a GP for a long series of time-structured data (but I'm interested in any opinions to the contrary!), and b) the currents look a fair bit like a sunspots example I replicated from elsewhere, with a second-order linear SDE plus a moving-average measurement model, i.e. a CARMA(2,1). In case it's of interest:

install.packages("devtools")
library(devtools)
install_github("cdriveraus/ctsem")
library(ctsem)


 sunspots<-sunspot.year
 sunspots<-sunspots[50: (length(sunspots) - (1988-1924))]
 id <- 1
 time <- 1749:1924
datalong <- cbind(id, time, sunspots)

#setup model
 ssmodel <- ctModel(type='stanct', n.latent=2, n.manifest=1, 
  manifestNames='sunspots', 
  latentNames=c('ss_level', 'ss_velocity'),
   LAMBDA=matrix(c( 1, 'ma1' ), nrow=1, ncol=2),
   DRIFT=matrix(c(0, 'a21', 1, 'a22'), nrow=2, ncol=2),
   MANIFESTMEANS=matrix(c('m1'), nrow=1, ncol=1),
   CINT=matrix(c(0, 0), nrow=2, ncol=1),
   T0VAR=matrix(c(1,0,0,1), nrow=2, ncol=2), #Because single subject
   DIFFUSION=matrix(c(0, 0, 0, "diffusion"), ncol=2, nrow=2))
 
 ssmodel$pars$indvarying<-FALSE #Because single subject
 ssmodel$pars$offset[14]<- 44 #Because not mean centered
 ssmodel$pars[4,c('transform','offset')]<- c(1,0) #To avoid multi modality 

#fit
ssfit <- ctStanFit(datalong, ssmodel, iter=300, chains=2)

#output
summary(ssfit)

ctKalman(ssfit,timestep = .02,kalmanvec=c('y','ysmooth'),plot=TRUE)
par(mfrow=c(3,3))
ctStanPostPredict(ssfit,diffsize=c(5,20),wait=FALSE)

OK, but we have to check the stationarity of the time series first. Maybe we need to take differences, and what makes us sure that an ARMA(2,1) is appropriate? He mentioned he used a filter. Before applying a filter, one has to detrend. However, the model fits an intercept:

    g ~ multi_normal_cholesky(rep_vector(0, N), L); // model variable intercepts as a Gaussian process with covariance function Sigma
    for (i in 1:N) {
      mu[i] = a + g[i] + b * wind[i];
    }
    currents ~ normal(mu, sigma_currents);
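
On the stationarity question, a quick first look could be something like this in R (just a sketch; currents is a placeholder name for the alongshore current series):

acf(currents, lag.max = 200)          # slowly decaying ACF suggests detrending or differencing
pacf(currents, lag.max = 200)         # rough guide to an AR order
acf(diff(currents), lag.max = 200)    # does first-differencing remove the trend?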

The graphic in the first post looks to me like it shows mean-centered time series. Another question: do both time series share the same length scale and/or variance?
@mike-lawrence already had a good idea, to use the Kronecker trick.
With the GP approximation it's better to multiply by the covariance matrix directly.

I’d do something like this:

y_hat = approx_L(....) * err * transpose(cholesky_decompose(Sigma));
wind ~ normal(y_hat[,1], sigma_wind);
current ~ normal(y_hat[,2], sigma_current);

And following @bbbales2's scaling advice.

Sorry, I wasn't trying to imply that a CARMA(2,1) was necessarily appropriate! There's of course a lot more to modelling decisions than 2 minutes reading a post :) I just had the example to hand.

Sorry if I sounded offensive, not intended. I’m not a native speaker.


What are the units on this Distance, by the way? Is this Gaussian process a function of time?

If so then yeah, the other stuff in this thread (what @Charles_Driver and @andre.pfeuffer and @ahartikainen are talking about) might make more sense than doing an expansion.

The issue with time variables is that for the expansion we want the domain of the GP to all be centered around the point where we’re going to do the expansion. Time variables tend to not have a nice center to build this expansion around :D.

Yes, the Gaussian process is a function of time. The time series is a continuous hourly vector of wind speeds fit to a continuous hourly vector of current speeds.

I dug into your Westbrook basketball example a bit more and in looking at the data set for that, I see what you mean about centering around a point. For that example, the “distance” is also essentially time, but relative to the midpoint of the game clock. So you are able to provide a distance from the middle of the game for all of the data points. I tried to feed in a distance from the middle of my own time series, but the “Distance” vector going into the GP approximation looks like (-1449, -1448, -1447, … , 0, 1, 2, 3, … 1499). So it sounds like that isn’t necessarily the wrong approach, but that it needs to be scaled and then the hyperparameters going into the approximation/expansion need to be adjusted as well.

I will play with this: I'll try to get @andre.pfeuffer's normalization routine to work with my data, and then plot the "L" approximation and the full GP covariance matrices for a few different values of the length-scale and scale parameters.

By the way, regarding @Charles_Driver's response: thank you for the suggestions. I really don't know much at all about SDEs, but I will take a look at that option. I'm not familiar with your package or the ctModel function, but I will check them out. And yes, the currents fit in this model are spectrally de-tided to remove tidal frequencies and Butterworth-filtered to remove high-frequency noise; while I removed the mean from the series to perform those manipulations, I added the mean back before fitting the model.

Ah, okay, I’d skip the expansion thing then. Looking at the plot in your first post, there’s just a lot of wiggles there, and that’s gonna take a ton of basis functions to represent. Have a look at the stuff @Bonnevie suggested above, I think.

I realize I should be providing more context. I looked at simple autoregressive models initially in this project, but the attraction of the GP came from its flexibility in dealing with the amount of error due to autocorrelation. In this model, under some (physical, real-world, wind) conditions the error is highly autocorrelated (mostly when winds are strong), but when winds are weak there is less information there to understand the currents. At the very least, the length-in-time scale parameter would change between such conditions. That is basically how I ended up here. That said, I don't really know what I am doing and am humbly open to any suggestions. I have learned a great deal from working through this, especially on this forum, so my continued gratitude to all of you.