Using GP for autocorrelation in a time-series model: memory pressure, GP kernel definition

techniques
performance

#1

Hi All,

I am attempting to model the response of ocean currents to wind forcing as part of a study of coastal oceanographic processes around headlands. My hope is to find the right framework in which to deal with these highly autocorrelated time-series so that I can extend that framework to model numerous different individual relationships between forcing mechanisms (wind, tides, etc.) and responses (current speed, vorticity, etc.).

My currents are measured at two-minute intervals, but I’ve been working with hourly averages because of the very large amount of data at 2 minutes. Ideally I would be able to use the full data set, but I can’t even get a model run with hourly data, so let’s start with that.

I have about 2000 observations of currents and wind. I have been trying to fit the model to current data that I have harmonically detided and low-pass filtered, though I could use the raw data (and supply it here if requested) as well. Here is what the series looks like (currents in black, wind in red).

My problem is that I am using a ton of RAM to deal with this covariance matrix for the GP in the model. I can fit to a subset of the data on my laptop (R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.3) but it takes hours to fit even 1/3 of the data set and I can’t do anything with it after the fact without R crashing (I can look at model output when fitting to just the first 200 rows of the data). I have a lab computer that has a better processor but the same amount of RAM (16GB) and is running 64-bit Windows 10 that won’t even run the sampling (returns “Error: cannot allocate vector of size [N] mb”).

I apologize for my ignorance, but I am not sure whether I can improve performance with my current model specification or whether I need to change entirely the way I am specifying the model. I am using a full distance matrix with the “distance” in hours between all of the data points, so this is the apparent choke point in the model. Is there a better way to define the gaussian process kernel function for this purpose? If not, is there some way to improve performance in the way I have specified the stan model? OR should I just drop the gaussian process idea and try a different way of modeling the relationship between two autocorrelated time-series? Any help or references would be greatly appreciated. Stan code below and data attached. Thanks in advance!

-Connor Dibble

Here’s what I am passing as stan data. The wind series has been centered and standardized.

```
> str(stan_data)
List of 7
 $ N       : int 1920
 $ time    : num [1:1920] 1.46e+09 1.46e+09 1.46e+09 1.46e+09 1.46e+09 ...
 $ currents: num [1:1920] 369 373 377 382 386 ...
 $ wind    : num [1:1920] -0.2113 -0.0332 0.0432 0.0355 0.271 ...
 $ Dmat    : num [1:1920, 1:1920] 0 1 2 3 4 5 6 7 8 9 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:1920] "1" "2" "3" "4" ...
  .. ..$ : chr [1:1920] "1" "2" "3" "4" ...
 $ sim_wind: num [1:100] -25 -24.6 -24.3 -23.9 -23.6 ...
 $ S       : num 100
> <a class="attachment" href="/uploads/mc_stan/original/2X/4/450e80f2007e00472fd2c20b36d17bff37db20a0.R">m1_stan_data.R</a> (475.9 KB)

model.fit <- rstan::sampling(object = model, data = stan_data, chains = 1, iter = 5000,
warmup = 1000, refresh = 250)
```
And here is my stan model:

```
data {
  int<lower=1> N; // number of (time) points in series
  vector[N] currents; // vector of Alongshore current speeds
  vector[N] wind; // vector of wind speeds
  matrix[N,N] Dmat; // Distance matrix (scaled = /3600)
  //prediction
  int<lower=1> S; // length of wind series to simulate over
  vector[S] sim_wind; // vector of wind values to simulate over
}
parameters {
  vector[N] g; // g is the variable intercept offset
  real a; // overall model intercept
  real b; // slope describing relationship to wind predictor
  real<lower=0> sigma_currents;
  real<lower=0> eta_sq;
  real<lower=0> inv_rho_sq;
  real<lower=0> sigma_sq;
}
transformed parameters {
  matrix[N,N] Sigma;
  matrix[N,N] L;
  real<lower=0> rho_sq;
  rho_sq = inv(inv_rho_sq);
  // off-diagonal elements of covariance matrix
  for (i in 1:(N-1)){
      for (j in (i+1):N){
          Sigma[i,j] =  eta_sq * exp(-rho_sq * pow(Dmat[i,j],2));
          Sigma[j,i] = Sigma[i,j];
        }
    }
    // diagonal elements of covariance matrix
    for(k in 1:N) {
      Sigma[k,k] = eta_sq + sigma_sq; // adding jitter to diag elements
    }
  L = cholesky_decompose(Sigma);
}
model {
    vector[N] mu;
    //priors
    eta_sq ~ cauchy(0,1);
    inv_rho_sq ~ cauchy(0,1);
    sigma_sq ~ cauchy(0,1);
    a ~ normal(0, 20);
    b ~ normal(0, 1);
    sigma_currents ~ normal(0, 200); // overall model sigma
    //model
    g ~ multi_normal_cholesky(rep_vector(0, N), L); // model variable intercepts as gaussian process over function Sigma
    for (i in 1:N) {
      mu[i] = a + g[i] + b * wind[i];
    }
    currents ~ normal(mu, sigma_currents);
}
generated quantities{
  vector[S] sim_cur; //simulated currents output
  sim_cur = a + b*sim_wind;
} 
```

m1_stan_data.R (475.9 KB)


#2

There might be some small peformance improvements you can make (ex. I don’t think you need the for loop in your computation of mu), but in general GPs become unfeasible computationally with more than 100 or so sample points. You might want to check out splines as an approximation for such cases.


#3

Actually, do you intend to look at the posterior of g, or is it merely there to model the autocorrelation? If the latter (it looks like you’re more interested in sim_cur), I think you should be able to apply the Kronecker trick Flaxman et al discovered. Maybe @avehtari or @rtrangucci could verify?


#4

Oh, and while I see you’ve attempted to reduce redundant computation by pre-computing the distance matrix, I’ve benchmarked this (and an even more efficient version using only the unique distances) against cov_exp_quad and found that cov_exp_quad is nearly as fast and certainly cleaner/harder-to-mess up.


#5

Hi Mike,

I’m not concerned about the posterior of ‘g’ and you are correct that is only there to model autocorrelation. In fact, I have run a version of this model where the initiation and computation of ‘g’ is within brackets, so defined as a local variable and not saved to RAM, if I understand correctly. I was attempting there to get some performance improvement by dropping g posteriors. As specified, the model is computing NxN posterior distributions for g, right? So it’s isn’t really even feasible to do anything with them.

Thank you for the insights. Splines certainly might be a way to approximate this, and the idea of approximating seems fine in principle, since the autoregressive correlations should not be all that different throughout the series. I’ll take an in depth look at the linked examples you gave. I’m also interested in this Kronecker trick. My goal here is to be able to show the relationship between wind and currents across a range of simulated values so that I can demonstrate a monotonic relationship divorced from the noisy raw data. It seems like a trivial thing when talking about wind and current speed, since we generally have a good intuition about the relationship there, but for subsequent models I think there will be real value to showing a simulation across a range of values instead of trying to get readers to zoom in on noisy raw data to see a relationship.

As far as cov_exp_quad, I’ll try running with that for the GP kernel. Any thoughts on entirely different kernel representations? I’ll get cracking on Rasmussen and Williams as well…

Thank you!


#6

See for the recommend use of GPs in Stan, including cov_exp_quad and a non-centered parameterization, and a discussion of the informative priors you’ll need. The non-centered parameterization in particular should significantly speed up the fit, but dense GPs will always be limited by the burden of having to model correlations between all input points.


#7

And see a recent case study
http://mc-stan.org/users/documentation/case-studies/nngp.html
that I think you can take advantage of because you have an unambiguous ordering through time.


#8

Given your goals, have you considered an SDE instead of a GP? A linear SDE system would be able to handle those time series size without a (execution time and ram) problem (mainly because it only considers pairs of subsequent and not all time points).


#9

I really appreciate everyone’s input. I will hit the books and try to implement some of these potential solutions. I’ll post back here with the results.

Thank you for all of your great suggestions!


#10

Hi there,

I was wondering if there was a link or resource that you had intended to include in your post? I am trying to see if I can get to where I can fit this model as a GP using non-centered parameterization and improving my priors, as you mentioned before I try estimating it using splines. If you had a reference in mind, I’d love to know about it. Otherwise I’ll keep sifting around.

Cheers,

Connor


#11

Whoops, my bad. https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html.


#12

Hey Mike,

I have been trying to find an example of this Kronecker trick applied and the ones I can find (in @rtrangucci 's github repository) seem to incorporate responses over a 2+D grid. I just want the GP over a single (time) dimension. Are you aware as to whether I can apply the same method in a single dimension? This model:

seems close to what I want (with D = 1). If so, then I guess the Kronecker trick basically comes down to this line?

    latent_gp = L_K * y_tilde * diag_pre_multiply(alphas, L_Omega)';

It seems like the scale of K hasn’t changed. Would efficiency gains related to the separability of covariance structures only apply if we’re talking about computational limits arising from multiple outputs over some grid?

When I run the above model with my data (just modeling the latent gp, leaving out the wind predictor), I still estimate two variables, latent_gp and y_tilde, every time I take a draw from the posterior. As such, I could use some clarification on how or whether the Kronecker trick can reduce the computational load of the model I’m aiming for. Any thoughts would be greatly appreciated.

Cheers,

Connor


#13

I am not aware of how a Kronecker covariance matrix could be derived from purely 1-dimensional input. Limiting the bandwidth of a typical covariance function could result in a banded covariance matrix which could be inverted much more quickly, but that’s a modeling choice.


#14

functions {
matrix approx_L(int M, real scale, vector x, real l) {
int N = rows(x);

    real epsilon = 1 / (sqrt2() * l);
    real alpha = 1 / scale;
    real beta = (1.0 + (2.0 * epsilon / alpha)^2)^0.25;
    real delta =  alpha* sqrt((beta^2 - 1.0) / 2.0);
    vector[N] Ht[M];
    matrix[N, M] Htt;
    vector[N] xp = alpha * beta * x;
    real fsq = epsilon^2 / (alpha^2 + delta^2 + epsilon^2);
    real f = sqrt(fsq);
    Ht[1] = sqrt(sqrt(alpha^2 / (alpha^2 + delta^2 + epsilon^2))) * sqrt(beta) * exp(-delta^2 * x .* x);
    Ht[2] = f * sqrt(2.0) * xp .* Ht[1];
    for(n in 3:M) {
      Ht[n] = f * sqrt(2.0 / (n - 1.0)) * xp .* Ht[n - 1] - fsq * sqrt((n - 2.0) / (n - 1.0)) * Ht[n - 2];
    }

    for(n in 1:M) {
      Htt[:, n] = Ht[n];
    }
    return  Htt;
  }
}
data {
  int<lower=1> N; // number of (time) points in series
  vector[N] currents; // vector of Alongshore current speeds
  vector[N] wind; // vector of wind speeds
  //matrix[N,N] Dmat; // Distance matrix (scaled = /3600)

  int scale;  // set to 0.2
  vector[N] Distance;
  int<lower=1> M;  // approx points

  //prediction
  int<lower=1> S; // length of wind series to simulate over
  vector[S] sim_wind; // vector of wind values to simulate over
}
parameters {
  vector[M] g; // g is the variable intercept offset
  real a; // overall model intercept
  real b; // slope describing relationship to wind predictor
  real<lower=0> sigma_currents;
  real<lower=0> eta_sq;
  real<lower=0> inv_rho_sq;
  real<lower=0> sigma_sq;
}

transformed parameters {
  real<lower=0> rho_sq = inv(inv_rho_sq);
}

model {
    vector[N] mu;
    //priors
    eta_sq ~ cauchy(0,1);
    inv_rho_sq ~ cauchy(0,1);
    sigma_sq ~ cauchy(0,1);
    a ~ normal(0, 5);
    b ~ normal(0, 1);
    sigma_currents ~ normal(0, 10); // overall model sigma
    //model
    g ~ normal(0, 1);
    mu = a + approx_L(M, scale, Distance, inv_rho_sq) * g + b * wind;
    currents ~ normal(mu, sigma_currents);
}
generated quantities{
  vector[S] sim_cur; //simulated currents output
  sim_cur = a + b*sim_wind;
}

You can try this one. I modified the approx_L in, the approx_L is
from bbales2.

Choose M to something reasonable, since your frequencies are low, M doesn’t need to be very high.
Nyquist criterion.

Added, the mean of Distance should be zero. Not done in my code.

The approximation will work much better if the mean of the independent variable x is zero! The approximate kernels are not translation invariant or stationary or however you think about it! Here’s a picture of the first ten eigenfunctions (scaled by the square root of their eigenvalues). The approximation will fail if there is data out near the boundaries (< ~-1.2 or > ~1.2) cause all the eigenfunctions are near zero there:


#15

Since you are working with one-dimensional GPs you might be interested in the work that has gone into converting GPs with particular choices of kernel into Kalman filters.
The original work is by Simo Särkkä’s group, e.g.
J. Hartikainen and S. Särkkä, “Kalman filtering and smoothing solutions to temporal Gaussian process regression models,” in 2010 IEEE International Workshop on Machine Learning for Signal Processing, 2010, pp. 379–384.

but I also find this work from Oxford to be quite neat:
https://arxiv.org/abs/1510.02830


Gaussian process roadmap
#16

Here is also a blogpost with stan code

I only have one question. Can we somehow make it scale as linear and not quadratic?


#17

Good question.

Do I see it right? SDE is slower.

http://www.juhokokkala.fi/blog/posts/linear-time-evaluation-of-matern-32-gp-density-in-stan-using-the-sde-representation/

Inference for Stan model: Matern32Poisson_basicGP_reparametrized_model
1 chains: each with iter=(2000000); warmup=(0); thin=(1); 2000000 iterations saved.

Warmup took (30) seconds, 30 seconds total
Sampling took (2886) seconds, 48 minutes total

Inference for Stan model: Matern32Poisson_SDE_reparametrized_model
1 chains: each with iter=(2000000); warmup=(0); thin=(1); 2000000 iterations saved.

Warmup took (26) seconds, 26 seconds total
Sampling took (3629) seconds, 1.0 hours total


#18

That is for 5 points.

For 100 points

Inference for Stan model: Matern32Poisson_basicGP_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (2458) seconds, 41 minutes total
Sampling took (1827) seconds, 30 minutes total


Inference for Stan model: Matern32Poisson_SDE_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (127) seconds, 2.1 minutes total
Sampling took (105) seconds, 1.7 minutes total


Inference for Stan model: Matern32Poisson_basicGP_reparametrized_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (1240) seconds, 21 minutes total
Sampling took (1091) seconds, 18 minutes total


Inference for Stan model: Matern32Poisson_SDE_reparametrized_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (64) seconds, 1.1 minutes total
Sampling took (56) seconds, 56 seconds total

Not really sure if the properties are right? warmup=0 --> warmup took 40 minutes.


#19

Hey Andre and Ben,

@andre.pfeuffer
@bbbales2

I know this is fairly old, but I am still trying to get this model to work for my purposes. A quick reminder- I am trying to use the approximated GP to model error due to temporal autocorrelation in a model describing the relationship between winds and ocean currents.

I’ve (finally) realized that I am having a problem with the “Distance” vector, x, in the approx_L function. I have been giving the model a single distance vector centered at zero to use for the fit. That vector represents the distance in time between sampling points, but following earlier advice it is demeaned (so that the mean is zero). What is happening is that the error term is really only valid for the middle of the data set, because that is where distance is zero due to the demeaning procedure. Thus, the distance to the other data points is based on the exactly middle data point’s location in time.

The posterior distribution of the parameters (there are M of them), ‘g’, that describes the variable intercept due to autocorrelation error ends up looking a lot like zero. But if you pull the GP out and do some inference there, you see that it is near zero at the extremes and has non-zero values right around the data points nearest the center of the dataset, i.e., where the distance vector is accurately describing distance to other points. So the GP approximated error seems to work, but only where the distance vector accurately represents the relationship of the current data point to all of the other data points. Here is a plot showing the mean posterior value of ‘gp’ where

gp = approx_L(M, scale, Distance, inv_rho_sq) * g;

What it seems like I need to do is update the distance vector that I feed to approx_L(), which is called xt in Ben’s original function, such that it is zero for the present data point, rather than setting the middle of the data set as the zero position. Does that make any sense? So I think I need to adapt approx_L() to deal with a different distance vector depending on the data point for which I’m estimating an error term.

Does any of this seem to make sense? Any ideas on how to make that work in the stan code?

I appreciate your thoughts!

Here’s my model:

functions {
matrix approx_L(int M, real scale, vector x, real l) {
    int N = rows(x); 
    real epsilon = 1 / (sqrt2() * l);
    real alpha = 1 / scale;
    real beta = (1.0 + (2.0 * epsilon / alpha)^2)^0.25;
    real delta =  alpha* sqrt((beta^2 - 1.0) / 2.0);
    vector[N] Ht[M];
    matrix[N, M] Htt;
    vector[N] xp = alpha * beta * x;
    real fsq = epsilon^2 / (alpha^2 + delta^2 + epsilon^2);
    real f = sqrt(fsq);
    Ht[1] = sqrt(sqrt(alpha^2 / (alpha^2 + delta^2 + epsilon^2))) * sqrt(beta) * exp(-delta^2 * x .* x);
    Ht[2] = f * sqrt(2.0) * xp .* Ht[1];
    for(n in 3:M) {
      Ht[n] = f * sqrt(2.0 / (n - 1.0)) * xp .* Ht[n - 1] - fsq * sqrt((n - 2.0) / (n - 1.0)) * Ht[n - 2];
    }

    for(n in 1:M) {
      Htt[:, n] = Ht[n];
    }
    return  Htt;
  }
}
data {
  // model parameters
  int<lower=1> N; // number of (time) points in series
  int<lower=0> K; // number of predictos
  matrix[N, K] X; // predictor design matrix
  vector[N] response; // vector of Alongshore current speeds

  // GP apprximation parameters (modeling error as approx GP)
  real scale;  // set to 0.2
  vector[N] Distance;
  int<lower=1> M;  // number of eigenvectors of var-covar matrix to retain to approximate the GP

  //prediction
  int<lower=1> S; // length of wind series to simulate over
  int<lower=0> sK; // number of predictors for simulation
  matrix[S, sK] SIM; // matrix of ismulated values
}
parameters {
  vector[M] g; // g is the variable intercept offset
  vector[K] B; // coefficients of the model (both intercepts and slopes)
  real<lower=0> sigma_response;
  // real<lower=0> eta_sq;
  real<lower=0> inv_rho_sq;
  // real<lower=0> sigma_sq;
}

transformed parameters {
  real<lower=0> rho_sq = inv(inv_rho_sq);

  vector[N] gp; // gaussian process approximation; represents temporal autocorrelation
  gp = approx_L(M, scale, Distance, inv_rho_sq) * g;

}

model{
    vector[N] mu;
    //priors
    // eta_sq ~ cauchy(0,1);
    inv_rho_sq ~ cauchy(0,1);
    // sigma_sq ~ cauchy(0,1);
    B[1] ~ normal(0,5); // intercept prior
    B[2:K] ~ normal(0,1); // slope priors
    g ~ normal(0, 1);


    sigma_response ~ normal(0, 5); // overall model sigma; 5 is probably too big for the variance hyperparameter
    //model
    mu = X * B + gp;
    response ~ normal(mu, sigma_response);
}
generated quantities{
  vector[S] sim_cur; //simulated response output
  vector[N] mu_pred;
  vector[N] pred_cur;
  vector[N] log_lik;
  vector[N] resids;
    sim_cur = SIM * B; // simulated response across the range of params
    mu_pred = X*B + gp; //+ approx_L(M, scale, Distance, inv_rho_sq) * g;
  for(i in 1:N){
    pred_cur[i] = normal_rng(mu_pred[i], sigma_response); // posterior predictions of the model
    log_lik[i] = normal_lpdf(response[i] | mu_pred[i], sigma_response); // compute log likelihood for us in AIC 
  }  
  resids = response - mu_pred;
}

#20

Take a look at:

matrix[P, M] L = approx_L(M, scale, xp, sigma, l);
log10error = log10(max(fabs(cov_exp_quad(xp, sigma, l) - L * L')));

from:
https://github.com/bbbales2/basketball/blob/master/models/approx_bernoulli_gp_fixed_sigma.stan

(and ignore x and xp are messed in the source). So the approx_L needs the absolute (centered) points and not the distance.

I use the following function:

normalize <- function(x) {
  x <- as.matrix(x)
  minAttr=apply(x, 2, min)
  maxAttr=apply(x, 2, max)
  x <- sweep(x, 2, minAttr, FUN="-")
  x=sweep(x, 2,  maxAttr-minAttr, "/")
  attr(x, 'normalized:min') = minAttr
  attr(x, 'normalized:max') = maxAttr
  return (x)
}

ddate.num <- normalize(as.numeric(x$dates))

Added: have to subtract 0.5 to be mean centered

When making predictions using cov_exp_quad in generated_quantities doesn’t require
derivatives, so it will not hit the performance. Copied the following from a k = 1,…, N, 2 dimensionals GP, I’ve written one year ago. Its needs modification, but clearly shows the idea:

int N1 = s1[k];
int N2 = s2[k];
matrix[N1, N2] k_x1_x2 = cov_exp_quad(to_array_1d(ddate1[pos1:(pos1 + N1 - 1)])  // N1
                           , to_array_1d(ddate2[pos2:(pos2 + N2 - 1)])  // N2
                           , 1.0, l);
matrix[N1, N1] K = cov_exp_quad(
      to_array_1d(ddate1[pos1:(pos1 + N1 - 1)]), 1.0, l);
matrix[N1, M[k]] L;
matrix[N1, 2] t;
matrix[N1, N1] L_K;
matrix[N2, 2] t_hat;
vector[N1] K_div_y1;
vector[N2] tmp;
for (n in 1:N1)
  K[n, n] = K[n,n] + 1e-14;
L_K = cholesky_decompose(K);
L = approx_L(M[k], scale, ddate1[pos1:(pos1 + N1 - 1)], l);

t = L * err[IndexStarts_knots[k]:IndexEnds_knots[k]];

for(i in 1:2) {
  K_div_y1 = mdivide_left_tri_low(L_K, t[, i]);
  K_div_y1 = mdivide_right_tri_low(K_div_y1', L_K)';
  t_hat[, i] = k_x1_x2' * K_div_y1;
}