I really appreciate everyone’s input. I will hit the books and try to implement some of these potential solutions. I’ll post back here with the results.

Thank you for all of your great suggestions!

I really appreciate everyone’s input. I will hit the books and try to implement some of these potential solutions. I’ll post back here with the results.

Thank you for all of your great suggestions!

Hi there,

I was wondering if there was a link or resource that you had intended to include in your post? I am trying to see if I can get to where I can fit this model as a GP using non-centered parameterization and improving my priors, as you mentioned before I try estimating it using splines. If you had a reference in mind, I’d love to know about it. Otherwise I’ll keep sifting around.

Cheers,

Connor

Hey Mike,

I have been trying to find an example of this Kronecker trick applied and the ones I can find (in @rtrangucci 's github repository) seem to incorporate responses over a 2+D grid. I just want the GP over a single (time) dimension. Are you aware as to whether I can apply the same method in a single dimension? This model:

seems close to what I want (with D = 1). If so, then I guess the Kronecker trick basically comes down to this line?

```
latent_gp = L_K * y_tilde * diag_pre_multiply(alphas, L_Omega)';
```

It seems like the scale of K hasn’t changed. Would efficiency gains related to the separability of covariance structures only apply if we’re talking about computational limits arising from multiple outputs over some grid?

When I run the above model with my data (just modeling the latent gp, leaving out the wind predictor), I still estimate two variables, latent_gp and y_tilde, every time I take a draw from the posterior. As such, I could use some clarification on how or whether the Kronecker trick can reduce the computational load of the model I’m aiming for. Any thoughts would be greatly appreciated.

Cheers,

Connor

1 Like

I am not aware of how a Kronecker covariance matrix could be derived from purely 1-dimensional input. Limiting the bandwidth of a typical covariance function could result in a banded covariance matrix which could be inverted much more quickly, but that’s a modeling choice.

functions {

matrix approx_L(int M, real scale, vector x, real l) {

int N = rows(x);

```
real epsilon = 1 / (sqrt2() * l);
real alpha = 1 / scale;
real beta = (1.0 + (2.0 * epsilon / alpha)^2)^0.25;
real delta = alpha* sqrt((beta^2 - 1.0) / 2.0);
vector[N] Ht[M];
matrix[N, M] Htt;
vector[N] xp = alpha * beta * x;
real fsq = epsilon^2 / (alpha^2 + delta^2 + epsilon^2);
real f = sqrt(fsq);
Ht[1] = sqrt(sqrt(alpha^2 / (alpha^2 + delta^2 + epsilon^2))) * sqrt(beta) * exp(-delta^2 * x .* x);
Ht[2] = f * sqrt(2.0) * xp .* Ht[1];
for(n in 3:M) {
Ht[n] = f * sqrt(2.0 / (n - 1.0)) * xp .* Ht[n - 1] - fsq * sqrt((n - 2.0) / (n - 1.0)) * Ht[n - 2];
}
for(n in 1:M) {
Htt[:, n] = Ht[n];
}
return Htt;
}
}
data {
int<lower=1> N; // number of (time) points in series
vector[N] currents; // vector of Alongshore current speeds
vector[N] wind; // vector of wind speeds
//matrix[N,N] Dmat; // Distance matrix (scaled = /3600)
int scale; // set to 0.2
vector[N] Distance;
int<lower=1> M; // approx points
//prediction
int<lower=1> S; // length of wind series to simulate over
vector[S] sim_wind; // vector of wind values to simulate over
}
parameters {
vector[M] g; // g is the variable intercept offset
real a; // overall model intercept
real b; // slope describing relationship to wind predictor
real<lower=0> sigma_currents;
real<lower=0> eta_sq;
real<lower=0> inv_rho_sq;
real<lower=0> sigma_sq;
}
transformed parameters {
real<lower=0> rho_sq = inv(inv_rho_sq);
}
model {
vector[N] mu;
//priors
eta_sq ~ cauchy(0,1);
inv_rho_sq ~ cauchy(0,1);
sigma_sq ~ cauchy(0,1);
a ~ normal(0, 5);
b ~ normal(0, 1);
sigma_currents ~ normal(0, 10); // overall model sigma
//model
g ~ normal(0, 1);
mu = a + approx_L(M, scale, Distance, inv_rho_sq) * g + b * wind;
currents ~ normal(mu, sigma_currents);
}
generated quantities{
vector[S] sim_cur; //simulated currents output
sim_cur = a + b*sim_wind;
}
```

You can try this one. I modified the approx_L in, the approx_L is

from **bbales2**.

Choose M to something reasonable, since your frequencies are low, M doesn’t need to be very high.

Nyquist criterion.

Added, the mean of Distance should be zero. Not done in my code.

The approximation will work much better if the mean of the independent variable x is zero! The approximate kernels are not translation invariant or stationary or however you think about it! Here’s a picture of the first ten eigenfunctions (scaled by the square root of their eigenvalues). The approximation will fail if there is data out near the boundaries (< ~-1.2 or > ~1.2) cause all the eigenfunctions are near zero there:

1 Like

Since you are working with one-dimensional GPs you might be interested in the work that has gone into converting GPs with particular choices of kernel into Kalman filters.

The original work is by Simo Särkkä’s group, e.g.

J. Hartikainen and S. Särkkä, “Kalman filtering and smoothing solutions to temporal Gaussian process regression models,” in 2010 IEEE International Workshop on Machine Learning for Signal Processing, 2010, pp. 379–384.

but I also find this work from Oxford to be quite neat:

https://arxiv.org/abs/1510.02830

2 Likes

Here is also a blogpost with stan code

I only have one question. Can we somehow make it scale as linear and not quadratic?

2 Likes

Good question.

Do I see it right? SDE is slower.

Inference for Stan model: Matern32Poisson_basicGP_reparametrized_model

1 chains: each with iter=(2000000); warmup=(0); thin=(1); 2000000 iterations saved.Warmup took (30) seconds, 30 seconds total

Sampling took (2886) seconds, 48 minutes total

Inference for Stan model: Matern32Poisson_SDE_reparametrized_model

1 chains: each with iter=(2000000); warmup=(0); thin=(1); 2000000 iterations saved.Warmup took (26) seconds, 26 seconds total

Sampling took (3629) seconds, 1.0 hours total

That is for 5 points.

For 100 points

```
Inference for Stan model: Matern32Poisson_basicGP_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.
Warmup took (2458) seconds, 41 minutes total
Sampling took (1827) seconds, 30 minutes total
Inference for Stan model: Matern32Poisson_SDE_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.
Warmup took (127) seconds, 2.1 minutes total
Sampling took (105) seconds, 1.7 minutes total
Inference for Stan model: Matern32Poisson_basicGP_reparametrized_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.
Warmup took (1240) seconds, 21 minutes total
Sampling took (1091) seconds, 18 minutes total
Inference for Stan model: Matern32Poisson_SDE_reparametrized_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.
Warmup took (64) seconds, 1.1 minutes total
Sampling took (56) seconds, 56 seconds total
```

Not really sure if the properties are right? warmup=0 --> warmup took 40 minutes.

Hey Andre and Ben,

I know this is fairly old, but I am still trying to get this model to work for my purposes. A quick reminder- I am trying to use the approximated GP to model error due to temporal autocorrelation in a model describing the relationship between winds and ocean currents.

I’ve (finally) realized that I am having a problem with the “Distance” vector, x, in the approx_L function. I have been giving the model a single distance vector centered at zero to use for the fit. That vector represents the distance in time between sampling points, but following earlier advice it is demeaned (so that the mean is zero). What is happening is that the error term is really only valid for the middle of the data set, because that is where distance is zero due to the demeaning procedure. Thus, the distance to the other data points is based on the exactly middle data point’s location in time.

The posterior distribution of the parameters (there are M of them), ‘g’, that describes the variable intercept due to autocorrelation error ends up looking a lot like zero. But if you pull the GP out and do some inference there, you see that it is near zero at the extremes and has non-zero values right around the data points nearest the center of the dataset, i.e., where the distance vector is accurately describing distance to other points. So the GP approximated error seems to work, but only where the distance vector accurately represents the relationship of the current data point to all of the other data points. Here is a plot showing the mean posterior value of ‘gp’ where

`gp = approx_L(M, scale, Distance, inv_rho_sq) * g;`

What it seems like I need to do is update the distance vector that I feed to approx_L(), which is called xt in Ben’s original function, such that it is zero for the present data point, rather than setting the middle of the data set as the zero position. Does that make any sense? So I think I need to adapt approx_L() to deal with a different distance vector depending on the data point for which I’m estimating an error term.

Does any of this seem to make sense? Any ideas on how to make that work in the stan code?

I appreciate your thoughts!

Here’s my model:

```
functions {
matrix approx_L(int M, real scale, vector x, real l) {
int N = rows(x);
real epsilon = 1 / (sqrt2() * l);
real alpha = 1 / scale;
real beta = (1.0 + (2.0 * epsilon / alpha)^2)^0.25;
real delta = alpha* sqrt((beta^2 - 1.0) / 2.0);
vector[N] Ht[M];
matrix[N, M] Htt;
vector[N] xp = alpha * beta * x;
real fsq = epsilon^2 / (alpha^2 + delta^2 + epsilon^2);
real f = sqrt(fsq);
Ht[1] = sqrt(sqrt(alpha^2 / (alpha^2 + delta^2 + epsilon^2))) * sqrt(beta) * exp(-delta^2 * x .* x);
Ht[2] = f * sqrt(2.0) * xp .* Ht[1];
for(n in 3:M) {
Ht[n] = f * sqrt(2.0 / (n - 1.0)) * xp .* Ht[n - 1] - fsq * sqrt((n - 2.0) / (n - 1.0)) * Ht[n - 2];
}
for(n in 1:M) {
Htt[:, n] = Ht[n];
}
return Htt;
}
}
data {
// model parameters
int<lower=1> N; // number of (time) points in series
int<lower=0> K; // number of predictos
matrix[N, K] X; // predictor design matrix
vector[N] response; // vector of Alongshore current speeds
// GP apprximation parameters (modeling error as approx GP)
real scale; // set to 0.2
vector[N] Distance;
int<lower=1> M; // number of eigenvectors of var-covar matrix to retain to approximate the GP
//prediction
int<lower=1> S; // length of wind series to simulate over
int<lower=0> sK; // number of predictors for simulation
matrix[S, sK] SIM; // matrix of ismulated values
}
parameters {
vector[M] g; // g is the variable intercept offset
vector[K] B; // coefficients of the model (both intercepts and slopes)
real<lower=0> sigma_response;
// real<lower=0> eta_sq;
real<lower=0> inv_rho_sq;
// real<lower=0> sigma_sq;
}
transformed parameters {
real<lower=0> rho_sq = inv(inv_rho_sq);
vector[N] gp; // gaussian process approximation; represents temporal autocorrelation
gp = approx_L(M, scale, Distance, inv_rho_sq) * g;
}
model{
vector[N] mu;
//priors
// eta_sq ~ cauchy(0,1);
inv_rho_sq ~ cauchy(0,1);
// sigma_sq ~ cauchy(0,1);
B[1] ~ normal(0,5); // intercept prior
B[2:K] ~ normal(0,1); // slope priors
g ~ normal(0, 1);
sigma_response ~ normal(0, 5); // overall model sigma; 5 is probably too big for the variance hyperparameter
//model
mu = X * B + gp;
response ~ normal(mu, sigma_response);
}
generated quantities{
vector[S] sim_cur; //simulated response output
vector[N] mu_pred;
vector[N] pred_cur;
vector[N] log_lik;
vector[N] resids;
sim_cur = SIM * B; // simulated response across the range of params
mu_pred = X*B + gp; //+ approx_L(M, scale, Distance, inv_rho_sq) * g;
for(i in 1:N){
pred_cur[i] = normal_rng(mu_pred[i], sigma_response); // posterior predictions of the model
log_lik[i] = normal_lpdf(response[i] | mu_pred[i], sigma_response); // compute log likelihood for us in AIC
}
resids = response - mu_pred;
}
```

Take a look at:

```
matrix[P, M] L = approx_L(M, scale, xp, sigma, l);
log10error = log10(max(fabs(cov_exp_quad(xp, sigma, l) - L * L')));
```

from:

https://github.com/bbbales2/basketball/blob/master/models/approx_bernoulli_gp_fixed_sigma.stan

(and ignore x and xp are messed in the source). So the approx_L needs the absolute (centered) points and not the distance.

I use the following function:

```
normalize <- function(x) {
x <- as.matrix(x)
minAttr=apply(x, 2, min)
maxAttr=apply(x, 2, max)
x <- sweep(x, 2, minAttr, FUN="-")
x=sweep(x, 2, maxAttr-minAttr, "/")
attr(x, 'normalized:min') = minAttr
attr(x, 'normalized:max') = maxAttr
return (x)
}
```

`ddate.num <- normalize(as.numeric(x$dates))`

Added: have to subtract 0.5 to be mean centered

When making predictions using cov_exp_quad in generated_quantities doesn’t require

derivatives, so it will not hit the performance. Copied the following from a k = 1,…, N, 2 dimensionals GP, I’ve written one year ago. Its needs modification, but clearly shows the idea:

```
int N1 = s1[k];
int N2 = s2[k];
matrix[N1, N2] k_x1_x2 = cov_exp_quad(to_array_1d(ddate1[pos1:(pos1 + N1 - 1)]) // N1
, to_array_1d(ddate2[pos2:(pos2 + N2 - 1)]) // N2
, 1.0, l);
matrix[N1, N1] K = cov_exp_quad(
to_array_1d(ddate1[pos1:(pos1 + N1 - 1)]), 1.0, l);
matrix[N1, M[k]] L;
matrix[N1, 2] t;
matrix[N1, N1] L_K;
matrix[N2, 2] t_hat;
vector[N1] K_div_y1;
vector[N2] tmp;
for (n in 1:N1)
K[n, n] = K[n,n] + 1e-14;
L_K = cholesky_decompose(K);
L = approx_L(M[k], scale, ddate1[pos1:(pos1 + N1 - 1)], l);
t = L * err[IndexStarts_knots[k]:IndexEnds_knots[k]];
for(i in 1:2) {
K_div_y1 = mdivide_left_tri_low(L_K, t[, i]);
K_div_y1 = mdivide_right_tri_low(K_div_y1', L_K)';
t_hat[, i] = k_x1_x2' * K_div_y1;
}
```

2 Likes

Yeah what @andre.pfeuffer said it right. To do this you gotta center things and then make sure the scale isn’t so big. This is the annoying thing about the approximation, the GP only cares about distances between points, but the approximation ends up caring about the absolute position.

If I’m remembering correctly how far the points can be from zero and be accurately described is determined by a combo of the length scale (larger length scale, and you can be farther away) and number of basis functions (more basis functions, and you can be farther away). And then there’s that scale parameter that I have forgotten the details of but is important :D.

I think the easiest way to figure out if you’re okay or not is to plot a few covariance matrices, one computed with the approximation and one with the real thing and compare. Try various length scales and scales and see what happens. It’s extra work, but that’s the cost of this kind of expansion :P.

So center those numbers and scale them a little.

If your xs are [-1000, 0], then just rescale them to [-0.5, 0.5] or something (like @andre.pfeuffer 's code does) , and then just remember your length scale changed units.

Probably too late to be helpful, but a) I also would have thought an SDE would make more sense than a GP with a long series of time structured data (but interested in any opinions to the contrary!), b) the currents look a fair bit like a sunspots example I replicated from elsewhere, with a 2nd order linear sde + moving average measurement model – carma(2,1). In case it’s of interest:

```
install.packages("devtools")
library(devtools)
install_github("cdriveraus/ctsem")
library(ctsem)
sunspots<-sunspot.year
sunspots<-sunspots[50: (length(sunspots) - (1988-1924))]
id <- 1
time <- 1749:1924
datalong <- cbind(id, time, sunspots)
#setup model
ssmodel <- ctModel(type='stanct', n.latent=2, n.manifest=1,
manifestNames='sunspots',
latentNames=c('ss_level', 'ss_velocity'),
LAMBDA=matrix(c( 1, 'ma1' ), nrow=1, ncol=2),
DRIFT=matrix(c(0, 'a21', 1, 'a22'), nrow=2, ncol=2),
MANIFESTMEANS=matrix(c('m1'), nrow=1, ncol=1),
CINT=matrix(c(0, 0), nrow=2, ncol=1),
T0VAR=matrix(c(1,0,0,1), nrow=2, ncol=2), #Because single subject
DIFFUSION=matrix(c(0, 0, 0, "diffusion"), ncol=2, nrow=2))
ssmodel$pars$indvarying<-FALSE #Because single subject
ssmodel$pars$offset[14]<- 44 #Because not mean centered
ssmodel$pars[4,c('transform','offset')]<- c(1,0) #To avoid multi modality
#fit
ssfit <- ctStanFit(datalong, ssmodel, iter=300, chains=2)
#output
summary(ssfit)
ctKalman(ssfit,timestep = .02,kalmanvec=c('y','ysmooth'),plot=TRUE)
par(mfrow=c(3,3))
ctStanPostPredict(ssfit,diffsize=c(5,20),wait=FALSE)
```

1 Like

Ok, but we have to check the stationary of the time series first. Maybe we need to take the

differences and what makes us sure, that an ARMA(2,1) is appropriate? He mentioned he used a filter. Before applying a filter, one has to detrend. However the model fitting an intercept.

```
g ~ multi_normal_cholesky(rep_vector(0, N), L); // model variable intercepts as gaussian process over function Sigma
for (i in 1:N) {
mu[i] = a + g[i] + b * wind[i];
}
currents ~ normal(mu, sigma_currents);
```

The graphic in the first post looks to me showing mean centered time-series. Another question,

do both time series share the same length-scale and/or variance?

@mike-lawrence had already a good idea, to use the Kronecker trick.

Using the GP approximation its better to multiply the covariance matrix directly.

I’d do something like this:

```
y_hat = approx_L(....) * err * transpose(cholesky_decompose(Sigma));
wind ~ normal(y_hat[,1], sigma_wind);
current ~ normal(y_hat[,2], sigma_current);
```

And following @bbbales2 scaling advise.

Sorry I wasn’t trying to imply that a carma(2,1) was necessarily appropriate! There’s of course a lot more to modelling decisions than 2 minutes reading a post :) Just had the example to hand.

Sorry if I sounded offensive, not intended. I’m not a native speaker.

1 Like

What are the units on this Distance, by the way? Is this Gaussian process a function of time?

If so then yeah, the other stuff in this thread (what @Charles_Driver and @andre.pfeuffer and @ahartikainen are talking about) might make more sense than doing an expansion.

The issue with time variables is that for the expansion we want the domain of the GP to all be centered around the point where we’re going to do the expansion. Time variables tend to not have a nice center to build this expansion around :D.

Yes the gaussian process is a function of time. The time series is a continuous hourly vector of wind speeds fit to a continuous hourly vector of current speeds.

I dug into your Westbrook basketball example a bit more and in looking at the data set for that, I see what you mean about centering around a point. For that example, the “distance” is also essentially time, but relative to the midpoint of the game clock. So you are able to provide a distance from the middle of the game for all of the data points. I tried to feed in a distance from the middle of my own time series, but the “Distance” vector going into the GP approximation looks like (-1449, -1448, -1447, … , 0, 1, 2, 3, … 1499). So it sounds like that isn’t necessarily the wrong approach, but that it needs to be scaled and then the hyperparameters going into the approximation/expansion need to be adjusted as well.

I will play with this- try to get @andre.pfeuffer normalization routine to work with my data and then plot out the “L” approximation and the full GP covariance matrices with a few different values for the length scale and scale parameters.

By the way, regarding @Charles_Driver’s response- thank you for the suggestions. I really don’t know much at all about SDEs, but I will take a look at the option. I’m not familiar with your package and the ctModel function, but I will check it out. BTW, yes the currents that are fit in this model are spectrally de-tided to remove tidal frequencies and butterworth filtered to remove high-frequency noise; while I removed the mean from the series to perform those manipulations, I added the mean back to the series before fitting to the model.

Ah, okay, I’d skip the expansion thing then. Looking at the plot in your first post, there’s just a lot of wiggles there, and that’s gonna take a ton of basis functions to represent. Have a look at the stuff @Bonnevie suggested above, I think.