Using GP for autocorrelation in a time-series model: memory pressure, GP kernel definition

Yeah, what @andre.pfeuffer said is right. To do this you gotta center things and then make sure the scale isn’t too big. This is the annoying thing about the approximation: the GP only cares about distances between points, but the approximation ends up caring about the absolute position.

If I’m remembering correctly, how far the points can be from zero and still be accurately described is determined by a combination of the length scale (larger length scale means you can be farther away) and the number of basis functions (more basis functions means you can be farther away). And then there’s that scale parameter that I have forgotten the details of but is important :D.

I think the easiest way to figure out whether you’re okay is to plot a few covariance matrices, one computed with the approximation and one with the real thing, and compare them. Try various length scales and scales and see what happens. It’s extra work, but that’s the cost of this kind of expansion :P.
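For example, here’s a minimal R sketch of that comparison. It assumes an R port of the approx_L() Hermite expansion that appears in the Stan code later in this thread, and the l, M, and scale values below are just examples to vary:

# rough R port of the Stan approx_L() posted later in this thread
approx_L <- function(M, scale, x, l) {
  N <- length(x)
  epsilon <- 1 / (sqrt(2) * l)
  alpha <- 1 / scale
  beta <- (1 + (2 * epsilon / alpha)^2)^0.25
  delta <- alpha * sqrt((beta^2 - 1) / 2)
  xp <- alpha * beta * x
  fsq <- epsilon^2 / (alpha^2 + delta^2 + epsilon^2)
  f <- sqrt(fsq)
  Ht <- matrix(0, N, M)
  Ht[, 1] <- sqrt(sqrt(alpha^2 / (alpha^2 + delta^2 + epsilon^2))) * sqrt(beta) * exp(-delta^2 * x^2)
  Ht[, 2] <- f * sqrt(2) * xp * Ht[, 1]
  if (M >= 3) {
    for (n in 3:M) {
      Ht[, n] <- f * sqrt(2 / (n - 1)) * xp * Ht[, n - 1] - fsq * sqrt((n - 2) / (n - 1)) * Ht[, n - 2]
    }
  }
  Ht
}

x <- seq(-0.5, 0.5, length.out = 200)            # centered, rescaled inputs
l <- 0.1                                         # length scale (try several)
M <- 20                                          # number of basis functions (try several)
L_M <- approx_L(M, scale = 0.2, x = x, l = l)
K_approx <- L_M %*% t(L_M)                       # low-rank approximation of the covariance
K_exact <- exp(-outer(x, x, "-")^2 / (2 * l^2))  # exact squared-exponential kernel (unit variance)

max(abs(K_exact - K_approx))                     # worst-case pointwise error
par(mfrow = c(1, 2))
image(K_exact, main = "exact")
image(K_approx, main = "approximation")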

So center those numbers and scale them a little.

If your xs are [-1000, 0], then just rescale them to [-0.5, 0.5] or something (like @andre.pfeuffer’s code does), and then just remember your length scale changed units.
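For instance, a minimal sketch of that rescaling (x here is just a stand-in for your time/distance vector):

x <- -1000:0                                                # example inputs far from zero
x_scaled <- (x - (max(x) + min(x)) / 2) / (max(x) - min(x)) # center on the midpoint, scale by the range
range(x_scaled)                                             # now spans -0.5 to 0.5
# a length scale of 50 original units becomes 50 / (max(x) - min(x)) in the rescaled units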

Probably too late to be helpful, but a) I also would have thought an SDE would make more sense than a GP for a long series of time-structured data (though I’m interested in any opinions to the contrary!), and b) the currents look a fair bit like a sunspots example I replicated from elsewhere, with a 2nd-order linear SDE + moving-average measurement model, i.e. a CARMA(2,1). In case it’s of interest:

install.packages("devtools")
library(devtools)
install_github("cdriveraus/ctsem")
library(ctsem)


sunspots <- sunspot.year
sunspots <- sunspots[50:(length(sunspots) - (1988 - 1924))]
id <- 1
time <- 1749:1924
datalong <- cbind(id, time, sunspots)

# setup model
ssmodel <- ctModel(type = 'stanct', n.latent = 2, n.manifest = 1,
  manifestNames = 'sunspots',
  latentNames = c('ss_level', 'ss_velocity'),
  LAMBDA = matrix(c(1, 'ma1'), nrow = 1, ncol = 2),
  DRIFT = matrix(c(0, 'a21', 1, 'a22'), nrow = 2, ncol = 2),
  MANIFESTMEANS = matrix(c('m1'), nrow = 1, ncol = 1),
  CINT = matrix(c(0, 0), nrow = 2, ncol = 1),
  T0VAR = matrix(c(1, 0, 0, 1), nrow = 2, ncol = 2),            # because single subject
  DIFFUSION = matrix(c(0, 0, 0, "diffusion"), ncol = 2, nrow = 2))

ssmodel$pars$indvarying <- FALSE                      # because single subject
ssmodel$pars$offset[14] <- 44                         # because not mean centered
ssmodel$pars[4, c('transform', 'offset')] <- c(1, 0)  # to avoid multimodality

# fit
ssfit <- ctStanFit(datalong, ssmodel, iter = 300, chains = 2)

# output
summary(ssfit)

ctKalman(ssfit, timestep = .02, kalmanvec = c('y', 'ysmooth'), plot = TRUE)
par(mfrow = c(3, 3))
ctStanPostPredict(ssfit, diffsize = c(5, 20), wait = FALSE)

Ok, but we have to check the stationarity of the time series first. Maybe we need to take differences, and what makes us sure that an ARMA(2,1) is appropriate? He mentioned he used a filter. Before applying a filter, one has to detrend. However, the model fits an intercept.

    g ~ multi_normal_cholesky(rep_vector(0, N), L); // model variable intercepts as gaussian process over function Sigma
    for (i in 1:N) {
      mu[i] = a + g[i] + b * wind[i];
    }
    currents ~ normal(mu, sigma_currents);

The graphic in the first post looks to me like it shows mean-centered time series. Another question:
do both time series share the same length-scale and/or variance?
@mike-lawrence already had a good idea, to use the Kronecker trick.
With the GP approximation it’s better to multiply by the covariance matrix directly.

I’d do something like this:

y_hat = approx_L(....) * err * transpose(cholesky_decompose(Sigma));
wind ~ normal(y_hat[,1], sigma_wind);
current ~ normal(y_hat[,2], sigma_current);

And follow @bbbales2’s scaling advice.

Sorry, I wasn’t trying to imply that a CARMA(2,1) was necessarily appropriate! There’s of course a lot more to modelling decisions than 2 minutes reading a post :) I just had the example to hand.

Sorry if I sounded offensive, not intended. I’m not a native speaker.


What are the units on this Distance, by the way? Is this Gaussian process a function of time?

If so then yeah, the other stuff in this thread (what @Charles_Driver and @andre.pfeuffer and @ahartikainen are talking about) might make more sense than doing an expansion.

The issue with time variables is that for the expansion we want the domain of the GP to all be centered around the point where we’re going to do the expansion. Time variables tend to not have a nice center to build this expansion around :D.

Yes, the Gaussian process is a function of time. The time series is a continuous hourly vector of wind speeds fit to a continuous hourly vector of current speeds.

I dug into your Westbrook basketball example a bit more and in looking at the data set for that, I see what you mean about centering around a point. For that example, the “distance” is also essentially time, but relative to the midpoint of the game clock. So you are able to provide a distance from the middle of the game for all of the data points. I tried to feed in a distance from the middle of my own time series, but the “Distance” vector going into the GP approximation looks like (-1449, -1448, -1447, … , 0, 1, 2, 3, … 1499). So it sounds like that isn’t necessarily the wrong approach, but that it needs to be scaled and then the hyperparameters going into the approximation/expansion need to be adjusted as well.

I will play with this: try to get @andre.pfeuffer’s normalization routine to work with my data and then plot the “L” approximation and the full GP covariance matrices with a few different values for the length scale and scale parameters.

By the way, regarding @Charles_Driver’s response: thank you for the suggestions. I really don’t know much at all about SDEs, but I will take a look at the option. I’m not familiar with your package and the ctModel function, but I will check it out. BTW, yes, the currents that are fit in this model are spectrally de-tided to remove tidal frequencies and Butterworth filtered to remove high-frequency noise; while I removed the mean from the series to perform those manipulations, I added the mean back to the series before fitting the model.

Ah, okay, I’d skip the expansion thing then. Looking at the plot in your first post, there’s just a lot of wiggles there, and that’s gonna take a ton of basis functions to represent. Have a look at the stuff @Bonnevie suggested above, I think.

I realize I should be providing more context. I looked at simple autoregressive models initially in this project, but the attraction to the GP came from its flexibility in dealing with the amount of error due to autocorrelation. For this model, under some (physical, real-world, wind) conditions the model is highly autocorrelated (mostly when winds are strong). But when winds are weak, there is less information there to understand currents. At the very least, the length-in-time scale parameter would change between such conditions. That is basically how I ended up here. That said, I don’t really know what I am doing and am humbly open to any suggestions. I have learned a great deal from working through this and especially on this forum, however, so my continued gratitude to all of you.

Ok, I will check that out. FYI, here are some plots showing this model fit using my not-so-great GP approximation (the GP is really only a non-zero model factor in the middle of the series; see the plot from yesterday’s post).

The predictors were, in this case, hourly wind as well as a 14-hour simple moving average of wind. Same response variable as before- hourly depth-averaged alongshore currents.

Model fit to data (raw currents in green, filtered currents in red, model predictions in black/blue):

Model predictions across a simulated range of predictors (EDIT: these are somewhat reasonable; the model tends to undershoot current speeds):

Model residuals (through time, autocorrelation function of residuals, spectrogram of residuals):


Can it be that the current is time-shifted relative to the wind? The wind starts to blow and then it takes a while until it has an effect on the currents?

Is it possible to show a cross-correlation plot between the wind and current time series?
Also autocorrelation plots of both time series?
And a cross-wavelet spectrum plot, to show the time dependencies? There is an R package for this, so no need to understand the math.
And a PACF plot, for identifying the MA part? (A minimal sketch for producing these plots is below.)

My 2 cents: we have to put a little more work into the analysis of the data before suggesting a solution.
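For reference, a minimal R sketch of the requested plots, assuming an hourly, equally spaced data frame dat with columns wind and current (those names are placeholders); the cross-wavelet part uses biwavelet::xwt(), the same function used in the reply below:

library(biwavelet)

lag.max <- 48                                  # two days of hourly lags, as an example
ccf(dat$wind, dat$current, lag.max = lag.max)  # cross-correlation: where is the peak lag?
acf(dat$wind, lag.max = lag.max)               # autocorrelation of each series
acf(dat$current, lag.max = lag.max)
pacf(dat$wind, lag.max = lag.max)              # partial ACF, for the AR/MA order
pacf(dat$current, lag.max = lag.max)

# cross-wavelet spectrum: xwt() takes each series as an n x 2 (time, value) matrix
tt <- seq_len(nrow(dat))
xw <- xwt(cbind(tt, dat$wind), cbind(tt, dat$current))
plot(xw, plot.cb = TRUE, plot.phase = TRUE)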


That’s true.

Cool plots!

Hey @andre.pfeuffer and @bbbales2,

Yes, you are absolutely right: there is a lag before the effect of an increase in wind is apparent in the currents. There is also a point at which the force of the wind stress is balanced by friction such that the currents are no longer accelerating due to the wind (i.e., for the same wind, the currents are no longer responding). The latter is not too great a concern, though, since we don’t often reach that state or stay there for long. My original plan was to use a couple of wind predictors, one showing the most recent wind observations (on an hourly basis) and the other showing the most useful moving average, in order to capture this lag effect. To determine which lags and moving-average types are most useful, I fit the model a bunch of times to single predictors of either simple or exponential moving averages at several different lag values and performed AIC checks. The 14-hour and 24-hour simple moving averages consistently performed best according to AIC, but all of that was using the model specification that is currently in question.
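As an aside, here is a rough R sketch of how such candidate predictors can be built; the sma()/ema() helpers and the wind vector are illustrative stand-ins, not the code actually used:

# wind: hypothetical vector of hourly alongshore wind values
# trailing simple moving average over the previous k hours (leading values are NA until the window fills)
sma <- function(x, k) as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))

# exponential moving average with smoothing factor alpha
ema <- function(x, alpha) as.numeric(stats::filter(alpha * x, 1 - alpha, method = "recursive", init = x[1]))

candidates <- data.frame(
  wind_sma14 = sma(wind, 14),
  wind_sma24 = sma(wind, 24),
  wind_ema   = ema(wind, 2 / (14 + 1))
)
# each candidate then goes into the model one at a time and the fits are compared by AIC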

Yesterday I fit the model using the normalization function you provided to adjust the “distance” vector such that it is centered and standardized to [-0.5, 0.5]. Interestingly, when I performed a test fit on just the first 500 rows of data, it looked pretty good. The GP component of the model was non-zero and the fit was quite good over those first 500 data points (in the approx_L, I set M = 5, so was retaining just 5 basis functions). Here are some plots to illustrate:

Posterior mean of GP component of the model fit to first 500 points:

Model skill on first 500 points

Unfortunately, when I fit to the first 1000 data points, the GP component is a monotonic increasing function of time and the fit is bad. I’m not sure what is going on. Any ideas as to why this would occur? Everything was the same across both runs, except the number of data points I let the model see.

GP component of model fit to 1000 points (M = 5 still):

Model skill:

Anyway, I am hoping that I can get to the bottom of this and manage to get this spectral approximation to work for me. I have been reading up on the Kalman filter approach and it does seem promising, but obviously I would prefer not to have to go back to the drawing board (at least not before a month from now when I am hopefully presenting some of this work). I am also thinking of trying the full GP but fitting with ML or something to avoid the computation limits. So I’d be curious to hear your thoughts on whether the whole model basis here is misguided or if it is just a matter of finding a way around the computational limitations.

Here are the plots that you requested (edit: fixed the order of the plots):

  1. CCF with x = 1 hour average wind stress & y = 1 hour average current speed

  2. ACF of 1 hour wind stress

  3. PACF of 1 hour wind stress

  4. ACF of 1 hour currents

  5. PACF of 1 hour currents

  6. Cross-wavelet spectrum (d1 = wind, d2 = currents; used R’s biwavelet::xwt())

Dear cdibble,

allow us a bunch of questions.
As written in your first post:

I downloaded the data shown in the graphic in your first post; the wind and current data have negative values. How are they defined? What are their units?
How is the hourly average calculated?
(Be aware that if you use a moving average, a dependency is created that depends on the window size and window position.)

So we are going to empirically correlate two time series with no physical model involved. This may be dangerous.
A common alternative is to take a model like the one described here, for example:
http://iopscience.iop.org/article/10.1088/1367-2630/15/5/053024
and enrich it with parts of our GP. Citing:
“It is widely accepted that the PDF of ocean currents follows the Weibull distribution.”
“Discoveries” like this are very beneficial.

Back to our model. I’m against using a filtered input; the GP already provides that filtering. If you want a more wiggly process, you may want to switch to a Matern 3/2 kernel, which has been successfully approximated in a post on this forum.

Your graphs show many correlations; what introduced them? The filtering, the averaging, or the data itself?
Could you provide a sample (not all) of the data set, sampled hourly and equidistantly?
That means taking the values at 9:00, 10:00, 11:00, and so on.


Happy to answer any and all questions.

Both currents and winds are actually just the “alongshore” component of the vector describing speed and direction; alongshore positive means “northish” and alongshore negative means “southish”. The wind data, given in meters per second, are processed into hourly averages before I see them, but the sensor actually provides 5-second ensembles of 100 samples (data which I might be able to track down with some effort; Wind Speed and Direction | Bodega Ocean Observing Node). I rotate the wind to the angle of the regional coastline (chosen to be 320 degrees) and then decompose it into its orthogonal components. The alongshore component is of particular interest because it drives the biologically important process of coastal upwelling, but also because it drives alongshore currents, which dominate in coastal systems because cross-shore momentum is balanced by the land.

Which brings me to currents. Those data I have collected using acoustic Doppler current profilers (ADCPs) deployed on the seafloor, which measure, in my case, 2-minute ensembles in 1-meter bins from the seafloor to the surface. The data are also in meters per second. I processed the data, removing outliers with a standard-deviation filter and using linear interpolation to fill in those values. I then detided the data using a function, oce::tidem(), that fits regression models to the periodic constituents at tidal frequencies. I also Butterworth-filtered the data forwards and backwards to avoid a phase shift. Here is my little detiding and Butterworth filtering (bf) function:

detide <- function(df, t = "Time", x = "Signal", bf = FALSE, f.cut = 0.0015, f.type = "low", bf.order = 6){
  tidal.signal <- oce::tidem(t = df[[t]], x = df[[x]])
  df[[paste0(x, '.dt')]] <- df[[x]] - predict(tidal.signal)
  cat("Mean of signal:", mean(df[[x]]), "\n Mean of tidal signal: ", mean(predict(tidal.signal)))
  if(bf == TRUE){
    bf.filt <- signal::butter(bf.order, f.cut, type = f.type)
    ser.demean <- df[[paste0(x, '.dt')]] - mean(df[[paste0(x, '.dt')]])
    df[[paste0(x, '.dtbf')]] <- signal::filtfilt(bf.filt,
      stats::spec.taper(ser.demean, p = 0.05))
  }
  return(df)
}

Finally, I grouped my 2-minute ensembles and averaged all the depth bins to get depth-averaged currents. Oh, and somewhere in there I rotated the current data and decomposed it into alongshore and cross-shore components, just like the wind data. Annnnnnddddd then I grouped it by hours and computed hourly means to match the wind data and to help reduce the size of the dataset for model fitting.
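For concreteness, here is one way that rotate-and-decompose step could look in R, assuming eastward/northward velocity components u and v and a coastline bearing measured clockwise from north; the names and sign conventions are assumptions, not the actual code used:

# u, v: hypothetical eastward and northward velocity components (m/s)
rotate_coast <- function(u, v, coast_deg = 320) {
  theta <- coast_deg * pi / 180
  data.frame(
    alongshore = u * sin(theta) + v * cos(theta),  # component along the coastline bearing
    crossshore = u * cos(theta) - v * sin(theta)   # component 90 degrees clockwise of it
  )
}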

All of that said, I agree with you that putting filtered data into this model is suspect. I see no real defense of the low-pass Butterworth filter at all, but leaving the tidal currents in may require a modification of the model. All of that filtering was done in the first place to help me understand the main process of interest, which is wind-accelerated currents driving a coastal jet past a headland, causing an eddy to form in the lee of the headland. This whole Stan GP GLM is aimed at finding the right framework to model each of those relationships so that ultimately I can hindcast the speed of the coastal jet (which is an alongshore thing insofar as it matters to this headland eddy) and the magnitude of an eddy index using the ubiquitous and very inexpensive wind data.

It is really an aside at the moment, but the eddy index is composed of principal components from a multi-dimensional empirical orthogonal function analysis, which was the original motivation for all of that filtering. The data for that EOF came from four other ADCPs deployed in the bay. At the risk of oversharing, here’s what the spatial components look like for the leading mode (feel free to comment on whether my little map transform and 3D data presentation make any sense at all):
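As an illustration only, EOF-via-SVD can be sketched like this, assuming the ADCP data are arranged as a time-by-site matrix X (that name and layout are assumptions, not the actual analysis):

Xc <- scale(X, center = TRUE, scale = FALSE)    # remove the temporal mean at each site
eof <- svd(Xc)
spatial_modes <- eof$v                          # spatial patterns, one column per mode
pcs <- eof$u %*% diag(eof$d)                    # principal-component time series
var_explained <- eof$d^2 / sum(eof$d^2)         # fraction of variance per mode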

Back to the model at hand. The physical relationship between wind stress and currents is well understood, and you are right that it would probably strengthen this model, since the GP could fill in the effects of friction with the coastline and seafloor and other factors that a simple physical model would not capture (at the moment, I can’t get that link you left to load, but I’d be keen to see that model and I’ll check it out when my internet connection cooperates). So if that improves the model of wind to coastal jet, great. I suspect finding a way to do the same with the forthcoming model of wind to eddy index will be more challenging, though some formulations for vorticity come to mind that might be an OK place to start.

The correlations seen in the graph in the original post are due to the data itself. If you look at some of these other time-series plots that I have posted to demonstrate model results, you can see the raw and filtered data, which correspond well. The filtered data captures changes at frequencies of interest well, and in the OP plot you can see that they are well-correlated with the wind, especially when winds are strong (that is, alongshore negative, since that is the direction of our winds when they are strong in the spring and summer, which is when these data were collected).

Here is a csv file with the first 500 rows of data. There is an hourly timestamp in the first column, detided and filtered alongshore currents in the second column, then 1 hour alongshore wind stress, 1 hour crossshore wind stress, 14 hour simple moving average alongshore wind stress, and 24 hour simple moving average alongshore wind stress.

depth-averaged-coastal-currents_basicWinds.csv (39.7 KB)

I think that this addresses your questions, but please let me know if I’ve missed anything. I’m looking forward to hashing this out and I appreciate all of the help.

Cheers,
Connor


I rewrote your model, but it needs testing; I don’t have the time to do it. Also read the comments inside.
It’s based on the idea of the Kalman filter, where sensors have measurement noise and a regression on the latent state variables defines the relationships in the model.

ADDED: If the length scales of the two time series are different, you may try two different GPs.

functions {
  matrix chol_upper_2by2(vector sigma, real xsi) {
    matrix[2, 2] chol;
    chol[1, 1] = sigma[1];
    chol[2, 1] = 0.0;
    chol[1, 2] = sigma[2] * xsi;
    chol[2, 2] = sqrt(square(sigma[2]) - square(chol[1, 2]));
    return chol;
  }

matrix approx_L(int M, real scale, vector x, real l) {
    int N = rows(x);
    real epsilon = 1 / (sqrt2() * l);
    real alpha = 1 / scale;
    real beta = (1.0 + (2.0 * epsilon / alpha)^2)^0.25;
    real delta = alpha * sqrt((beta^2 - 1.0) / 2.0);
    vector[N] Ht[M];
    matrix[N, M] Htt;
    vector[N] xp = alpha * beta * x;
    real fsq = epsilon^2 / (alpha^2 + delta^2 + epsilon^2);
    real f = sqrt(fsq);
    Ht[1] = sqrt(sqrt(alpha^2 / (alpha^2 + delta^2 + epsilon^2))) * sqrt(beta) * exp(-delta^2 * x .* x);
    Ht[2] = f * sqrt(2.0) * xp .* Ht[1];
    for(n in 3:M) {
      Ht[n] = f * sqrt(2.0 / (n - 1.0)) * xp .* Ht[n - 1] - fsq * sqrt((n - 2.0) / (n - 1.0)) * Ht[n - 2];
    }

    for(n in 1:M) {
      Htt[:, n] = Ht[n];
    }
    return  Htt;
  }
}
data {
  // model parameters
  int<lower=1> N; // number of (time) points in series
  int<lower=0> K; // number of predictors
  matrix[N, K] X; // predictor design matrix
  vector[N] wind; // vector of alongshore wind speeds
  vector[N] current; // vector of Alongshore current speeds

  // GP approximation parameters (modeling error as approx GP)
  real scale;  // set to 0.2
  vector[N] Time;
  int<lower=1> M;  // number of eigenvectors of var-covar matrix to retain to approximate the GP

  //prediction
  int<lower=1> S; // length of wind series to simulate over
  int<lower=0> sK; // number of predictors for simulation
  matrix[S, sK] SIM; // matrix of simulated values
}
parameters {
  matrix[M, 2] g; // g is the variable intercept offset
  vector[K] B; // coefficients of the model (both intercepts and slopes)
  real<lower=0> sigma_wind;  // measurement error of wind
  real<lower=0> sigma_current; // measurement error of currents
  vector<lower=0>[2] sigma_model;  // model error
  real<lower=0> inv_rho_sq;  // GP length scale
  real xsi_raw;  // raw correlation between current and wind
}

transformed parameters {
  real<lower=0> rho_sq = inv(inv_rho_sq);
  real<lower=-1, upper=1> xsi = tanh(xsi_raw * 2.0);
  matrix[2,2] U_Sigma = chol_upper_2by2(sigma_model, xsi);
  // multivariate GP sharing the same time scale and with correlation between
  matrix[N, 2] gp = approx_L(M, scale, Time, inv_rho_sq) * g * U_Sigma;
}

model{
    sigma_model ~ cauchy(0, 2.5);
    sigma_wind ~ exponential(1.0);
    sigma_current ~ exponential(1.0);
    inv_rho_sq ~ cauchy(0,1);
    xsi_raw ~ normal(0, 1);  // correlation between current and wind
    // sigma_sq ~ cauchy(0,1);

    // try without? the intercept may be handled by the GP!
    B[1] ~ normal(0,5); // intercept prior
    B[2:K] ~ normal(0,1); // slope priors
    to_vector(g) ~ normal(0, 1); // GP errors

    //model
    wind ~ normal(gp[,1], sigma_wind);
    // you may try something current[2:N] ~ to reflect time shift in correlation.
    current ~ normal(gp[,2] + X * B, sigma_current);
}

// TBD:
/*

generated quantities{
  vector[S] sim_cur; //simulated response output
  vector[N] mu_pred;
  vector[N] pred_cur;
  vector[N] log_lik;
  vector[N] resids;
    sim_cur = SIM * B; // simulated response across the range of params
    mu_pred = X*B + gp; //+ approx_L(M, scale, Time, inv_rho_sq) * g;
  for(i in 1:N){
    pred_cur[i] = normal_rng(mu_pred[i], sigma_response); // posterior predictions of the model
    log_lik[i] = normal_lpdf(response[i] | mu_pred[i], sigma_response); // compute log likelihood for us in AIC
  }
  resids = response - mu_pred;
}
*/

I think the issue here is that when you fit it to the first 500 points, you’re only fitting something that has like 3 big lumps, so a few basis functions can fit that (I’m assuming here the blue should fit the red in the “Amag.dtbf.m from A1 depth…” plot).

When you go to 1000 points, you have something with like 7 big lumps, and you just don’t have enough basis functions to really fit that. So I’d move away from the basis function approximation, since the curve you’re fitting has lots and lots of lumps and it’s unlikely that specific approximation is going to be useful when you extend to your full data.

I’ve read this a couple times and it looks like you’re fitting a model like:

y \sim \text{normal}(X \beta + g(t), \sigma)

Where g(t) is a Gaussian process.

If this is correct, you might wanna give http://www.r-inla.org/ a try. Check out the Tokyo or Drivers examples here: http://www.r-inla.org/examples/volume-1 . It might take a little digging around to pick up the lingo, but they do all their inferences with optimizations that’ll be faster than trying to sample this.

If you’re sticking with Stan, then the first thing I’d do is just sample this with a fixed lengthscale: precompute the Cholesky decomposition (beware, R’s chol returns an upper triangle) and pass it in as data. Use the non-centered parameterization, where you define a vector z ~ normal(0, 1) and then multiply it by the Cholesky factor and a standard deviation to get your actual mean-zero GP.

If your Gaussian process has at most 4000 time points, then the covariance matrix is gonna be (4000 * 4000) doubles * 8 (bytes/double) / (1024 (bytes/kilobyte) * 1024 (kilobytes/megabyte)) ≈ 122 megabytes.
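The same arithmetic as a quick R check:

n <- 4000
n^2 * 8 / 1024^2   # megabytes for one dense n-by-n matrix of doubles: ~122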

Which will happily fit in RAM if you have 16 gigs. The issue is saving the output. Things defined as parameters or transformed parameters will be saved in memory during the course of an RStan run. So 2000 draws * 0.122 gigabytes is far more than 16 gigabytes, and everything suffers. What you should do in this case is build the covariance matrix in places that don’t generate output, like the model block, or you can fence off calculations in transformed parameters if you need:

transformed parameters {
  real this_variable_will_get_saved;

  {
    real this_variable_will_not_get_saved;
  }
}

If the fixed lengthscale fits work, then there are a couple things you could do to repeat this without a fixed lengthscale, but let’s see if that works first (from the looks of your data, you can probably get a pretty good guess at the lengthscale).

Really cool post though! I enjoyed reading it. Tnx for the plots.


Thank you for this! I’ve given it a go. I think that ultimately it is still going to be too computationally expensive; like Ben mentioned, the number of basis functions needed is too high. Here is the GP component with a fit to the first 1000 (of 3000 total) rows of data; in the first plot, M = 2 and in the second plot M = 5. For N = 1000 and M = 5, it took about 100 minutes to fit.
M = 2

M = 5

The values for the linear model parameters are reasonable, though the slope is surprisingly small.

I was wondering about a couple of things for this model.

I’m not clear on what happens with the wind ~ normal() bit here. Are we just looking at how a GP imputes wind values?

Also, for this Time data, I used the same normalized time vector as before (though I’ve been calling it Distance). I assume this is still right and necessary since we are still using approx_L()?

Let me know what you think. I’ll see if I can put together some generated quantities to take a look at the fit, but I don’t think this is going to solve the original problem of computational limits.

Cheers,

Connor

Hey Ben,

Thanks for the advice. The INLA stuff does look like it might be a good lifeline and I am going to mess with it. But I’ve been trying to get a faster result by pre-computing the Cholesky like you suggested and I wanted to see what you think. Interestingly, the GP has taken over the model and pretty much just reproduces the data. The linear model parameters come out near zero and the predicted relationship between wind and currents isn’t quite right. The good news is that the residuals are gone.

Here’s the posterior mean of ‘g’, the GP component of the linear model, fit to just the first 500 rows:

You can see it reproduces the data pretty nicely. Here’s the residual plot:

And yet the simulations have it all backwards (this should be a positive relationship):

So maybe I did something wrong with this model, or maybe the GP just dominates. I would think that we could get more information out of the wind than this, though. I’ll put in my model below. I also tried the non-centered parameterization version, which I’ll paste at the bottom here, but I haven’t been able to fit that yet.

Let me know if you have any ideas about why the GP is vastly dominating now (maybe my choices for rho_sq and eta_sq?). It won’t be super fast, but this might just work if I can make sense of these results (and correct the model).

Model presented above:

data {
  int<lower=1> N; // number of (time) points in series
  matrix[N,N] L;

  int<lower=0> K; // number of predictors
  matrix[N, K] X; // predictor design matrix
  vector[N] response; // vector of Alongshore current speeds

//prediction
  int<lower=1> S; // length of wind series to simulate over
  int<lower=0> sK; // number of predictors for simulation
  matrix[S, sK] SIM; // matrix of simulated values

}
parameters {
  vector[N] g; // g is the variable intercept offset
  vector[K] B; // coefficients of the model (both intercepts and slopes)
  real<lower=0> sigma_response;
  // real<lower=0> eta_sq;
  real<lower=0> inv_rho_sq;
  // real<lower=0> sigma_sq;
}
transformed parameters {
//   // matrix[N,N] Sigma;
//   matrix[N,N] L;
//   // off-diagonal elements of covariance matrix
//   for (i in 1:(N-1)){
//       for (j in (i+1):N){
//           Sigma[i,j] = cov_exp_quad(time, alpha, length_scale); // eta_sq * exp(-rho_sq * pow(Dmat[i,j],2));
//           Sigma[j,i] = Sigma[i,j];
//         }
//     }
//     // diagonal elements of covariance matrix
//     for(k in 1:N) {
//       Sigma[k,k] = eta_sq + sigma_sq; // adding jigger to diag elements
//     }
//   L = eta_sq * cholesky_decompose(Sigma);
}
model {
    vector[N] mu;
    //priors
    // eta_sq ~ cauchy(0,1);
    // inv_rho_sq ~ cauchy(5,2);
    // sigma_sq ~ cauchy(0,1);
    B[1] ~ normal(0,1); // intercept prior
    B[2:K] ~ normal(0,1); // slope priors
    g ~ normal(0, 1);
    // a ~ normal(0, 20);
    // b ~ normal(0, 1);
    sigma_response ~ normal(0, 1); // overall model sigma, but could use multilevel model so that sigma varies according to wind
    //model
    g ~ multi_normal_cholesky(rep_vector(0, N), L); // model variable intercepts as gaussian process over function Sigma
    mu = X * B + g ;
    response ~ normal(mu, sigma_response);
}
generated quantities{
  vector[S] sim_cur; //simulated response output
  vector[N] mu_pred;
  vector[N] pred_cur;
  vector[N] log_lik;
  vector[N] resids;
  // real log10error; // from Ben Bales on Stan forum- using this as a check of the GP approximation in approx_L()
  // {
  //   matrix[P, M] L = approx_L(M, scale, xp, sigma, l);
    
  //   log10error = log10(max(fabs(cov_exp_quad(xp, sigma, l) - L * L')));
  // }
    sim_cur = SIM * B; // simulated response across the range of params
    mu_pred = X*B + g; //+ approx_L(M, scale, Distance, inv_rho_sq) * g;
  for(i in 1:N){
    pred_cur[i] = normal_rng(mu_pred[i], sigma_response); // posterior predictions of the model
    log_lik[i] = normal_lpdf(response[i] | mu_pred[i], sigma_response); // compute log likelihood for use in AIC
  }  
  resids = response - mu_pred;
}

R code to pre-compute L (thanks for the heads up about the upper triangle from chol()).

N <- stan_data_i$N
Dmat <- stan_data_i$Distance # this is the distance matrix for each step in time
rho_sq <- 0.06 # rnorm(1, mean = 0, sd = 1) # length scale
curve(exp(-rho_sq * x^2), from = 0, to = 100, lty = 2,
      xlab = "distance", ylab = "correlation")
sigma_sq <- 1e-6 # jitter added to the diagonal
eta_sq <- 0.9 # maximum covariance between any two points

Sigma <- matrix(NA, ncol = N, nrow = N)
for (i in 1:(N - 1)) {
  for (j in (i + 1):N) {
    Sigma[i, j] <- eta_sq * exp(-rho_sq * Dmat[i, j]^2)
    Sigma[j, i] <- Sigma[i, j]
  }
}
for (k in 1:N) {
  Sigma[k, k] <- eta_sq + sigma_sq # adding noise to diag elements
}
L <- t(chol(Sigma)) # transpose to get the lower triangle; Sigma already includes eta_sq

And here’s my attempt at the non-centered parameterization:

data {
  int<lower=1> N; // number of (time) points in series
  matrix[N,N] L;

  int<lower=0> K; // number of predictors
  matrix[N, K] X; // predictor design matrix
  vector[N] response; // vector of Alongshore current speeds

//prediction
  int<lower=1> S; // length of wind series to simulate over
  int<lower=0> sK; // number of predictors for simulation
  matrix[S, sK] SIM; // matrix of simulated values

}
parameters {
  vector[N] g; // g is the variable intercept offset
  vector[K] B; // coefficients of the model (both intercepts and slopes)
  vector[N] z;
  real<lower=0> sigma_response;
  real<lower=0> sigma_gp;
  // real<lower=0> eta_sq;
  real<lower=0> inv_rho_sq;
  // real<lower=0> sigma_sq;
}
transformed parameters {
  vector[N] f = L * z; // non-centered parameterization;
//   // matrix[N,N] Sigma;
//   matrix[N,N] L;
//   // off-diagonal elements of covariance matrix
//   for (i in 1:(N-1)){
//       for (j in (i+1):N){
//           Sigma[i,j] = cov_exp_quad(time, alpha, length_scale); // eta_sq * exp(-rho_sq * pow(Dmat[i,j],2));
//           Sigma[j,i] = Sigma[i,j];
//         }
//     }
//     // diagonal elements of covariance matrix
//     for(k in 1:N) {
//       Sigma[k,k] = eta_sq + sigma_sq; // adding jigger to diag elements
//     }
//   L = eta_sq * cholesky_decompose(Sigma);
}
model {
    vector[N] mu;
    //priors
    // eta_sq ~ cauchy(0,1);
    // inv_rho_sq ~ cauchy(5,2);
    // sigma_sq ~ cauchy(0,1);
    B[1] ~ normal(0,1); // intercept prior
    B[2:K] ~ normal(0,1); // slope priors
    g ~ normal(0, 1);
    z ~ normal(0, 1);
    // a ~ normal(0, 20);
    // b ~ normal(0, 1);
    sigma_response ~ normal(0, 1); // overall model sigma, but could use multilevel model so that sigma varies according to wind
    sigma_gp ~ normal(0, 1);    
    //model
    g ~ normal(f, sigma_gp);//multi_normal_cholesky(rep_vector(0, N), L); // model variable intercepts as gaussian process over function Sigma
    mu = X * B + g ;
    response ~ normal(mu, sigma_response);
}
generated quantities{
  vector[S] sim_cur; //simulated response output
  vector[N] mu_pred;
  vector[N] pred_cur;
  vector[N] log_lik;
  vector[N] resids;
  // real log10error; // from Ben Bales on Stan forum- using this as a check of the GP approximation in approx_L()
  // {
  //   matrix[P, M] L = approx_L(M, scale, xp, sigma, l);
    
  //   log10error = log10(max(fabs(cov_exp_quad(xp, sigma, l) - L * L')));
  // }
    sim_cur = SIM * B; // simulated response across the range of params
    mu_pred = X*B + g; //+ approx_L(M, scale, Distance, inv_rho_sq) * g;
  for(i in 1:N){
    pred_cur[i] = normal_rng(mu_pred[i], sigma_response); // posterior predictions of the model
    log_lik[i] = normal_lpdf(response[i] | mu_pred[i], sigma_response); // compute log likelihood for use in AIC
  }  
  resids = response - mu_pred;
}

Both wind and current are measurements and have measurement error; sigma_wind and sigma_current are their standard deviations. If you don’t model the wind (or, vice versa, the current), then you take the measurement for granted as having no noise. The measurement noise is the noise of each observation.
Some of the Kalman filter literature may explain this better than I’m able to.

Renamed it for clarity, not to be confused with the distance (matrix).

That is ok, the S in Stan does not stand for fast. ;-)
Once you know your length-scale parameter, you may keep it fixed, as Ben suggested, and move it to the transformed data section.