Hi All,
I am attempting to model the response of ocean currents to wind forcing as part of a study of coastal oceanographic processes around headlands. My hope is to find the right framework for dealing with these highly autocorrelated time series, so that I can extend it to many individual relationships between forcing mechanisms (wind, tides, etc.) and responses (current speed, vorticity, etc.).
My currents are measured at two-minute intervals, but I've been working with hourly averages because the 2-minute data set is so large. Ideally I would use the full data set, but I can't even get a model to run with hourly data, so let's start there.
I have about 2000 observations of currents and wind. I have been trying to fit the model to current data that I have harmonically detided and low-pass filtered, though I could use the raw data as well (and can supply it here if requested). Here is what the series looks like (currents in black, wind in red).
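In case the preprocessing matters for the modeling choice, it looks roughly like the sketch below (column names are hypothetical, and I've omitted the harmonic detiding, which I do with a separate tidal analysis; this shows just the hourly averaging and a simple moving-average stand-in for the low-pass filter):

```
# sketch only: 'raw' is a data frame of 2-minute observations with
# hypothetical columns 'time' (POSIXct) and 'currents'
raw$hour <- as.POSIXct(trunc(raw$time, units = "hours"))

# hourly averages of the 2-minute current speeds
hourly <- aggregate(currents ~ hour, data = raw, FUN = mean)

# centered moving average as a simple low-pass (window width illustrative);
# the detiding happens before this step in my real pipeline
k <- 33
hourly$currents_lp <- as.numeric(stats::filter(hourly$currents, rep(1/k, k), sides = 2))
```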
My problem is that I am using a ton of RAM to handle the GP covariance matrix in the model. I can fit a subset of the data on my laptop (R 3.4.3 (2017-11-30), x86_64-apple-darwin15.6.0 (64-bit), macOS Sierra 10.12.3), but it takes hours to fit even 1/3 of the data set, and afterwards I can't do anything with the fit without R crashing (I can only inspect model output when fitting just the first 200 rows). I have a lab computer with a better processor but the same amount of RAM (16 GB), running 64-bit Windows 10, that won't even run the sampling (it returns “Error: cannot allocate vector of size [N] mb”).
I apologize for my ignorance, but I am not sure whether I can improve performance with my current model specification or whether I need to change the way I am specifying the model entirely. I am using a full matrix of the “distances” in hours between all pairs of data points, and this appears to be the choke point in the model. Is there a better way to define the Gaussian process kernel function for this purpose? If not, is there some way to improve performance in how I have specified the Stan model? Or should I drop the Gaussian process idea altogether and model the relationship between two autocorrelated time series some other way? (I sketch one reparameterization idea after my model code below, in case that helps frame the question.) Any help or references would be greatly appreciated. Stan code below and data attached. Thanks in advance!
-Connor Dibble
Here's what I am passing as Stan data. The wind series has been centered and standardized.
```
> str(stan_data)
List of 7
$ N : int 1920
$ time : num [1:1920] 1.46e+09 1.46e+09 1.46e+09 1.46e+09 1.46e+09 ...
$ currents: num [1:1920] 369 373 377 382 386 ...
$ wind : num [1:1920] -0.2113 -0.0332 0.0432 0.0355 0.271 ...
$ Dmat : num [1:1920, 1:1920] 0 1 2 3 4 5 6 7 8 9 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:1920] "1" "2" "3" "4" ...
.. ..$ : chr [1:1920] "1" "2" "3" "4" ...
$ sim_wind: num [1:100] -25 -24.6 -24.3 -23.9 -23.6 ...
$ S : num 100
```

And here is the sampling call:

```
model.fit <- rstan::sampling(object = model, data = stan_data, chains = 1,
                             iter = 5000, warmup = 1000, refresh = 250)
```
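For reference, Dmat is just the pairwise time differences in hours; I build it along these lines, and I've also wondered whether telling rstan not to save the transformed-parameter matrices would help with the post-fit crashes (a sketch; I believe `pars`/`include` are the right arguments but haven't confirmed it helps here):

```
# pairwise "distances" in hours between observation times (time is in seconds)
Dmat <- as.matrix(dist(stan_data$time / 3600))
# N = 1920 -> 1920^2 * 8 bytes, about 28 MB per matrix; at the full 2-minute
# resolution (N ~ 57,600) the same matrix would be roughly 25 GB

# possibly: exclude Sigma and L from what rstan stores after sampling
model.fit <- rstan::sampling(object = model, data = stan_data, chains = 1,
                             iter = 5000, warmup = 1000, refresh = 250,
                             pars = c("Sigma", "L"), include = FALSE)
```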
And here is my Stan model:
```
data {
  int<lower=1> N;          // number of (time) points in series
  vector[N] currents;      // vector of alongshore current speeds
  vector[N] wind;          // vector of wind speeds
  matrix[N,N] Dmat;        // distance matrix in hours (seconds / 3600)
  // prediction
  int<lower=1> S;          // length of wind series to simulate over
  vector[S] sim_wind;      // vector of wind values to simulate over
}
parameters {
  vector[N] g;             // variable intercept offset, modeled as a GP
  real a;                  // overall model intercept
  real b;                  // slope describing relationship to wind predictor
  real<lower=0> sigma_currents;
  real<lower=0> eta_sq;
  real<lower=0> inv_rho_sq;
  real<lower=0> sigma_sq;
}
transformed parameters {
  matrix[N,N] Sigma;
  matrix[N,N] L;
  real<lower=0> rho_sq;
  rho_sq = inv(inv_rho_sq);
  // off-diagonal elements of covariance matrix
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] = eta_sq * exp(-rho_sq * pow(Dmat[i,j], 2));
      Sigma[j,i] = Sigma[i,j];
    }
  }
  // diagonal elements: marginal variance plus nugget (noise) variance
  for (k in 1:N) {
    Sigma[k,k] = eta_sq + sigma_sq;
  }
  L = cholesky_decompose(Sigma);
}
model {
  vector[N] mu;
  // priors
  eta_sq ~ cauchy(0, 1);
  inv_rho_sq ~ cauchy(0, 1);
  sigma_sq ~ cauchy(0, 1);
  a ~ normal(0, 20);
  b ~ normal(0, 1);
  sigma_currents ~ normal(0, 200);  // overall model sigma
  // variable intercepts modeled as a Gaussian process with covariance Sigma
  g ~ multi_normal_cholesky(rep_vector(0, N), L);
  for (i in 1:N) {
    mu[i] = a + g[i] + b * wind[i];
  }
  currents ~ normal(mu, sigma_currents);
}
generated quantities {
  vector[S] sim_cur;  // mean current response to wind alone (GP offset not included)
  sim_cur = a + b * sim_wind;
}
```
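In case it helps show where my head is at: one direction I've been wondering about (but haven't gotten working, so please treat it as a sketch) is to pass the raw times instead of Dmat, use the built-in cov_exp_quad, build the covariance as a local variable inside the model block (transformed parameters get saved for every iteration by default, so right now two N x N matrices are being stored each draw, which I suspect is part of my post-fit crashes), and switch to a non-centered parameterization of g. Note that cov_exp_quad's length-scale rho is parameterized differently from my rho_sq above:

```
data {
  int<lower=1> N;
  real time_hr[N];          // observation times in hours (no N x N matrix passed in)
  vector[N] currents;
  vector[N] wind;
}
parameters {
  real a;
  real b;
  real<lower=0> sigma_currents;
  real<lower=0> eta;        // GP marginal SD (eta^2 plays the role of eta_sq)
  real<lower=0> rho;        // GP length-scale in hours
  vector[N] z;              // standard-normal draws for the non-centered GP
}
model {
  // covariance and Cholesky factor are local variables, so they are never stored
  matrix[N, N] L_K;
  {
    matrix[N, N] K = cov_exp_quad(time_hr, eta, rho);
    for (n in 1:N)
      K[n, n] = K[n, n] + 1e-9;   // tiny jitter for numerical stability only
    L_K = cholesky_decompose(K);
  }
  eta ~ normal(0, 1);
  rho ~ inv_gamma(5, 5);
  a ~ normal(0, 20);
  b ~ normal(0, 1);
  sigma_currents ~ normal(0, 200);
  z ~ normal(0, 1);
  // non-centered GP: g = L_K * z replaces the explicit multi_normal_cholesky
  currents ~ normal(a + L_K * z + b * wind, sigma_currents);
}
```

I dropped sigma_sq here since sigma_currents already plays the noise role, so the 1e-9 is purely numerical. If even this won't scale, that would be useful to know too.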
m1_stan_data.R (475.9 KB)