Counter-intuitive increase in variance with higher-frequency input data in a time-series model


#1

I have a time-series model that I'm testing with some synthetic data (attached).
example_data.csv (604.6 KB)

The model mostly appears to do what I expect, and recovers the means of the known parameters.
Surprisingly, however, when I subsample my input data at higher frequencies, the variance of my most important parameter ( R ) increases.
So I presume I'm not actually modelling what I think I'm modelling, because I can't see why this would be the case.

The model is designed to work with a non-stationary but cyclic time series ( C ). C is driven by advective movement in two directions, u and v, with gradients beta_u and beta_v, plus a near-constant negative term R.
The model assumes that beta_u and beta_v change little over a daily timescale, and that R changes even more slowly (hence the "filter" index). The model should then be able to estimate the gradients and R simultaneously.
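Written out, the discrete dynamics I think I'm encoding are roughly (a sketch, with the non-centering scale constants omitted; $\Delta t$ is the sampling interval and $\bar{u}_t$, $\bar{v}_t$ are the midpoint-averaged velocities):

$$
C_{t+1} = C_t + \frac{\beta_{u,\mathrm{day}(t)}\,\bar{u}_t\,\Delta t + \beta_{v,\mathrm{day}(t)}\,\bar{v}_t\,\Delta t - R_{\mathrm{filter}(t)}\,\Delta t}{BML_t} + \varepsilon_t,
\qquad \varepsilon_t \sim \mathcal{N}(0, \sigma)
$$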

My second question: the cumulative value of R is important to me. To correctly reproduce the synthetic data I've allowed R_raw to be negative, but I know this can't actually be true with real data. How would I go about constraining the model to non-negative R without biasing the estimate of R towards positive values?
i.e. the model should still correctly estimate R = 0.
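For reference, the naive constraint I would write is something like this (untested sketch; the lower bound turns the normal(0, 1) prior into a half-normal):

```stan
parameters {
  // lower bound keeps R >= 0; the prior below becomes half-normal,
  // which still puts substantial mass near zero
  real<lower=0> R_raw[N_filter];
}
model {
  R_raw ~ normal(0, 1);
}
```

but I'm unsure whether the implied half-normal prior introduces exactly the kind of bias near R = 0 that I'm worried about.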

Example R code:

require(data.table)
require(rstan)
Sys.setenv(tz="UTC")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

x = fread("example_data.csv")
  # example has:
  # R = 50
  # C ~N(C_true, 0.01)

# x = x[minute(dateTime) %in% c(0, 30)]
x = x[minute(dateTime) %in% c(0, 10, 20, 30, 40, 50)]
stopifnot(uniqueN(diff(x$time)) == 1) # sampling interval must be constant
x[, dt := diff(time)[1]]

dx = list(
  N = nrow(x),
  day = x$day,
  N_day = max(x$day),
  filter = x$filter,
  N_filter = max(x$filter),
  C = x$C, # measured C
  BML = x$BML,
  u = x$u,
  v = x$v,
  dt = x$dt[1]
)

fit = stan("adv.stan", data=dx, chains=2, iter=2000)

e = rstan::extract(fit)
mean(e$sigma)
mean(e$R*(24*60*60))     # convert R from per-second to per-day
quantile(e$R*(24*60*60), c(0.05, 0.95), names=F)
sd(e$R*(24*60*60))
mean(x$R*(24*60*60))     # true value from the synthetic data

# with 30 min sampling: sd(R) = 0.4
# with 10 min sampling: sd(R) = 0.7

Stan model adv.stan:

data {
  int<lower=1> N;  // number of samples
  int<lower=1> N_day; // number of days
  int<lower=1> N_filter; // number of filter periods
  int<lower=1> day[N];
  int<lower=1> filter[N];
  real C[N]; // observations
  vector[N] u; // east-west velocity
  vector[N] v; // north-south velocity
  real dt;
  real BML[N]; // scale
}
transformed data {
  real u_avg[N-1];
  real v_avg[N-1];

  for(i in 1:N-1){
    u_avg[i] = (u[i] + u[i+1]) * 0.5;
    v_avg[i] = (v[i] + v[i+1]) * 0.5;
  }

}

parameters {
  real<lower=0> sigma;
  real beta_u[N_day];
  real beta_v[N_day];
  real R_raw[N_filter];
}

transformed parameters {
  real C_hat[N-1];
  for(i in 1:N-1){
    C_hat[i] = C[i] + ((u_avg[i] * dt * beta_u[day[i]] * 0.01)
                       + (v_avg[i] * dt * beta_v[day[i]] * 0.01)
                       - (R_raw[filter[i]] * dt * 0.001)) / BML[i];
  }
}

model {
  sigma ~ cauchy(0, 1);
  beta_u ~ normal(0, 1);
  beta_v ~ normal(0, 1);
  R_raw ~ normal(0, 1);
  // n = 1 is the start time; variables are attached to their start time
  for (n in 2:N) {
    C[n] ~ normal(C_hat[n-1], sigma);
  }
}

generated quantities{
  real R[N_filter];
  real cumulative_R[N];
  real adv_u[N];
  real adv_v[N];
  for(d in 1:N_filter){
    R[d] = (R_raw[d] * 0.001);
  }
  cumulative_R[1] = R[filter[1]] * dt;
  adv_u[1] = (u[1] * dt * beta_u[day[1]] * 0.01);
  adv_v[1] = (v[1] * dt * beta_v[day[1]] * 0.01);
  for(n in 2:N){
    cumulative_R[n] = cumulative_R[n-1] + (R[filter[n]] * dt);
    adv_u[n] = (adv_u[n-1] + (u[n] * dt * beta_u[day[n]] * 0.01));
    adv_v[n] = (adv_v[n-1] + (v[n] * dt * beta_v[day[n]] * 0.01));
  }
}

Many thanks