Counter-intuitive increase in variance with higher-frequency data in a time series model

I have a time series model which I’m testing with some synthetic data (included below).
example_data.csv (604.6 KB)

The model mostly appears to be doing what I expect, and reproduces the means of the known parameters.
However, surprisingly, when I subsample my input data at higher frequencies the variance in my most important parameter (R) increases.
So I presume I’m not actually modelling what I think I’m modelling, because I can’t understand why this would be the case.

This model is designed to work with a non-stationary but cyclic timeseries (C). C is controlled by advective movement in two directions, u and v, with gradients beta_u and beta_v, plus a near-constant negative term R.
The model works on the assumption that over a daily timescale beta_u and beta_v don’t change much, and R changes even more slowly (hence the “filter” index). The model should then be able to estimate the gradients and R simultaneously.
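In discrete form (this is what the Stan code below implements; the 0.01 and 0.001 factors there just rescale the raw parameters), each step is assumed to be roughly:

C[n+1] ~ normal( C[n] + ( beta_u[day] * u_avg[n] * dt + beta_v[day] * v_avg[n] * dt - R[filter] * dt ) / BML[n] , sigma )

with beta_u and beta_v shared within a day and R shared within a filter period.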

My second question: the cumulative value of the R parameter is important to me. In order to correctly reproduce the synthetic data I’ve allowed R_raw to be negative, but I know that this can’t actually be true with real data. How would I go about improving the model to account for this, without biasing the estimate of R towards positive numbers?
i.e. the model should still correctly estimate R = 0.

Example R code:

require(data.table)
require(rstan)
Sys.setenv(tz="UTC")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

x = fread("example_data.csv")
  # example has:
  # R = 50
  # C ~ N(C_true, 0.01)

# subsample the 1 second data to a regular grid:
# x = x[minute(dateTime) %in% c(0, 30)]               # 30 minute sampling
x = x[minute(dateTime) %in% c(0, 10, 20, 30, 40, 50)] # 10 minute sampling
x[, dt := unique(diff(time))] # sampling interval in seconds (assumes a regular grid)

dx = list(
  N = nrow(x),
  day = x$day,
  N_day = max(x$day),
  filter = x$filter,
  N_filter = max(x$filter),
  C = x$C, # measured C
  BML = x$BML,
  u = x$u,
  v = x$v,
  dt = x$dt[1]
)

fit = stan("adv.stan", data=dx, chains=2, iter=2000)

e = rstan::extract(fit)
mean(e$sigma)
mean(e$R*(24*60*60))                             # posterior mean of R, converted to per day
quantile(e$R*(24*60*60), c(0.05, 0.95), names=F) # 90% interval
sd(e$R*(24*60*60))                               # posterior sd of R
mean(x$R*(24*60*60))                             # true value

# with 30 min sampling sd(R) = 0.4
# 10 min sampling sd(R) = 0.7
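For reference, this is roughly how I’m comparing the two frequencies (a sketch using the same example_data.csv and adv.stan as above):

# refit at each sampling frequency and compare the posterior sd of R
fit_at = function(mins) {
  xs = fread("example_data.csv")[minute(dateTime) %in% mins]
  xs[, dt := unique(diff(time))]
  dxs = list(N = nrow(xs), day = xs$day, N_day = max(xs$day),
             filter = xs$filter, N_filter = max(xs$filter),
             C = xs$C, BML = xs$BML, u = xs$u, v = xs$v, dt = xs$dt[1])
  stan("adv.stan", data = dxs, chains = 2, iter = 2000)
}
fit_30 = fit_at(c(0, 30))
fit_10 = fit_at(c(0, 10, 20, 30, 40, 50))
sd(rstan::extract(fit_30)$R * (24*60*60)) # ~0.4
sd(rstan::extract(fit_10)$R * (24*60*60)) # ~0.7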

Stan model adv.stan:

data {
  int<lower=1> N;  // number of samples
  int<lower=1> N_day; // number of days
  int<lower=1> N_filter; // number of filter
  int<lower=1> day[N];
  int<lower=1> filter[N];
  real C[N]; // observations
  vector[N] u; // east-west velocity
  vector[N] v; // north-south velocity
  real dt;
  real BML[N]; // scale
}
transformed data {
  real u_avg[N-1];
  real v_avg[N-1];

  for(i in 1:N-1){
    u_avg[i] = (u[i] + u[i+1]) * 0.5;
    v_avg[i] = (v[i] + v[i+1]) * 0.5;
  }

}

parameters {
  real<lower=0> sigma;
  real beta_u[N_day];
  real beta_v[N_day];
  real R_raw[N_filter];
}

transformed parameters {
  real C_hat[N-1];
  for(i in 1:N-1){
    C_hat[i] = C[i] + ( (u_avg[i] * dt * beta_u[day[i]] * 0.01)
                      + (v_avg[i] * dt * beta_v[day[i]] * 0.01)
                      - (R_raw[filter[i]] * dt * 0.001) ) / BML[i];
  }
}

model {
  sigma ~ cauchy(0, 1);
  beta_u ~ normal(0, 1);
  beta_v ~ normal(0, 1);
  R_raw ~ normal(0, 1);
  // n = 1 is the start time; variables are attached to their start time
  for(n in 2:N){
    C[n] ~ normal(C_hat[n-1], sigma);
  }
}

generated quantities{
  real R[N_filter];
  real cumulative_R[N];
  real adv_u[N];
  real adv_v[N];
  for(d in 1:N_filter){
    R[d] = (R_raw[d] * 0.001);
  }
  cumulative_R[1] = R[filter[1]] * dt;
  adv_u[1] = (u[1] * dt * beta_u[day[1]] * 0.01);
  adv_v[1] = (v[1] * dt * beta_v[day[1]] * 0.01);
  for(n in 2:N){
    cumulative_R[n] = cumulative_R[n-1] + (R[filter[n]] * dt);
    adv_u[n] = (adv_u[n-1] + (u[n] * dt * beta_u[day[n]] * 0.01));
    adv_v[n] = (adv_v[n-1] + (v[n] * dt * beta_v[day[n]] * 0.01));
  }
}

Many thanks

I take it this means more measurements? Usually subsample means a subset of the sample. I’m not a time-series expert, but I’d look at the autocorrelation of the data in each of these cases. If you have additive process noise, then my intuition would match yours. (But as I say, not a time-series expert!)
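For example, something along these lines (a rough sketch, reusing the subsampling from your script) would show how the autocorrelation changes with the sampling frequency:

x_all = fread("example_data.csv")
x_30 = x_all[minute(dateTime) %in% c(0, 30)]
x_10 = x_all[minute(dateTime) %in% c(0, 10, 20, 30, 40, 50)]

# autocorrelation of the observed series over roughly one day of lags
acf(x_30$C, lag.max = 48, main = "C, 30 min sampling")
acf(x_10$C, lag.max = 144, main = "C, 10 min sampling")

# and of the step-to-step changes, which is closer to what the model fits
acf(diff(x_30$C), lag.max = 48, main = "diff(C), 30 min sampling")
acf(diff(x_10$C), lag.max = 144, main = "diff(C), 10 min sampling")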

Constrain it to be positive and give it a prior consistent with zero, e.g.,

parameters {
  real<lower = 0> R;
}

model {
  R ~ normal(0, 5);
}

or whatever’s reasonable to identify the scale of R. A lognormal, for example, and many shapes for gamma, aren’t consistent with zero.

You can use declare-define in generated quantities:

vector[N_filter] R = 0.001 * to_vector(R_raw); // to_vector() because R_raw is declared as an array

I’m pretty sure you can simplify all those generated quantities using cumulative sum and vectorization of the indexes, e.g.,

cumulative_R = cumulative_sum(R[filter] * dt); // with cumulative_R declared as vector[N]

It’s the same synthetic timeseries sampled at different frequencies. The noise is added to the highest

I realise I’ve used some data.table-specific syntax in the R code. Basically, the x = x[minute(dateTime) %in% ...] lines take the 1 second interval example data (which has had ~N(0, 0.01) noise added) and subsample it to the 10 minute marks, or to the hour and half hour (it’s not an average, just the data at that time stamp).
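For reference, the base R equivalent of that subsetting would be something like this (assuming dateTime is a POSIXct column):

keep = as.integer(format(x$dateTime, "%M")) %in% c(0, 10, 20, 30, 40, 50)
x = x[keep, ]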

I take it the model generally looks fine to you? I understand this isn’t particularly your domain.

Thanks for the syntax tips.

I understood the first sentence, but I didn’t follow the second.

Probably because I didn’t finish it.

The timeseries is calculated every second and these values have the noise added; I then subsample to get a 10 minute or 30 minute spaced timeseries.

Constrain it to be positive and give it a prior consistent with zero, e.g.,

parameters {
  real<lower = 0> R;
}

model {
  R ~ normal(0, 5);
}

I tried this, but the resulting distribution for R does not have most of its mass on zero.

Left plot: real<lower=0> R_raw[N_filter]; right plot: real R_raw[N_filter].

Your data is clearly consistent with a non-zero value for the parameter, with or without the constraint.

Generally, the posterior mean is just a point estimate that minimizes expected square error. There’s nothing else special about it. It may also be the maximum likelihood estimate, with or without the constraint. Again, nothing special about that.

If the parameter is positive, the result with lower = 0 is much better—it doesn’t say there’s a 50% chance that the value is negative!

If the value is truly zero, what you should find is that with more data, the probability mass will concentrate around zero (in both models). This is assuming you have a prior and likelihood that’s consistent with the data.

Thanks Bob

So, perhaps as you say my prior and likelihood do not match my data, but if so I don’t understand what I’m doing wrong. Is it the way I’m adding the measurement error?

Below is a simplified model, but at its core it’s the same sort of thing, complete with the data-generating code.
You can see what I’m talking about if you switch between the constrained and unconstrained R.

dat = data.frame(
  time = 0:100,
  M = 3.5, # known drift term (passed to the model as data)
  R = 0    # true value of R
)

# build the noise-free series, then add measurement noise
est = vector(length = nrow(dat))
est[1] = 10
for(i in 2:nrow(dat)){
  est[i] = est[i-1] + dat$M[i] + dat$R[i]
}
dat$est = est
dat$error = rnorm(nrow(dat), 0, 1)
dat$C = dat$est + dat$error # is this a reasonable way to add noise?

m1 = {"
  data {
    int<lower=1> N;
    real C[N];
    real M[N];
  }
  
  parameters {
    real<lower=0> sigma;
    real<upper=0> R;
    // real R;
  }
  
  transformed parameters {
    real C_hat[N-1];
    for(i in 1:N-1){
      C_hat[i] = C[i] + M[i] + R;
    }
  }
  
  model {
    sigma ~ cauchy(0, 1);
    R ~ normal(0, 1);
    for(n in 2:N){
      C[n] ~ normal(C_hat[n-1], sigma);
    }
  }
"

dx = list( N = nrow(dat), M = dat$M, C = dat$C )

fit <- stan(model_code=m1, data=dx, chains=1, iter=2000, control=list(adapt_delta = 0.80))
e = extract(fit)

plot(density(e$R))

Thanks

I was assuming that your prior and likelihood were consistent. Even so, it’s not inconsistent with mass away from zero.

We’re not doing maximum likelihood estimation. The Bayesian posterior mean is the value that minimizes the expected square error in the parameter, not the value that maximizes the (penalized) likelihood.

In other words, I don’t think it’s a problem. And I don’t think you want to use the unconstrained value to get the posterior mean you want. What we care about is interval coverage and what the model says predictively about parameters or future observations. The version with illegal values is going to break down in ways that the positive-constrained one will not.
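For example, a quick way to see that with your simplified m1 (a rough sketch) is to simulate a much longer series with the same generating process and refit:

# with true R = 0, the constrained posterior should pile up near zero as N grows
sim = function(N) {
  est = cumsum(c(10, rep(3.5, N - 1)))
  list(N = N, M = rep(3.5, N), C = est + rnorm(N, 0, 1))
}
fit_long = stan(model_code = m1, data = sim(5000), chains = 1, iter = 2000)
quantile(extract(fit_long)$R, c(0.5, 0.9, 0.99))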

[Edit: P.S. To avoid multiple compilation, you can make the bound a data variable (if it’s positive_infinity(), there’s no upper constraint). E.g.,

data {
  int<lower = 0, upper = 1> bounded;
...
parameters {
  real<upper = (bounded ? 0 : positive_infinity())> alpha;