Simple measurement error model - unexpected output - misspecified?


#1

I have a simple time series model.

C_{t+1} = C_t - (R / h_t)

C and h are observed, with a known error on the scaling term h.

Which I’ve modelled as

C_{t+1} = C_t + (R / h_t)
\hat{C} \sim N(C, \sigma)
\hat{h} \sim N(h, \theta)
R \sim N(0, 10)
\sigma \sim N(0, 1)

The model converges just fine…
But I’m surprised by the posterior distributions for R.
It appears to be independent of \theta, which given that h scales R this is not what I would expect.
The point being in the real world h Is poorly known, so I want my estimate for R to reflect that.
If I feed in a “wrong” value for h, but still well within \sim N(h, \theta) my estimate for R is pretty certain around the wrong R value, not the desired behaviour.
I suspect I’m not specifying my error model correctly, but I can’t see where I’ve gone wrong.

  data {
    int<lower=1> N; 
    vector[N] C_hat;
    vector[N] h_hat;
    real theta;
  }
  
  parameters {
    real<lower=0> sigma;
    real R;
    vector<lower=1>[N] h;
    real init;
  }
  
  transformed parameters {
    real C[N];
    C[1] = C_hat[1] + init;
    for(i in 1:(N-1)){
      C[i+1] = C[i] - (R / h[i]);
    }
  }
  
  model {
    sigma ~ normal(0, 1);
    R ~ normal(0, 10);
    init ~ normal(0, 1);
    h_hat ~ normal(h, theta);
    C_hat ~ normal(C, sigma);
  }

R code to test below

h = 20
R = 0.2
C = rep(300, 1000)
for(i in 2:1000){
  C[i] = C[i-1] - (R / h)
}
C = C + rnorm(1000, 0, 0.5)
plot(C, type="l")

theta=2
x = stan(model = "test.stan",
         data = list(N = 1000, C_hat = C, h_hat = rep(20, 1000), theta = theta),
         chains = 1)
x = extract(x)
hist(x$init)
C[1] - 300

hist(x$h)
sd(x$h) # theta = 2

hist(x$R)
hist((0.2 * 20) / rnorm(2000, 20, theta)) # expected minimum distribution of R.
sd(x$R)
sd((0.2 * 20) / rnorm(2000, 20, theta)) # order of magnitude larger than stan estimate


#2

At first glance your model seems OK. And I don’t really find the behavior you describe that surprising - you essentially have 1000 observations of R and the error caused by \theta can easily cancel out on such a big dataset. I.e. you can easily have a very precise estimate of R even when your information about h is very noisy.

Could you elaborate on why you think this should not be the case?


#3

I don’t think It should cancel out.

In this example, with h = 20, the 0.5-95% on R is 0.193 to 0.198, If I set h = 21, my bounds for R don’t even span the previous range (0.213 to 0.218). and given I know the true value of R for this data (0.2), this is not a good estimate.

whereas If I rebuild the model with h just being a single real value over the whole timeseries R looks much more reasonable, and increasing theta results in increasing uncertainty in R.

The reason why I don’t want to parametrize h as a single value is that I know it can change, and I know if it’s increasing or decreasing, but I’m uncertain of the true value.


#4

I think there might be a problem caused by the way you simulate data (assuming you are still using the code in the first post). Note that your simulation process is different than what the model assumes - h_hat is not variable (basically theta is 0, but 2 is given to the model), init is 300, which is way out of the normal(0,1) prior…
It is hard to guess how this affects the model, but I would not be surprised if those differences explained the discrepancy you have observed.

I would suggest you test the model with data simulated exactly as in the model - draw sigma, R, init from their priors, choose h and theta a and then sample h_hat, compute C and sample C_hat. My best bet is that the parameters will be recovered correctly once you do that.


#5

Thanks for the quick reply.

The problem there is that my h could all be the same. my observations for h are not “noisy”, they are just known to be inaccurate. I may have 20, but it could be 19.
This is why I started with this case where h isn’t moving.

hmm, perhaps I should model it more like
h[i] * scale_error
where
scale_error ~ normal(1, 0.05)

init isn’t 300, it’s 0.6 with the above seed,
init is the error in the first observation (which I know is 300 in this case, so I’m just verifying that stan has correctly determined that)