Another measurement error modeling question

I have come across an interesting dilemma when comparing an “intuitive” likelihood formulation for an errors-in-variables model with the standard Bayesian measurement error model recommended in the Stan docs (one that I’ve used with good success in the past). Consider the following (assuming \sigma_m is known):

x_{true,i} = x_{obs,i} + \delta_i
\delta_i \sim \text{Normal}(0,\sigma_m)
y_i=a+b*x_{true,i}+\epsilon_i
\epsilon_i\sim\text{Normal}(0,\sigma)
So, substituting:
y_i = a+b(x_{obs,i}+\delta_i)+\epsilon_i
y_i = a+b*x_{obs,i}+b*\delta_i+\epsilon_i
b*\delta_i+\epsilon_i\sim\text{Normal}(0,\sqrt{(b\sigma_m)^2+\sigma^2})
Therefore conclude y_i \sim \text{Normal}(a+b*x_{obs,i},\sqrt{(b\sigma_m)^2+\sigma^2})

Leading to the following Stan code:

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x_obs;
  real sigma_m;
}

parameters {
  real a;
  real b;
  real<lower=0> sigma;
}

model {
  a ~ normal(0,5);
  b ~ normal(0,5);
  sigma ~ cauchy(0,5);
  // residual sd inflated by the measurement error scaled by the slope
  y ~ normal(a+b*x_obs, sqrt(sigma^2+(b*sigma_m)^2));
}

Compared with the usual formulation:

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x_obs;
  real sigma_m;
}

parameters {
  real a;
  real b;
  vector[N] x_true;
  real<lower=0> sigma;
  real mu_x;
  real<lower=0> sd_x;
}

model {
  a ~ normal(0,5);
  b ~ normal(0,5);
  sigma ~ cauchy(0,5);
  x_true ~ normal(mu_x,sd_x);      // prior (partial pooling) on the latent x
  x_obs ~ normal(x_true,sigma_m);  // measurement model
  y ~ normal(a+b*x_true, sigma);   // regression on the latent x
}

Now, when fitting both models to simulated data, the second one recovers the known a, b, and \sigma values. However, for the first one the b estimates are consistently biased low – in fact, they are exactly equal to the estimates from simply fitting a linear regression of y on the noisy x (i.e. the classic regression dilution). The \sigma estimates are close to the known value. This was surprising to me, particularly because the first formulation avoids having to treat x_{true} as a vector of model parameters, with the slower estimation and performance tuning that entails.
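
To make the dilution concrete, here is a rough R sketch (the parameter values are made up for illustration); the ordinary regression of y on the noisy x recovers roughly b \cdot sd_x^2/(sd_x^2+\sigma_m^2) – the classic attenuation factor – rather than b itself:

set.seed(123)
n <- 5000
a <- 1; b <- 2; sigma <- 1; sigma_m <- 0.7
mu_x <- 0; sd_x <- 1

x_true <- rnorm(n, mu_x, sd_x)
x_obs <- rnorm(n, x_true, sigma_m)
y <- rnorm(n, a + b * x_true, sigma)

coef(lm(y ~ x_obs))[2]             # attenuated slope, roughly 1.34 here
b * sd_x^2 / (sd_x^2 + sigma_m^2)  # classic dilution factor: about 1.34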

My rough understanding is that the first model accounts for the extra vertical “spread” in the data but not the extra horizontal “spread”, leading to the diluted slope estimates. Can anyone provide a more rigorous description or perhaps identify an error in the derivation of the first model?


I’ve been really enjoying this question and am also somewhat perplexed by it, because I can see intuitively that the answer is wrong, but I don’t understand from your derivation why it is wrong (i.e. which step goes awry). I bet @martinmodrak or @nhuurre could set us straight.

To see intuitively that the final result doesn’t work, consider two Stan statements. Statement 1 doesn’t encode the measurement error part at all:

target += normal_lpdf(y | a + b * x_obs, sigma);

Statement 2 is your version that doesn’t work as intended:

target += normal_lpdf(y | a + b * x_obs, sqrt(sigma^2+(b*sigma_m)^2));

We know that the correct model should estimate a steeper slope (larger magnitude for b) than the first model, but we can also see that there’s no reason (assuming weak priors) that the second model would estimate a steeper slope than the first. We’ll show this by contradiction. Assume that the best estimate for b is steeper in the second model than in the first. If so, then the first model could exactly recapitulate the second model’s fit by adopting that same steeper b and inflating its sigma to absorb the extra (b * sigma_m)^2 term. Under a weak prior on sigma, there’s no penalty for doing so, so the first model would prefer the steeper slope too, contradicting the assumption. And so there’s every reason to think that these two models should yield very similar estimates for b.
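
As a quick sanity check of this argument (a rough sketch with made-up values, fitting both statements by maximum likelihood instead of running Stan), the two statements end up preferring essentially the same slope as a plain lm on the noisy x:

set.seed(1)
n <- 2000; a <- 1; b <- 2; sigma <- 1; sigma_m <- 0.7
x_true <- rnorm(n)
x_obs <- rnorm(n, x_true, sigma_m)
y <- rnorm(n, a + b * x_true, sigma)

# Statement 1: plain normal likelihood on the noisy x (p = (a, b, log sigma))
nll1 <- function(p) -sum(dnorm(y, p[1] + p[2] * x_obs, exp(p[3]), log = TRUE))
# Statement 2: residual sd inflated by the (b * sigma_m)^2 term
nll2 <- function(p) -sum(dnorm(y, p[1] + p[2] * x_obs,
                               sqrt(exp(p[3])^2 + (p[2] * sigma_m)^2), log = TRUE))

fit1 <- optim(c(0, 1, 0), nll1)
fit2 <- optim(c(0, 1, 0), nll2)
c(lm = unname(coef(lm(y ~ x_obs))[2]), m1 = fit1$par[2], m2 = fit2$par[2])
# all three slope estimates are attenuated well below the true b = 2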

I think the problem is that the “intuitive” derivation has no way to incorporate the prior on x_true which is included in the “usual” formulation. If you remove the prior on x_true (and hence also the mu_x and sd_x parameters) from the “usual” formulation you get a somewhat badly behaved model whose results match the “intuitive” version almost exactly.

Note that the prior makes the x_{true} correlated, so y_i are not conditionally independent, given just a and b (though they are conditionally independent, given a, b and x_{true}). In this normal-normal setting, you could definitely still marginalize out the x_{true}, but you’d need to build the respective covariance matrix for y_i and use a multivariate normal likelihood.

Does that make sense?


Totally makes sense, but I’m also somewhat startled by the implications. Here the generative model is

x_{true} \sim \textrm{N}(\mu_x, \sigma_x)
x_{obs} \sim \textrm{N}(x_{true}, \tau)
y \sim \textrm{N}(a + bx_{true}, \sigma_r)

This generative model, on its face, contains a pair of assumptions about the true values x_{true} that are not required in normal regression:

  1. x_{true} is normally distributed in the population
  2. Samples are chosen at random, without reference to the values of x_{obs} (e.g. no stratification).

Are these assumptions lurking inside measurement-error models all the time? (this would imply, for example, that the assumptions for these models are never met if x_{obs} is clearly not normally distributed).


Thanks for the input @martinmodrak and @jsocolar. There are a good number of papers in my field that are based on the first formulation, so I’m trying to make sure I have everything straight before going around claiming it’s incorrect. One would hope that simply showing that the first version does not correctly recover known parameter values while the second does would be sufficient, but I suspect that people may expect a more theoretical justification.

Here is what I came up with. To see exactly where it breaks down we can return to the derivation where we substitute and arrive at:
y_i = a+b*x_{obs,i}+b*\delta_i+\epsilon_i
If we define
\eta_i=b*\delta_i+\epsilon_i\sim\text{Normal}(0,\sqrt{(b\sigma_m)^2+\sigma^2})
then
y_i = a+b*x_{obs,i}+\eta_i
The problem here is that \eta_i and x_{obs,i} are not independent (whereas \epsilon_i and x_{true,i} are), so the jump to
y_i \sim \text{Normal}(a+b*x_{obs,i},\sqrt{(b\sigma_m)^2+\sigma^2}) is not justified. My difficulty has been in explaining exactly why. But it is relatively easy to show with simulation – fixing a, b, \sigma, and \sigma_m (and picking \mu_x and sd_x to generate x_{true}):

x_true <- rnorm(n,mu_x,sd_x)
x_obs <- rnorm(n,x_true,sigma_m)
y <- rnorm(n,a+b*x_true,sigma)
epsilon <- y - (a+b*x_true)
eta <- y - (a+b*x_obs)

and compare \epsilon vs x_{obs} (or x_{true}) with \eta vs x_{obs}.
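
For instance, a self-contained version of that check (with made-up parameter values):

set.seed(99)
n <- 5000
mu_x <- 0; sd_x <- 1
a <- 1; b <- 2; sigma <- 1; sigma_m <- 0.7

x_true <- rnorm(n, mu_x, sd_x)
x_obs <- rnorm(n, x_true, sigma_m)
y <- rnorm(n, a + b * x_true, sigma)
epsilon <- y - (a + b * x_true)
eta <- y - (a + b * x_obs)

cor(epsilon, x_obs)  # roughly zero: epsilon is independent of the regressor
cor(eta, x_obs)      # clearly negative: eta and x_obs share the same delta
plot(x_obs, eta)     # visible downward trend; plot(x_obs, epsilon) shows none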

Is this convincing?


I find this convincing. It’s also convincing, I think, to point out that \eta_i and x_{obs,i} both contain a term involving \delta_i, so we wouldn’t expect independence here.

Finally, it’s useful to observe: if the data generating process does not begin from fixed x_{true} but rather begins from fixed x_{obs} and then simulates x_{true}, you don’t end up with problems in estimation:

N <- 10000
b <- 1

# Case 1: start from fixed x_obs, then generate x_true around it
x_obs <- rnorm(N)
x_true <- rnorm(N, mean = x_obs)
y <- rnorm(N, b*x_true)
lm(y ~ x_obs)  # slope is close to b

# Case 2: start from fixed x_true, then generate a noisy x_obs
x_true <- rnorm(N)
x_obs <- rnorm(N, mean = x_true)
y <- rnorm(N, b*x_true)
lm(y ~ x_obs)  # slope is attenuated

That is very interesting. But it seems like they aren’t exactly equivalent. The first approach implies x_{true} \sim \text{Normal}(0,\sqrt{2}) and x_{obs} \sim \text{Normal}(0,1), and the second model implies the opposite.
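
A quick numerical check of those implied marginal spreads:

N <- 1e6
x_obs1 <- rnorm(N); x_true1 <- rnorm(N, mean = x_obs1)
c(sd(x_obs1), sd(x_true1))  # about 1 and sqrt(2)
x_true2 <- rnorm(N); x_obs2 <- rnorm(N, mean = x_true2)
c(sd(x_true2), sd(x_obs2))  # about 1 and sqrt(2)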

Exactly :)

Edit: I should be more explicit about what I’m driving at. The first data generating process is what you imply when you write x_{true,i} = x_{obs,i} + \delta_i. If you instead wrote x_{obs,i} \sim \text{Normal}(x_{true,i}, \sigma_m), then you would see that to complete a generative model you need a way to generate the values of x_{true}.

I’d be careful with the conclusions here. As @jsocolar notes, the two variants make different assumptions about the data-generating process. Which of the two (if any) is “correct” depends on which set of assumptions is closer to what has actually happened in the real world.

Note that even though the two variants differ only in the prior distribution, the difference can be quite important (as you’ve seen), because you have only a single observation for each element of x_true, and so the prior never washes away.

Nor do those two options exhaust the possibilities: you could put independent non-flat priors on x_true (e.g. x_true ~ normal(0, 10)). Or you could keep the partial pooling but let the prior mean differ between subgroups of the data, or even be driven by a Gaussian process indexed by another covariate…

As always, the only semi-reliable way to decide between competing models is a combination of domain expertise and empirical testing of the model’s assumptions/consequences. In the case here, as @jsocolar notes, the partial pooling approach implies a normal distribution of x_obs, which is relatively easy to check. If x_obs is wildly non-normal, we can consider the partial pooling approach to be falsified. The marginalized version implies conditional independence of y given a and b, which might be testable by some residual analysis (not 100% sure about this – I didn’t really test it or think it through)…
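
For example, a minimal sketch of that normality check in R (x_obs below is a simulated stand-in; in practice you would plug in the observed values):

x_obs <- rnorm(200)           # stand-in; replace with the observed x values
hist(x_obs, breaks = 30)      # look for gross departures from normality
qqnorm(x_obs); qqline(x_obs)  # points far off the line suggest non-normality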

Hope that clarifies more than confuses.


Could you share a paper or two that takes this approach? Or has someone else implemented it in R in a way that demonstrates that it works?

I could look around, though I don’t think there are any that provide the code to estimate the model – usually it’s just a quick derivation of the likelihood and then a table of estimated coefficients.

Yes, these are very good points to consider.

I am currently puzzling over exactly this point. I am curious in what scenario x_{obs} would represent the underlying truth and x_{true} the noisy measurement. It seems to me that this is the wrong causal direction.

Here’s a pair of examples:

  1. Suppose that I measure some time-varying process today, yielding x_{obs}, but I believe that the response in the regression corresponds to the value of the process tomorrow (x_{true}). If the process evolves via a Gaussian random walk, then x_{obs} is not random but x_{true} is, and can be represented as
x_true ~ normal(x_obs, sigma1);
  2. Suppose we estimate x_{true} via a set of independent Bayesian models (one per observation) that yield uncorrelated Gaussian posteriors for x_{true}. Then we might use these posteriors as priors on the x_{true} in downstream inference. If we call the posterior means x_{obs}, then the model we end up with is again
x_true ~ normal(x_obs, sigma1);

It’s subtle (for me at least) to appreciate why it isn’t appropriate to consider the posterior means x_{obs} to be noisy measurements of the underlying x_{true}, and to treat this as a classic measurement error model.