Bayesian Measurement Error Model in Stan

Hi everyone,

I’m working on a Bayesian Measurement Error Model in Stan and would appreciate your feedback on whether my code is correctly specified. The objective is to account for measurement errors in both x (Method 1) and y (Method 2) while modeling their true relationship using a Bayesian approach.

  • Method 1 estimates evapotranspiration,
  • Method 2 measures evapotranspiration and is treated as the reference.

Both methods are assumed to have a 10% error. I’d greatly appreciate any insights or suggestions on improving the model specification.
Thank you in advance.

Stan code:

data {
  int<lower=0> N;           // Number of data points (days)
  vector[N] x_obs;          // Observed values for x (Method 1)
  vector[N] y_obs;          // Observed values for y (Method 2)
  real<lower=0> sigma_x;    // Known measurement error in Method 1 (x) (10%)
  real<lower=0> sigma_y;    // Known measurement error in Method 2 (y) (10%)
}

parameters {
  real alpha;               // Intercept
  real beta;                // Slope
  real<lower=0> sigma;      // Residual standard deviation
}

model {
  // Priors
  alpha ~ normal (0, 10);   // Weakly informative prior for intercept
  beta ~ normal (0, 10);    // Weakly informative prior for slope
  sigma ~ cauchy (0, 5);    // Weakly informative prior for residuals

  // Structural model with error propagation for both x and y
  y_obs ~ normal(alpha + beta * x_obs, sqrt(sigma^2 + sigma_y^2 + beta^2 * sigma_x^2));
}
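As a sanity check on the error-propagation term: under the generative story x_obs = x_true + noise and y_obs = alpha + beta * x_true + model error + measurement noise, the residual around the line evaluated at x_obs should have standard deviation sqrt(sigma^2 + sigma_y^2 + beta^2 * sigma_x^2). A quick Monte Carlo sketch in Python (all parameter values below are made up for illustration, not from the real ET data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, purely for illustration.
alpha, beta, sigma, sigma_x, sigma_y = 1.0, 0.8, 0.3, 0.2, 0.2
N = 200_000

x_true = rng.normal(2.0, 1.0, N)              # latent true values
x_obs = x_true + rng.normal(0, sigma_x, N)    # Method 1: noisy observation of x_true
y_obs = (alpha + beta * x_true
         + rng.normal(0, sigma, N)            # structural (model) error
         + rng.normal(0, sigma_y, N))         # Method 2 measurement noise

# Residual around the regression line evaluated at the *observed* x.
resid = y_obs - (alpha + beta * x_obs)
print(resid.std())                                              # empirical sd
print(np.sqrt(sigma**2 + sigma_y**2 + beta**2 * sigma_x**2))    # propagated sd
```

The two printed numbers should agree closely, confirming that the combined scale in the `normal` likelihood above is the right one for this generative story.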

Hi, @Alessandro, and sorry for missing this earlier.

The usual thing to do here is to treat the true x as an unknown. The measurement error in y is rolled into sigma_y. Then you’d have something like this, where you fit sigma_x, sigma_y, and x:

parameters {
  vector[N] x;
  real<lower=0> sigma_x;
  real<lower=0> sigma_y;
}

model {
  sigma_x ~ normal(0.1, 1);
  sigma_y ~ normal(0.1, 1);
  x ~ normal(x_obs, sigma_x);
  ...
  y_obs ~ normal(alpha + beta * x, sigma_y);
}

The sigma_y variable soaks up modeling error and observation error in y_obs.

You’d want to adjust the location and scale of sigma_x and sigma_y.

Hi @Bob_Carpenter,
Thank you very much for your response. My overall goal is to evaluate the quality and reliability of Method 1 as a method for estimating daily evapotranspiration, using a model that explicitly accounts for uncertainty in both the predictor and the outcome. I’d also like to clarify that my dataset includes daily measurements of evapotranspiration, and the values differ from day to day. The two methods used are:
  - Method 1, which provides an estimate
  - Method 2, which serves as the reference method

I’ve adapted the Stan code according to your suggestions:

stan_code <- "
data {
  int<lower=1> N;
  vector[N] x_obs;              
  vector[N] y_obs;              
}
parameters {
  vector[N] x;                  
  real alpha;
  real beta;
  real<lower=0> sigma_x;        
  real<lower=0> sigma_y;        
}
model {
  // Priors
  sigma_x ~ normal (0.24, 0.11);     // 10% error: location = mean(Method 1) * 0.10, scale = sd
  sigma_y ~ normal (0.17, 0.09);     // 10% error: location = mean(Method 2) * 0.10, scale = sd
  alpha ~ normal (1.49, 0.5);
  beta ~ normal (1.40, 0.5);

  x_obs ~ normal (x, sigma_x);  
  y_obs ~ normal (alpha + beta * x, sigma_y);  
}
generated quantities {
  vector[N] y_pred;
  vector[N] residual_error;

  for (i in 1:N) {
    y_pred[i] = normal_rng(alpha + beta * x[i], sigma_y);
    residual_error[i] = fabs(y_obs[i] - y_pred[i]);   // this should be the error of Method 1
  }
}
"

I have a few additional questions to clarify some aspects of the model. I would really appreciate it if you could take a look and help me clarify the points below:

  1. Is the way I adjusted the location and scale of sigma_x and sigma_y appropriate?

  2. If my goal is to account for the error in Method 1 (the estimation of ET), should I model Method 1 as x (predictor) and Method 2 (reference method) as y (the outcome), or the other way around?

  3. To achieve my goal, is it appropriate to use the model above and explicitly include measurement error in both Method 1 and Method 2? Or is it also valid to include measurement error only for Method 2, assuming that Method 1 represents the modeled outcome?
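One practical reason the direction in question 2 matters: classical measurement error in the *predictor* attenuates a naive regression slope toward zero, which is exactly what the latent-x model corrects for. A small simulation sketch (the numbers 1.49, 1.40, 0.24, and 0.17 are taken from the priors above; everything else is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
beta_true, sigma_x, sigma_y = 1.40, 0.24, 0.17
N = 100_000

x_true = rng.normal(2.4, 1.0, N)              # latent daily ET values (hypothetical)
x_obs = x_true + rng.normal(0, sigma_x, N)    # Method 1 with measurement error
y = 1.49 + beta_true * x_true + rng.normal(0, sigma_y, N)

# Naive least-squares slope using the noisy predictor.
slope = np.polyfit(x_obs, y, 1)[0]

# Classical attenuation factor: var(x_true) / (var(x_true) + var(noise)).
attenuation = 1.0 / (1.0 + sigma_x**2 / 1.0**2)
print(slope, beta_true * attenuation)
```

The naive slope lands near beta_true times the attenuation factor, i.e. below the true 1.40; modeling the latent x explicitly removes this bias.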

I’m sorry, but I don’t understand what you mean by “provides an estimate” and “serves as the reference method”. By “method”, do you mean statistical model? If so, I only see one Stan program. This also meant I didn’t understand question (2) or (3) below.

You mean the priors? I didn’t understand the 10% error comments. Where did those priors come from? The sigma_x should reflect what you know about the observation process and shouldn’t be something fine-tuned to two decimal places from the data, if that’s where the numbers came from.

This looks like a standard measurement error model.

In your residual error calculation, you can just write

vector[N] residual_abs_error = fabs(alpha + beta * x - y_obs);

We know that the mean of y_pred is going to be alpha + beta * x, so if all we want is a tight estimate of the posterior mean of the residual error, it’s best not to introduce more uncertainty (this follows formally from the Rao-Blackwell theorem). If you wanted good uncertainty estimates for new instances you’d need the RNG.
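The variance-reduction point can be illustrated outside Stan: per-draw values of `alpha + beta * x` and per-draw `normal_rng(alpha + beta * x, sigma_y)` are both unbiased for the posterior-predictive mean, but the RNG version carries extra Monte Carlo variance. A sketch with made-up posterior draws (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
S = 2_000                          # number of posterior draws (made up)
mu = rng.normal(3.0, 0.1, S)       # pretend posterior draws of alpha + beta * x[i]
sigma_y = 0.17

y_rng = mu + rng.normal(0, sigma_y, S)   # what normal_rng would return per draw

# Both averages estimate the same posterior-predictive mean ...
print(mu.mean(), y_rng.mean())
# ... but the Rao-Blackwellized draws have much smaller spread,
# so their average converges faster.
print(mu.std(ddof=1), y_rng.std(ddof=1))
```

This is the Rao-Blackwell point in miniature: conditioning on the parameters (using the known mean) strictly reduces the variance of the Monte Carlo estimator.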

P.S. I’d remove the space between normal (.

Thanks again for your response. I’d like to clarify the context a bit further.

I am working with evapotranspiration (ET), which is the total amount of water transferred from land to the atmosphere. Specifically, I am evaluating two different methods for detecting daily ET in a fruit tree orchard, Method 1 and Method 2.

Regarding the first point:
Method 1: ET estimated from remote sensing (satellite data) (x_obs in the model); an indirect estimation of ET.
Method 2: ET measured directly in the field (with sensors of temperature, humidity, etc.), which is considered the reference method in the literature (y_obs in the model).

Regarding the second point:
Yes, I mean the priors. While Method 2 is used as a benchmark, it’s important to note that even these reference measurements carry some uncertainty. It is well established in the literature that Method 2 values typically have an uncertainty or error bound of about 10%. This is where the 10% value comes from.
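For concreteness, here is the arithmetic behind the prior locations used above: the 10% relative error is converted to an absolute scale by multiplying by each method's sample mean. The daily values below are hypothetical stand-ins (the real data are not shown in the thread), chosen so the means match the 0.24 and 0.17 used in the code:

```python
import numpy as np

# Hypothetical daily ET values (mm/day); stand-ins for the real data.
method1 = np.array([2.1, 2.6, 2.4, 2.5, 2.4])   # mean = 2.4
method2 = np.array([1.6, 1.8, 1.7, 1.7, 1.7])   # mean = 1.7

# "10% error" translated into the prior locations for sigma_x and sigma_y.
sigma_x_loc = 0.10 * method1.mean()
sigma_y_loc = 0.10 * method2.mean()
print(sigma_x_loc, sigma_y_loc)   # the 0.24 and 0.17 in the Stan code
```

Note this uses the observed data to set the prior, which is the data-dependence concern raised above; a literature-based 10% figure applied to a typical ET magnitude would avoid that.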

Are you trying to put them together into one joint model? That’s easy to do in a Bayesian setting if the models share parameters.

So the 10% is the measurement error scale? It took a couple of steps of indirection to get there.