Modeling systematic error in Stan

Hello,
I have a measured variable/feature that has a systematic error. I want to model that error and estimate/predict the true measurement. I modeled it as below, but I am not sure it is correct. I wonder whether I need y here, and whether y or x_meas are the estimated true measurements.
I also tried to run loo(fit) but got an error about the log-likelihood. Here is my code:

stan_code <- "
data {
  int<lower=0> N;         // number of observations
  array[N] real x_meas;   // measurements of x
  real<lower=0> tau;      // measurement noise (standard deviation)
  array[N] real y;        // observed outcomes
}

parameters {
  array[N] real x;        // unknown true values of x
  real mu_x;              // prior location parameter for x
  real<lower=0> sigma_x;  // prior scale parameter for x
  real alpha;             // intercept for model y
  real<lower=0> sigma;    // standard deviation of residuals
}

model {
  // Priors
  mu_x ~ normal(0, 10);   // prior for mu_x
  sigma_x ~ normal(0, 10); // prior for sigma_x
  
  // Prior for x
  x ~ normal(mu_x, sigma_x);
  
  // Measurement model
  x_meas ~ normal(x, tau);
  
  // Model for y
  y ~ normal(alpha, sigma);
}

generated quantities {
  array[N] real log_lik_x_meas;
  array[N] real log_lik_y;
  real log_prior = normal_lpdf(mu_x | 0, 10) + normal_lpdf(sigma_x | 0, 10);
  array[N] real x_pred;
  array[N] real y_pred;
  
  for (n in 1:N) {
    // Log-likelihood for measurement model
    log_lik_x_meas[n] = normal_lpdf(x_meas[n] | x[n], tau);
    
    // Log-likelihood for y model
    log_lik_y[n] = normal_lpdf(y[n] | alpha, sigma);
    
    // Predictive distribution for x
    x_pred[n] = normal_rng(mu_x, sigma_x);
    // Predictive distribution for y
    y_pred[n] = normal_rng(alpha, sigma);
  }
}
"
y <- d$hb            # observed outcomes
n <- length(x_meas)  # x_meas holds the measured values


tau <- 0.7                  # example measurement noise

# Data to be passed to the Stan model

stan_data <- list(N = n, x_meas = x_meas, tau = tau, y = y)

fit <- stan(model_code = stan_code, data = stan_data, iter = 100000, chains = 6, init = 0,
            control = list(adapt_delta = 0.99, max_treedepth = 15), cores = future::availableCores())
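
On the loo error mentioned at the top: loo(fit) looks for a generated quantity named log_lik by default, while this model names its pointwise log-likelihoods log_lik_x_meas and log_lik_y. A sketch of a workaround, assuming the rstan fit object from the call above:

library(loo)
# point loo at the pointwise log-likelihood actually defined in generated quantities
log_lik_mat <- extract_log_lik(fit, parameter_name = "log_lik_x_meas", merge_chains = FALSE)
r_eff <- relative_eff(exp(log_lik_mat))        # relative effective sample sizes per observation
loo_result <- loo(log_lik_mat, r_eff = r_eff)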

Do I need the y variable here, given that I only want to model the systematic error?

What you’re doing is correct, but not really useful, because the models for x and y are entirely separate right now. That is, you could drop everything containing y and still get the same results for x. Hence, the true values for x are informed entirely by the measured values (and to some extent by the prior).

If you can specify the relation between y and the true values for x (see the docs for an example), then the estimates for the true values of x will be driven both by the measured values of x and by the association with the observed outcomes y. This works even better if you are also able to include the relationship between the mismeasured x and other features.

The problem with the likelihood may solve itself this way, but the full error message would certainly help us say more.


@mippeqf thank you for your input. Indeed, I want to estimate the true x values independent of y, based on the prior and the observed values.
For example, we measure blood hemoglobin in different laboratories, and there are measurement errors in some laboratories. Independent of the laboratory site, we want to estimate the measurement error. For tau, I used a fixed value smaller than the observed variance, and I do not have an informative prior for the true x. This model reduced the variance without shifting the mean, which is what we would expect. Below is a graph from the above code on an example.
[Figure: measured_x vs. true_x]
Thanks for your feedback.

@mippeqf I used the model below for the figure above.

stan_code <- "
data {
  int<lower=0> N;         // number of observations
  array[N] real x_meas;   // measurements of x
  real<lower=0> tau;      // measurement noise (standard deviation)
}

parameters {
  array[N] real x;        // unknown true values of x
  real mu_x;              // prior location parameter for x
  real<lower=0> sigma_x;  // prior scale parameter for x
}

model {
  // Priors
  mu_x ~ normal(0, 10);    // prior for mu_x
  sigma_x ~ normal(0, 10); // prior for sigma_x

  // Prior for x
  x ~ normal(mu_x, sigma_x);

  // Measurement model
  x_meas ~ normal(x, tau);
}

generated quantities {
  array[N] real log_lik_x_meas;
  real log_prior = normal_lpdf(mu_x | 0, 10) + normal_lpdf(sigma_x | 0, 10);
  array[N] real x_pred;

  for (n in 1:N) {
    // Log-likelihood for measurement model
    log_lik_x_meas[n] = normal_lpdf(x_meas[n] | x[n], tau);

    // Predictive distribution for x
    x_pred[n] = normal_rng(mu_x, sigma_x);
  }
}
"

Thanks for the context, that makes it easier to think about. If you’re sure that there are no persistent level shifts (i.e., one lab always having estimates that are too high or too low), then your approach is certainly the best way to go about it. “Modelling away” the measurement noise will reduce the variance of the true underlying values, but you should be cautious about the statistical validity if this is about reaching some significance threshold.

Specifically, if the measurement noise tau was obtained by repeatedly running the same blood sample through the same machine (at least 30 times), and this was then repeated with multiple blood samples across all labs, this would (probably) be a valid estimate for tau. If, however, tau is somehow derived from the same data set that you want to correct for measurement error, then you should not use this type of setup for variance reduction.
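
A minimal sketch of that valid route, assuming a hypothetical data frame replicates with columns sample_id and value from the repeated runs:

# pooled within-sample variance across replicate runs (hypothetical data)
within_var <- tapply(replicates$value, replicates$sample_id, var)
tau_hat <- sqrt(mean(within_var))  # pooled SD as an estimate of tau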

If you do suspect that there is heterogeneity in the measurement quality across labs (i.e., that certain labs are always reporting values that are either too high or too low), it would be very feasible to model the lab-specific bias, which would arguably give a more exciting model :)
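
For what that could look like, here is a minimal Stan sketch of lab-specific bias; all names are illustrative, it assumes a lab index per observation, and the bias is pinned to zero for a reference lab for identifiability:

data {
  int<lower=0> N;
  int<lower=1> L;                       // number of labs
  array[N] int<lower=1, upper=L> lab;   // lab index per observation
  array[N] real x_meas;
  real<lower=0> tau;
}
parameters {
  array[N] real x;
  real mu_x;
  real<lower=0> sigma_x;
  vector[L - 1] bias_free;              // free biases; lab 1 is the reference
}
transformed parameters {
  vector[L] bias = append_row(0, bias_free);  // bias fixed at 0 for lab 1
}
model {
  mu_x ~ normal(0, 10);
  sigma_x ~ normal(0, 10);
  bias_free ~ normal(0, 5);
  x ~ normal(mu_x, sigma_x);
  for (n in 1:N)
    x_meas[n] ~ normal(x[n] + bias[lab[n]], tau);  // lab-shifted measurement model
}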


I’m not entirely sure what your intent is, but if it’s to build a measurement error model (a regression of y on x where there’s measurement error in x), then you may want to look at the User’s Guide chapter (same as linked by @mippeqf):

https://mc-stan.org/docs/stan-users-guide/measurement-error.html

Your measurement model is fine as x_meas ~ normal(x, tau), but the trick to turning this into a full measurement error model is to use x in a regression, e.g.,

y ~ normal(alpha + beta * x, sigma);

This uses the “true” values x rather than the measured values x_meas in the regression. Otherwise you don’t connect the measurement error model to anything as @mippeqf points out.
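
Concretely, a sketch of the full program along those lines, following the User's Guide pattern (the priors here are illustrative):

data {
  int<lower=0> N;
  array[N] real x_meas;   // measurements of x
  real<lower=0> tau;      // known measurement noise (sd)
  vector[N] y;            // observed outcomes
}
parameters {
  vector[N] x;            // unknown true values of x
  real mu_x;
  real<lower=0> sigma_x;
  real alpha;             // intercept
  real beta;              // slope on the true x
  real<lower=0> sigma;    // residual sd
}
model {
  mu_x ~ normal(0, 10);
  sigma_x ~ normal(0, 10);
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  sigma ~ normal(0, 5);
  x ~ normal(mu_x, sigma_x);            // prior on true values
  x_meas ~ normal(x, tau);              // measurement model
  y ~ normal(alpha + beta * x, sigma);  // regression on the true x
}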

I don’t see how you can do this without y, as the only information you have about the true x comes from the normal(mu_x, sigma_x) prior, which will just give you something like an average of mu_x and x_meas for the values of x. In fact, it could be computed analytically by marginalizing out the x.
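
To spell that out: for fixed mu_x and sigma_x, normal-normal conjugacy gives each x_n a posterior that is a precision-weighted average,

$$
\mathbb{E}[x_n \mid x_n^{\text{meas}}] = \frac{x_n^{\text{meas}}/\tau^2 + \mu_x/\sigma_x^2}{1/\tau^2 + 1/\sigma_x^2},
\qquad
\operatorname{Var}[x_n \mid x_n^{\text{meas}}] = \frac{1}{1/\tau^2 + 1/\sigma_x^2},
$$

which shrinks each measurement toward mu_x and has smaller variance than either tau^2 or sigma_x^2.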

P.S.

I’d strongly recommend putting Stan code in its own file rather than in a string, so that line numbers in error messages are meaningful and you can debug the Stan program on its own.


Thank you for the hints.
I was hoping to use this method to reduce the variance due to systematic error. As in the example I provided, the measured blood hemoglobin level can be affected by systematic bias. For example, if blood sits in the lab for hours, the hemoglobin level decreases. However, another marker’s level increases if blood sits for a long time. Given that we have information on both hemoglobin and this marker, and considering your and @mippeqf’s hints, I wonder if I can treat hemoglobin as “y” and the second marker, a proxy for old blood (sitting in the lab for a long time), as “x”, and then compute the true value of hemoglobin. Does this make sense?

To paraphrase, you have a marker x that tells you roughly how long the blood sample sat around, and you want to use it to correct for hemoglobin breaking down over time, correct? Is this “decay over time” what you mean by systematic error? If so, does the decay happen to take place at an exponential rate (or some other known functional form)?

@mippeqf Marker x doesn’t tell us how long the blood sample sat around; it just increases, and we do not know by how much. The hemoglobin measurement here is prone to systematic error. I hope to estimate the true value of hemoglobin knowing its measured value and variance, and knowing that it doesn’t have this issue in another lab. I am unsure how marker x can help model the true value of hemoglobin. I only have a single measurement of marker x and hemoglobin, which are measured together.

I still don’t understand what you mean by systematic error. That term only means that there is some sort of bias on the variables, but we need structure to identify it. That is, if there is heterogeneity in the bias across labs (or time, or people, etc.), or the bias is related to the marker x, or a combination of both, we can separate blood samples that are likely less biased from those that are likely more biased. Note that without some sort of structure, you cannot separate the bias in x from the mean of x.

As you correctly point out, we can use x as a predictor for y, and that would entirely follow the example on measurement error in the docs. But if you have information on who, when, or where each sample was collected, you can (probably) reduce the variance of the predicted true values by a lot. Equally, if there is some (molecular) theory for how x relates to y, you can replace the linear relationship between y and x with something (curved) that will likely reduce the variance of the predicted true values further.

If you want to go with a simple baseline, you can follow the example from the docs linked to above. If you want to reduce the prediction error beyond this baseline, you can add more structure, as I have outlined.
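
As an illustration of the "more structure" idea: assuming (purely hypothetically) that hemoglobin decays exponentially in the marker level, the linear link in the combined model sketched earlier could be swapped for a curved one, with gamma as a hypothetical decay-rate parameter:

data {
  int<lower=0> N;
  array[N] real x_meas;
  real<lower=0> tau;
  vector[N] y;
}
parameters {
  vector[N] x;
  real mu_x;
  real<lower=0> sigma_x;
  real alpha;
  real beta;
  real<lower=0> gamma;    // hypothetical decay-rate parameter
  real<lower=0> sigma;
}
model {
  mu_x ~ normal(0, 10);
  sigma_x ~ normal(0, 10);
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  gamma ~ normal(0, 1);
  sigma ~ normal(0, 5);
  x ~ normal(mu_x, sigma_x);
  x_meas ~ normal(x, tau);
  y ~ normal(alpha + beta * exp(-gamma * x), sigma);  // curved link instead of linear
}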

Do you have access to hemoglobin measurement data from the other lab that does not have the problem? This would be sufficient to calibrate the model, even if you did not have data on x.

There is heterogeneity across labs, and we have data from a lab showing no bias. Because there is inter-lab heterogeneity in both bias and biological factors, I cannot adjust for lab in the model. Knowing the values in one lab can help to develop the model without including lab location as a fixed or random effect. As I said, such adjustments would remove the between-lab heterogeneity in the biological factors. Thanks for your help.