Chemistry calibration curve

I would like to use stan to create and use a calibration curve. The word “calibration” as I use it here reflects its meaning in analytical chemistry, not statistics.

A calibration curve consists of two parts

  1. A regression model (usually linear, often a line with parameters for the slope and intercept). Typically we have standard solutions with known x values (e.g. concentrations) for which we have measured a set of corresponding y values (intensities). We estimate alpha (the intercept) and beta (the slope) . No problems there.

  2. Next we measure the intensities (y values) of several unknowns for which we wish to know the concentration (x value), and suppose that each unknown is measured in triplicate.

How can I use stan to estimate the posterior distributions of the concentrations (x values) corresponding to each unknown sample?

Here is my stan code so far, I’ve detailed my questions inline in the comments

data {
  // variables from the calibration curve data are prefixed with cal_
  int<lower=1> cal_N;          // number of standards
  vector[cal_N] cal_meas;    // measured intensities of the standards
  vector[cal_N] cal_known;  // known values of concentrations of the standards

  // Variables to do so-called inverse prediction:  predict x values from y values
  int<lower=1> meas_N;       // number of unknown intensity measurements
  vector[meas_N] y_meas;   // measured intensity values of the unknowns

  // I also need to be able to do ordinary forward prediction:
  // given new x values  predict the y's
  int<lower=1> new_x_N;
  vector[new_x_N] new_x;

}
parameters {
  real alpha;
  real<lower=0> beta;
  real<lower=0> sigma;
}
model {
  // Step one:  get the calibration curve
  cal_meas ~ normal(alpha + beta * cal_known, sigma);
  sigma ~ exponential(1);  // standard deviation is positive
  // I know the slope is about 750 because that's the way I made the data
  // There's a better prior to be had, but that's a different question of the day
  beta ~ normal(750, 100);

}
generated quantities {
  vector[new_x_N] new_y;
  vector[meas_N] x_inv_pred;
  real<lower=0> sigma_x;


  //----------------------------------------------------------------------
  // Q1:  How can I get stan to estimate the value of the scale factor for the x values?

  // A poor solution:
  // Inverse predict the value of the standard deviation of x from the value we found for sigma
  // sigma_x = (sigma - alpha)/beta
  // This won't work in general, because the error on the measurement variable could be much less
  // than the error on the intercept and then sigma_x would be negative

  // Another poor solution:
  // Pretend the intercept doesn't matter and divide by the slope
  sigma_x = sigma/beta;
  //-------------------------------------------------------------------------

  //-------------------------------------------------------------------------
  // Q2:  How to predict x values from my new y values?

  // An unsatisfactory method--
  //    -sigma_x is not well estimated
  //    -assuming normal distribution for x values--I'd like stan to figure out what the distribution is
  //    -errors in new y measurements are not taken into account
  for (n in 1:meas_N) {
    // Assume that x values predicted by the inverted equation
    // are normally distributed, because don't know how to specify
    // their actual distribution as found in the posterior
    x_inv_pred[n] = normal_rng((y_meas[n] - alpha) / beta, sigma_x);
  }
  //-------------------------------------------------------------------------

  // predict y's from new x's
  for (n in 1:new_x_N) {
    new_y[n] = normal_rng(alpha + beta * new_x[n], sigma);
  }

}

The data represents notional measurements of molybdenum concentrations and is taken from Hunter, 1984 (https://doi.org/10.1007/978-94-017-1026-8_3), but I made the problem more realistic by multiplying the measured y values by 747, as if they represented values obtained from an instrument. For the sake of this problem, assume the measured values are intensities, and the known values are concentrations in parts per billion (ppb)
Mb_data.csv (263 Bytes)

I’m working in R, using rstan.

Let’s say I have one new concentration to estimate, and I’ve recorded the following triplicate measurement, and I pass that data to the stan model when sampling, along with the calibration data:

new_Mb_data <- data.frame(y_meas = c(850, 867, 848))

d <- list(cal_meas = Mb_data$meas, cal_known = Mb_data$known, cal_N = nrow(Mb_data),
          meas_N = nrow(new_Mb_data), y_meas = new_Mb_data$y_meas,
          new_x_N = 2, new_x = c(0, 1))

# compile the model
comp_cal_model <- stan_model("calibration_regression_predict_x_and_y_Qoftheday.stan")
# sample
fit_model_cal <- sampling(comp_cal_model, data = d, chains = 4, seed = "123" )
# print the results
fit_model_cal

To repeat the questions in the stan code,
Q1: How can I get stan to estimate the value of the scale factor for the x values?
and
Q2: How to obtain a posterior distribution of x values from my new y values, taking into account the error in the new y measurements, a better estimate of sigma_x, and the fitted calibration curve?
Q3: I’ve read that the items in the generated quantiles block are calculated every time a sample is performed, and I think that means that those values are effectively using the posterior distribution of the parameters alpha and beta, is that correct?

Three final comments

  1. I think that this problem is a tiny example of “the inverse problem”, but the papers I’ve found that deal with that subject use models that are way more complicated and beyond my current comprehension.
  2. The real model I’m working towards will take into account non-negligible errors in the known concentrations used to generate the initial regression, heteroscedascity, will have a better prior for the slope, and non-linear or spline regression, and perhaps pool data from multiple curves performed on the same instrument for different analytes…
  3. Another way of solving this problem would be to use the posterior of the regression as the prior for a new, inverse regression model. I’ve seen a paper that advocates using quantiles of a posterior as priors for a future model. Sorry, can’t find the exact reference at the moment.

Thanks

That is correct.

Bayesian theory allows treating all unknown quantities as parameters. You can move x_inv_pred into parameters block and let Stan solve the inverse problem for you. This is a measurement error model.

parameters {
  ...
  real<lower=0> sigma;
  vector[meas_N] x_inv_pred;
}
model {
  // Measurements
  y_meas ~ normal(alpha + beta * x_inv_pred, sigma);
  // Step one:  get the calibration curve
  cal_meas ~ normal(alpha + beta * cal_known, sigma);
  sigma ~ exponential(1);  // standard deviation is positive
  ...
}
1 Like

Thank you for the quick reply. I never would have made the connection with a measurement error model.