the problem
The western US has many areas where the soil is contaminated with lead, arsenic, and other harmful heavy metals due to mining activity. The concentration of these metals can be measured by
- sending a sample to a laboratory (precise, but takes a few days) or;
- scanning a sample in the field with an XRF spectrometer (fast, but less precise).
When these sites are being dug up to find and remove contaminated soil, it’s desirable to use field measurements of concentration to avoid delays and reduce costs. A dataset of paired lab and field measurements is typically collected to evaluate what field measurement threshold corresponds to a lab measurement above the regulatory limit. Typical methods for doing that analysis, however, are limited to OLS linear regression in Excel.
I have attempted to fit a deming regression model to a synthetic paired measurement dataset, but my variance estimates are consistently biased low. Any suggestions are welcome; this will be a very useful thing if the kinks can be worked out.
the data
In addition to 200 paired measurements, there are 30 triplicate sets of (unpaired) field and lab measurements. Both types of measurement typically display strong heteroscedasticity. A JSON file for these synthetic data is here.
the model
c_{\textrm{lab},i} \sim N(\mu_{\textrm{lab},i}, \nu_{\textrm{lab}}\mu_{\textrm{lab},i})
c_{\textrm{field},i} \sim N(\mu_{\textrm{field},i}, r\nu_{\textrm{lab}}\mu_{\textrm{field},i})
\mu_{\textrm{field},i} = \beta_0 + \beta_1 \mu_{\textrm{lab},i}
Where \nu_{\textrm{lab}} is the coefficient of variation (COV) for laboratory measurements, r is the ratio of the field COV to the lab COV (assumed to be greater than 1), and a linear relationship between mean field and lab measurements is specified by the coefficients \beta_0 and \beta_1 (shown as gray dashed line in first figure above).
Priors are given below. (I’m new to Bayesian statistics, so if anything here looks odd here please don’t hesitate to point it out!)
\beta_0 \sim N(0, 100)
\beta_0 \sim N(1, 1)
\nu_{\textrm{lab}} \sim N(0.5, 1)
r \sim N(4, 1)
\mu_{\textrm{lab},i} \sim N(c_{\textrm{field},i}, 0.5c_{\textrm{field},i})
the results
True parameter values used to generate the synthetic data are shown as green lines in these marginal posterior distributions. Estimates for the COV were consistently biased low for all the synthetic datasets I generated.
the code
deming_test.stan (2.4 KB)
data {
int n_obs; // number of paired measurements
int n_rep_sets; // number of replicate sets
int n_reps; // number of replicates per replicate set
vector[n_obs] c_lab_meas; // laboratory concentration observations
vector[n_obs] c_field_meas; // field concentration observations
array[n_rep_sets] vector[n_reps] c_lab_reps; // replicate laboratory measurements
array[n_rep_sets] vector[n_reps] c_field_reps; // replicate field measurements
}
parameters {
// true mean of lab-measured concentrations
vector<lower=0>[n_obs] c_lab_mu;
// linear relationship between lab measurements and field measurements
real beta0; // intercept
real beta1; // slope
// coefficient of variation parameters
// (assumes field measurements are more imprecise)
real<lower=0> cov_lab;
real<lower=1> cov_field_to_cov_lab_ratio;
// true means of replicate concentrations
vector<lower=0>[n_rep_sets] c_lab_rep_mu;
vector<lower=0>[n_rep_sets] c_field_rep_mu;
}
transformed parameters {
// true mean of field-measured concentrations
vector[n_obs] c_field_mu;
c_field_mu = beta0 + beta1*c_lab_mu;
// std deviations of paired lab measurements
vector<lower=0>[n_obs] c_lab_sigma;
c_lab_sigma = c_lab_mu*cov_lab;
// coefficient of variation of field measurements
real<lower=0> cov_field;
cov_field = cov_lab*cov_field_to_cov_lab_ratio;
// std deviations of field measurements
vector<lower=0>[n_obs] c_field_sigma;
c_field_sigma = c_field_mu*cov_field;
// std deviations of replicate lab measurements
vector<lower=0>[n_rep_sets] c_lab_rep_sigma;
c_lab_rep_sigma = c_lab_rep_mu*cov_lab;
// std deviations of replicate field measurements
vector<lower=0>[n_rep_sets] c_field_rep_sigma;
c_field_rep_sigma = c_field_rep_mu*cov_field;
}
model {
// priors
beta0 ~ normal(0,100);
beta1 ~ normal(1, 1);
cov_lab ~ normal(0.5, 1);
cov_field_to_cov_lab_ratio ~ normal(4, 1);
c_lab_mu ~ normal(c_lab_meas, 0.5*c_lab_meas);
// paired measurement likelihood
c_lab_meas ~ normal(c_lab_mu, c_lab_sigma);
c_field_meas ~ normal(c_field_mu, c_field_sigma);
// replicate measurement likelihood
for (i in 1:n_rep_sets) {
c_lab_reps[i] ~ normal(rep_vector(c_lab_rep_mu[i], n_reps), rep_vector(c_lab_rep_sigma[i], n_reps));
c_field_reps[i] ~ normal(rep_vector(c_field_rep_mu[i], n_reps), rep_vector(c_field_rep_sigma[i], n_reps));
}
}