# Heavy metals in soil: estimating max allowable field measurement

the problem
The western US has many areas where the soil is contaminated with lead, arsenic, and other harmful heavy metals due to mining activity. The concentration of these metals can be measured by

1. sending a sample to a laboratory (precise, but takes a few days) or;
2. scanning a sample in the field with an XRF spectrometer (fast, but less precise).

When these sites are being dug up to find and remove contaminated soil, it’s desirable to use field measurements of concentration to avoid delays and reduce costs. A dataset of paired lab and field measurements is typically collected to evaluate what field measurement threshold corresponds to a lab measurement above the regulatory limit. Typical methods for doing that analysis, however, are limited to OLS linear regression in Excel.

I have attempted to fit a deming regression model to a synthetic paired measurement dataset, but my variance estimates are consistently biased low. Any suggestions are welcome; this will be a very useful thing if the kinks can be worked out.

the data
In addition to 200 paired measurements, there are 30 triplicate sets of (unpaired) field and lab measurements. Both types of measurement typically display strong heteroscedasticity. A JSON file for these synthetic data is here.

the model
c_{\textrm{lab},i} \sim N(\mu_{\textrm{lab},i}, \nu_{\textrm{lab}}\mu_{\textrm{lab},i})

c_{\textrm{field},i} \sim N(\mu_{\textrm{field},i}, r\nu_{\textrm{lab}}\mu_{\textrm{field},i})

\mu_{\textrm{field},i} = \beta_0 + \beta_1 \mu_{\textrm{lab},i}

Where \nu_{\textrm{lab}} is the coefficient of variation (COV) for laboratory measurements, r is the ratio of the field COV to the lab COV (assumed to be greater than 1), and a linear relationship between mean field and lab measurements is specified by the coefficients \beta_0 and \beta_1 (shown as gray dashed line in first figure above).

Priors are given below. (I’m new to Bayesian statistics, so if anything here looks odd here please don’t hesitate to point it out!)
\beta_0 \sim N(0, 100)
\beta_0 \sim N(1, 1)
\nu_{\textrm{lab}} \sim N(0.5, 1)
r \sim N(4, 1)
\mu_{\textrm{lab},i} \sim N(c_{\textrm{field},i}, 0.5c_{\textrm{field},i})

the results
True parameter values used to generate the synthetic data are shown as green lines in these marginal posterior distributions. Estimates for the COV were consistently biased low for all the synthetic datasets I generated.

the code
deming_test.stan (2.4 KB)

data {
int n_obs; // number of paired measurements
int n_rep_sets; // number of replicate sets
int n_reps; // number of replicates per replicate set
vector[n_obs] c_lab_meas; // laboratory concentration observations
vector[n_obs] c_field_meas; // field concentration observations
array[n_rep_sets] vector[n_reps] c_lab_reps; // replicate laboratory measurements
array[n_rep_sets] vector[n_reps] c_field_reps; // replicate field measurements
}

parameters {
// true mean of lab-measured concentrations
vector<lower=0>[n_obs] c_lab_mu;

// linear relationship between lab measurements and field measurements
real beta0; // intercept
real beta1; // slope

// coefficient of variation parameters
// (assumes field measurements are more imprecise)
real<lower=0> cov_lab;
real<lower=1> cov_field_to_cov_lab_ratio;

// true means of replicate concentrations
vector<lower=0>[n_rep_sets] c_lab_rep_mu;
vector<lower=0>[n_rep_sets] c_field_rep_mu;
}

transformed parameters {
// true mean of field-measured concentrations
vector[n_obs] c_field_mu;
c_field_mu = beta0 + beta1*c_lab_mu;

// std deviations of paired lab measurements
vector<lower=0>[n_obs] c_lab_sigma;
c_lab_sigma = c_lab_mu*cov_lab;

// coefficient of variation of field measurements
real<lower=0> cov_field;
cov_field = cov_lab*cov_field_to_cov_lab_ratio;

// std deviations of field measurements
vector<lower=0>[n_obs] c_field_sigma;
c_field_sigma = c_field_mu*cov_field;

// std deviations of replicate lab measurements
vector<lower=0>[n_rep_sets] c_lab_rep_sigma;
c_lab_rep_sigma = c_lab_rep_mu*cov_lab;

// std deviations of replicate field measurements
vector<lower=0>[n_rep_sets] c_field_rep_sigma;
c_field_rep_sigma = c_field_rep_mu*cov_field;
}

model {
// priors
beta0 ~ normal(0,100);
beta1 ~ normal(1, 1);
cov_lab ~ normal(0.5, 1);
cov_field_to_cov_lab_ratio ~ normal(4, 1);
c_lab_mu ~ normal(c_lab_meas, 0.5*c_lab_meas);

// paired measurement likelihood
c_lab_meas ~ normal(c_lab_mu, c_lab_sigma);
c_field_meas ~ normal(c_field_mu, c_field_sigma);

// replicate measurement likelihood
for (i in 1:n_rep_sets) {
c_lab_reps[i] ~ normal(rep_vector(c_lab_rep_mu[i], n_reps), rep_vector(c_lab_rep_sigma[i], n_reps));
c_field_reps[i] ~ normal(rep_vector(c_field_rep_mu[i], n_reps), rep_vector(c_field_rep_sigma[i], n_reps));
}
}


I don’t know if it’s related to your problem but I noticed something: your parameters and transformed parameters are usually lower bounded (mostly at 0), but your distribution choice for all of the parameters is a normal, which has support over (-inf,inf). Why not use a more naturally [0,inf) bounded distribution, like lognormal, gamma, or log-logistic distribution? Also, is your synthetic data from a generative model that’s an inversion of the descriptive model?

The underlying distribution of concentrations present at a site is generally zero-bounded and positively skewed, but measurement error isn’t zero-bounded. There may be some positive-skew to the measurement error, though, so maybe a gamma-on-gamma deming regression is a possibility. It seemed like a good idea to get things working with basic normal distributions and then refining things if indicated by the model fit to real data.

The synthetic data generation process and the Stan model I’m fitting to those data are identical.

I think lab_cov is off because c_lab_meas is used twice in the model

    c_lab_mu ~ normal(c_lab_meas, 0.5*c_lab_meas);

// paired measurement likelihood
c_lab_meas ~ normal(c_lab_mu, c_lab_sigma);


If I remove the first line (so effectively assign c_lab a uniform prior) the posterior distribution of cov_lab moves to 0.25-0.35.

As for cov_field, I don’t think your generating process matches the Stan model. The model says that

c_field_meas ~ normal(c_field_mu, c_field_mu*cov_field);


In the histogram the green line indicates a true value of cov_field around 0.9, and if you draw 200 samples from normal(mu, 0.9*mu) you’re going to get 20-30 negative values; but in your synthetic data there are no negative values of c_field_meas.

1 Like

Ugh, it took copying my generative Python code into here to see my error, but you’re right on the money. My field measurements were generated using the lab standard deviation!

rng = np.random.default_rng(12558)

n_obs = 200
n_rep_sets = 30
n_reps = 3

beta0_true = 20
beta1_true = 1.2

cov_field = 0.9
cov_lab = 0.3

c_lab_mu = np.exp(rng.normal(loc=4.5, scale=1.3, size=n_obs))
c_lab_meas = rng.normal(loc=c_lab_mu, scale=c_lab_mu*cov_lab)
c_field_meas = rng.normal(
loc=(beta0_true + c_lab_true*beta1_true),
scale=c_lab_true*cov_lab
)


Subsequent model fits (with toned down COVs to reduce negative measurements) look much better.

The reason we use half-normal priors for things like scale parameters is that they’re consistent with values of 0 (like exponential, but unlike lognormal or general gamma cases).

If you have measurement error on a positive-constrained value, you can’t just use normal errors if there’s a chance of things becoming negative. Instead, you have to do something like alpha_true ~ lognormal(alpha_obs, sigma), which will keep alpha_true > 0.