Heavy metals in soil: estimating max allowable field measurement

the problem
The western US has many areas where the soil is contaminated with lead, arsenic, and other harmful heavy metals due to mining activity. The concentration of these metals can be measured by

  1. sending a sample to a laboratory (precise, but takes a few days); or
  2. scanning a sample in the field with an XRF spectrometer (fast, but less precise).

When these sites are being dug up to find and remove contaminated soil, it’s desirable to use field measurements of concentration to avoid delays and reduce costs. A dataset of paired lab and field measurements is typically collected to evaluate what field measurement threshold corresponds to a lab measurement above the regulatory limit. Typical methods for doing that analysis, however, are limited to OLS linear regression in Excel.
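
For context, the typical analysis might look something like the sketch below; the data, variable names, and action level are illustrative placeholders, not values from this post.

import numpy as np

rng = np.random.default_rng(0)

# fake paired measurements, purely for illustration
field = np.exp(rng.normal(5.0, 1.0, size=200))     # field XRF readings
lab = 0.8 * field + rng.normal(0.0, 0.2 * field)   # paired lab results

# OLS fit of lab on field, then invert the fit at a hypothetical regulatory limit
slope, intercept = np.polyfit(field, lab, 1)
action_level = 400.0                               # hypothetical lab action level
field_threshold = (action_level - intercept) / slope
print(field_threshold)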

I have attempted to fit a Deming regression model to a synthetic paired measurement dataset, but my variance estimates are consistently biased low. Any suggestions are welcome; this will be a very useful thing if the kinks can be worked out.


the data
In addition to 200 paired measurements, there are 30 triplicate sets of (unpaired) field and lab measurements. Both types of measurement typically display strong heteroscedasticity. A JSON file for these synthetic data is here.

[image: scatter plot of the 200 paired lab and field measurements, with the true linear relationship shown as a gray dashed line]


the model
c_{\textrm{lab},i} \sim N(\mu_{\textrm{lab},i}, \nu_{\textrm{lab}}\mu_{\textrm{lab},i})

c_{\textrm{field},i} \sim N(\mu_{\textrm{field},i}, r\nu_{\textrm{lab}}\mu_{\textrm{field},i})

\mu_{\textrm{field},i} = \beta_0 + \beta_1 \mu_{\textrm{lab},i}

where \nu_{\textrm{lab}} is the coefficient of variation (COV) for laboratory measurements, r is the ratio of the field COV to the lab COV (assumed to be greater than 1), and a linear relationship between the mean field and lab measurements is specified by the coefficients \beta_0 and \beta_1 (shown as the gray dashed line in the first figure above).
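
To make the measurement model concrete, here is a minimal simulation sketch of the three equations above (all numeric values are illustrative only, not from the dataset):

import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 20.0, 1.2                   # illustrative linear relationship
nu_lab, r = 0.3, 3.0                       # illustrative lab COV and field/lab COV ratio

mu_lab = np.array([50.0, 200.0, 800.0])    # true concentrations on the lab scale
mu_field = beta0 + beta1 * mu_lab          # implied true field-scale means

c_lab = rng.normal(mu_lab, nu_lab * mu_lab)            # lab sd grows with the mean
c_field = rng.normal(mu_field, r * nu_lab * mu_field)  # field sd is r times larger (by COV)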

Priors are given below. (I’m new to Bayesian statistics, so if anything here looks odd, please don’t hesitate to point it out!)
\beta_0 \sim N(0, 100)
\beta_1 \sim N(1, 1)
\nu_{\textrm{lab}} \sim N(0.5, 1)
r \sim N(4, 1)
\mu_{\textrm{lab},i} \sim N(c_{\textrm{lab},i}, 0.5c_{\textrm{lab},i})


the results
True parameter values used to generate the synthetic data are shown as green lines in these marginal posterior distributions. Estimates for the COV were consistently biased low for all the synthetic datasets I generated.


the code
deming_test.stan (2.4 KB)

data {
    int n_obs; // number of paired measurements
    int n_rep_sets; // number of replicate sets
    int n_reps; // number of replicates per replicate set
    vector[n_obs] c_lab_meas; // laboratory concentration observations
    vector[n_obs] c_field_meas; // field concentration observations
    array[n_rep_sets] vector[n_reps] c_lab_reps; // replicate laboratory measurements
    array[n_rep_sets] vector[n_reps] c_field_reps; // replicate field measurements
}

parameters {
    // true mean of lab-measured concentrations
    vector<lower=0>[n_obs] c_lab_mu;

    // linear relationship between lab measurements and field measurements
    real beta0; // intercept
    real beta1; // slope

    // coefficient of variation parameters
    // (assumes field measurements are more imprecise)
    real<lower=0> cov_lab;
    real<lower=1> cov_field_to_cov_lab_ratio;

    // true means of replicate concentrations
    vector<lower=0>[n_rep_sets] c_lab_rep_mu;
    vector<lower=0>[n_rep_sets] c_field_rep_mu;
}

transformed parameters {
    // true mean of field-measured concentrations
    vector[n_obs] c_field_mu;
    c_field_mu = beta0 + beta1*c_lab_mu;

    // std deviations of paired lab measurements
    vector<lower=0>[n_obs] c_lab_sigma;
    c_lab_sigma = c_lab_mu*cov_lab;

    // coefficient of variation of field measurements
    real<lower=0> cov_field;
    cov_field = cov_lab*cov_field_to_cov_lab_ratio;

    // std deviations of field measurements
    vector<lower=0>[n_obs] c_field_sigma;
    c_field_sigma = c_field_mu*cov_field;

    // std deviations of replicate lab measurements
    vector<lower=0>[n_rep_sets] c_lab_rep_sigma;
    c_lab_rep_sigma = c_lab_rep_mu*cov_lab;

    // std deviations of replicate field measurements
    vector<lower=0>[n_rep_sets] c_field_rep_sigma;
    c_field_rep_sigma = c_field_rep_mu*cov_field;
}

model {
    // priors
    beta0 ~ normal(0,100);
    beta1 ~ normal(1, 1);
    cov_lab ~ normal(0.5, 1);
    cov_field_to_cov_lab_ratio ~ normal(4, 1);
    c_lab_mu ~ normal(c_lab_meas, 0.5*c_lab_meas);

    // paired measurement likelihood
    c_lab_meas ~ normal(c_lab_mu, c_lab_sigma);
    c_field_meas ~ normal(c_field_mu, c_field_sigma);

    // replicate measurement likelihood
    for (i in 1:n_rep_sets) {
        c_lab_reps[i] ~ normal(rep_vector(c_lab_rep_mu[i], n_reps), rep_vector(c_lab_rep_sigma[i], n_reps));
        c_field_reps[i] ~ normal(rep_vector(c_field_rep_mu[i], n_reps), rep_vector(c_field_rep_sigma[i], n_reps));
    }
}
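
For reference, here is a minimal sketch of fitting this model with CmdStanPy; the data file name is a placeholder for the JSON linked above, assumed to use the field names from the data block.

from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="deming_test.stan")
fit = model.sample(data="deming_data.json", chains=4, seed=1)   # placeholder data file name
print(fit.summary().loc[["beta0", "beta1", "cov_lab", "cov_field_to_cov_lab_ratio"]])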

I don’t know if it’s related to your problem, but I noticed something: your parameters and transformed parameters are usually lower-bounded (mostly at 0), yet your distribution choice for all of them is a normal, which has support over (-inf, inf). Why not use a more naturally [0, inf)-bounded distribution, like a lognormal, gamma, or log-logistic? Also, are your synthetic data from a generative model that’s an inversion of the descriptive model?

The underlying distribution of concentrations present at a site is generally zero-bounded and positively skewed, but measurement error isn’t zero-bounded. There may be some positive skew to the measurement error, though, so maybe a gamma-on-gamma Deming regression is a possibility. It seemed like a good idea to get things working with basic normal distributions first and then refine things if indicated by the model fit to real data.
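
If that gamma-on-gamma idea gets explored, one convenient route is to parameterize the gamma by a mean and a COV; a quick sketch (not part of the fitted model, values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
mean, cov = 200.0, 0.3                     # illustrative mean concentration and COV
shape = 1.0 / cov**2                       # gamma shape k: COV = 1/sqrt(k)
scale = mean * cov**2                      # gamma scale theta: mean = k*theta
draws = rng.gamma(shape, scale, size=5)    # strictly positive, positively skewed draws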

The synthetic data generation process and the Stan model I’m fitting to those data are identical.

I think cov_lab is off because c_lab_meas is used twice in the model:

    c_lab_mu ~ normal(c_lab_meas, 0.5*c_lab_meas);

    // paired measurement likelihood
    c_lab_meas ~ normal(c_lab_mu, c_lab_sigma);

If I remove the first line (effectively assigning c_lab_mu a uniform prior), the posterior distribution of cov_lab moves to 0.25-0.35.

As for cov_field, I don’t think your generating process matches the Stan model. The model says that

c_field_meas ~ normal(c_field_mu, c_field_mu*cov_field);

In the histogram the green line indicates a true value of cov_field around 0.9, and if you draw 200 samples from normal(mu, 0.9*mu) you’re going to get 20-30 negative values; but in your synthetic data there are no negative values of c_field_meas.
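
A quick numeric check of that claim (the fraction of negative draws depends only on the COV, since P(draw < 0) = P(Z < -1/0.9) ≈ 0.13):

import numpy as np
from scipy.stats import norm

print(norm.cdf(-1 / 0.9))                                    # ~0.13, i.e. ~27 of 200 draws

rng = np.random.default_rng(0)
draws = rng.normal(loc=100.0, scale=0.9 * 100.0, size=200)   # any positive mu gives the same fraction
print((draws < 0).sum())                                     # typically 20-30 negative values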


Ugh, it took copying my generative Python code into here to see my error, but you’re right on the money. My field measurements were generated using the lab standard deviation!

import numpy as np

rng = np.random.default_rng(12558)

n_obs = 200
n_rep_sets = 30
n_reps = 3

beta0_true = 20
beta1_true = 1.2

cov_field = 0.9
cov_lab = 0.3

# true lab-scale concentrations and their measurements
c_lab_mu = np.exp(rng.normal(loc=4.5, scale=1.3, size=n_obs))
c_lab_meas = rng.normal(loc=c_lab_mu, scale=c_lab_mu*cov_lab)

# the error: the field measurement scale uses the lab mean and lab COV
c_field_meas = rng.normal(
    loc=(beta0_true + c_lab_mu*beta1_true),
    scale=c_lab_mu*cov_lab
)

Subsequent model fits (with toned down COVs to reduce negative measurements) look much better.
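
For completeness, the corrected generation step presumably looks something like this, with the field scale built from the field mean and cov_field:

c_field_mu = beta0_true + beta1_true * c_lab_mu
c_field_meas = rng.normal(loc=c_field_mu, scale=c_field_mu * cov_field)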

The reason we use half-normal priors for things like scale parameters is that they’re consistent with values of 0 (like exponential, but unlike lognormal or general gamma cases).

If you have measurement error on a positive-constrained value, you can’t just use normal errors if there’s a chance of things becoming negative. Instead, you have to do something like alpha_true ~ lognormal(log(alpha_obs), sigma), which will keep alpha_true > 0.
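
A tiny numpy illustration of that lognormal measurement-error idea (a sketch; alpha_obs and sigma are arbitrary values):

import numpy as np

rng = np.random.default_rng(0)
alpha_obs = 150.0                                                      # an observed positive concentration
alpha_true = rng.lognormal(mean=np.log(alpha_obs), sigma=0.3, size=5)  # draws centered near alpha_obs
print(alpha_true.min() > 0)                                            # lognormal draws are always positive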