Estimating mean contaminant concentration in sediment using a biased dataset

At our engineering firm we help the US Environmental Protection Agency manage the cleanup of massive contaminated sediment sites. The concentration of contaminants like PCBs and PAHs in the sediment of these lakes and rivers is best modeled with a right-skewed distribution like the lognormal or gamma.

The primary objective of sampling is to delineate highly contaminated areas, so the dataset of measured concentrations is positively biased. This makes it tricky to estimate the mean concentration in a rigorous way. The typical approach is to draw Thiessen polygons around each sample point and use the areas in a weighted average to “decluster” the data, but it’s difficult to quantify the uncertainty of that point estimate.
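In other words, the declustered point estimate is the area-weighted mean

\bar{x} = \frac{\sum_i A_i x_i}{\sum_i A_i}

where A_i is the area of the Thiessen polygon around sample x_i.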

I think a better alternative would be to simultaneously estimate the mean as well as the parameters of a variogram model describing the spatial autocorrelation. For the lognormal case with an exponential variogram model, this looks like:

x_i \sim \textrm{lognormal}(\mu, \Sigma)
\textrm{var}(x_i) = \sigma^{2}_{\textrm{sill}}
\textrm{cov}(x_i,x_j) = \sigma^{2}_{\textrm{sill}} - \gamma(h), \qquad \gamma(h) = \sigma^{2}_{\textrm{nugget}} + (\sigma^{2}_{\textrm{sill}} - \sigma^{2}_{\textrm{nugget}})\left [ 1 - \exp \left ( - \frac{3h}{a} \right ) \right ]

where x_i are the concentration measurements, \mu is the mean concentration that we’re trying to estimate, and the off-diagonal elements of the covariance matrix \Sigma are obtained from the variogram model \gamma with argument h (the distance between the locations of x_i and x_j) and parameters a, \sigma^{2}_{\textrm{nugget}}, and \sigma^{2}_{\textrm{sill}}. I would need to figure out how to best specify the priors.
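To make that concrete, here is a rough Stan sketch of one way I might write it down, treating the data as a Gaussian process on \log(x) with the covariance implied by the exponential variogram. The pairwise distance matrix h is assumed to be precomputed, the parameter names are mine, and the priors are left as placeholders.

```stan
data {
  int<lower=1> N;                      // number of sediment samples
  vector<lower=0>[N] x;                // measured concentrations
  matrix<lower=0>[N, N] h;             // pairwise distances between sample locations (m)
}
parameters {
  real mu;                             // mean of the log concentration
  real<lower=0> sigma2_sill;           // sill
  real<lower=0, upper=1> nugget_frac;  // nugget expressed as a fraction of the sill
  real<lower=0> a;                     // practical range (m)
}
model {
  matrix[N, N] K;
  real sigma2_nugget = nugget_frac * sigma2_sill;
  // covariance implied by the exponential variogram:
  // cov(x_i, x_j) = sill - gamma(h) = (sill - nugget) * exp(-3h / a) for h > 0
  for (i in 1:N) {
    K[i, i] = sigma2_sill + 1e-9;      // diagonal = sill, plus jitter for stability
    for (j in (i + 1):N) {
      K[i, j] = (sigma2_sill - sigma2_nugget) * exp(-3 * h[i, j] / a);
      K[j, i] = K[i, j];
    }
  }
  // priors on mu, sigma2_sill, nugget_frac, and a still to be specified
  log(x) ~ multi_normal(rep_vector(mu, N), K);
}
```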

Does anybody see any potential problems with this approach? Any suggestions would be greatly appreciated. I’m excited to introduce powerful statistical tools like Stan to the toolkit of solutions used to deal with uncertainty at these large contaminated sites.

What you’ve proposed looks (at first glance) like a standard spatial (e.g. Gaussian process) model for continuous outcomes. Introducing the spatial correlation should help reduce the bias in the estimated mean, but it won’t magically create data points at the locations where you don’t have measurements.

A feature, not a bug, of such a spatial model is that your uncertainty estimates will probably be larger (more honest) than an approach that ignores the non-iid nature of the data. But it might get brought up as a criticism.

I appreciate your input. I got to use HMC and other MCMC methods over the course of my PhD, but never in a Bayesian framework. It’s good to know that this idea passed the straight face test.

A feature, not a bug, of such a spatial model is that your uncertainty estimates will probably be larger (more honest) than an approach that ignores the non-iid nature of the data. But it might get brought up as a criticism.

Having a clear-eyed view of the uncertainty is really important. Often we want to make determinations as to whether the contamination is going away on its own due to natural processes, or if human intervention is needed. An overconfident set of mean concentration confidence intervals over time could easily give the wrong impression that cleanup is unwarranted.

Now that’s a worthwhile application!

Is the statement about \textrm{var}(x) meant to be a constraint on \Sigma?

Nope. As long as your covariance function leads to a positive definite matrix, the result is a log-Gaussian process. There’s a huge literature on GPs including an intro in a chapter of our User’s Guide. They can be really challenging to fit, so people sometimes back off from a full GP to really simple ICAR models, which are fast to fit. There, you just need adjacency between the areal units.

If you have enough data, the model might not be very sensitive to priors on the \sigma's. I would just go with something weakly informative as a prior (i.e., something that fixes the rough scale, not the value), such as normal(0, 5) if you expect to see values of roughly unit scale.
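In Stan that is just something like the following (hypothetical parameter names):

```stan
// weakly informative: pins down the order of magnitude, not the value
sigma_sill ~ normal(0, 5);     // acts as a half-normal given a <lower=0> declaration
sigma_nugget ~ normal(0, 5);
```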

There is some literature on geostatistical inference in cases like this; it usually goes under the terms “preferential sampling” or “informative sampling locations.” Depending on how many samples you have in the lower-concentration areas, you may end up with meaningful bias in your mean process even if you include spatial autocorrelation (Gelfand et al. 2012). A model has been proposed for simultaneously modeling the sampling intensity and the quantity of interest (Diggle 2010, with an update in Dinsdale and Salibian-Barrera 2019), but it drastically increases the model complexity by adding a spatial point process.

@Bob_Carpenter

Thank you. Reading through that chapter of the User’s Guide, it sounds like I could adapt the example given for fitting the hyperparameters of a latent variable GP, with the following changes (a rough sketch combining them follows the list):

  1. An additional hyperparameter for the mean concentration \mu needs to be added so that f = mu + L_K * eta.
  2. The prior for the length-scale hyperparameter \rho can probably be changed from an inverse gamma to a Gaussian. Based on extensive variography at sites like these, it’s reasonable to assume that the variogram’s range parameter is approximately 100 to 500 meters.
  3. The outcome needs to be changed from y ~ normal(f, sigma) to a lognormal or gamma, preferably the latter. Sediment concentration data tend not to be as positively skewed as would be implied by a lognormal distribution.

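Putting those three changes together, a rough sketch of the adapted program might look like the one below. I’ve kept cov_exp_quad from the User’s Guide example rather than an exponential kernel, and the prior values are only illustrative.

```stan
data {
  int<lower=1> N;                   // number of sediment samples
  array[N] vector[2] coords;        // sample locations (easting, northing), m
  vector<lower=0>[N] y;             // measured concentrations
}
transformed data {
  real delta = 1e-9;                // jitter for numerical stability
}
parameters {
  real mu;                          // change 1: mean of the latent log-concentration field
  real<lower=0> rho;                // length scale / range (m)
  real<lower=0> alpha;              // marginal standard deviation of the GP
  real<lower=0> shape;              // gamma shape parameter
  vector[N] eta;                    // standardized latent values
}
model {
  vector[N] f;
  {
    matrix[N, N] K = cov_exp_quad(coords, alpha, rho);
    for (n in 1:N) K[n, n] += delta;
    f = mu + cholesky_decompose(K) * eta;   // change 1: f = mu + L_K * eta
  }
  mu ~ normal(0, 5);
  rho ~ normal(300, 100);           // change 2: Gaussian prior, range believed ~100-500 m
  alpha ~ std_normal();
  shape ~ normal(0, 5);
  eta ~ std_normal();
  y ~ gamma(shape, shape * exp(-f));        // change 3: gamma likelihood with mean exp(f)
}
```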
That statement about \textrm{var}(x) in my original post was meant to specify the diagonal elements of \Sigma. I now see that there are functions like cov_exp_quad which define the diagonal and off-diagonal elements simultaneously.

If you have enough data, the model might not be very sensitive to priors on the \sigma's.

The idea would be to apply this approach to datasets with approximately 30-50 samples.

@jkbest2

Thank you so much for the suggested references. I’m currently in the “not sure what to Google” phase of my literature search.

You can probably work directly with a normal-outcome GP on \log y, which will be much faster with the analytic observation model. You just have to keep track of the conversions between the mean/sd of the log and the mean/sd on the regular scale.
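Those conversions are the standard lognormal identities: if \log y \sim \textrm{normal}(m, s^{2}), then

E[y] = \exp \left ( m + \tfrac{s^{2}}{2} \right ), \qquad \textrm{sd}(y) = E[y] \sqrt{\exp(s^{2}) - 1}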
