Intermediate calculations of predictor variables with measurement error

Can anyone point me to an example of how to model measurement error when multiple sources of error are embedded in the final variable, whether predictor or response?

To clarify, in soil science we make multiple lab measurements that we use to calculate further measurements down the line, each of which has error. For instance, if I want to measure soil respiration, I will measure the amount of moisture in soil so I can calculate CO2 per gram of dry mass of soil. Before the CO2 measurement I will also adjust samples to 65% of their total water holding capacity, which means that I need to measure the total moisture that each sample can hold. When I measure these individual properties–moisture content and water holding capacity–I do it in replicates. Almost all soil scientists average replicates and use the average values to calculate some final value, so measurement error compounds in the final value.

The examples I’ve found of measurement error in Stan allows representing error in the final variables, whether predictor or response. Does anyone know of examples of how to represent multiple sources of compounding error in intermediate calculations?

This is fun. In general, the advice is to think in terms of a generative model. Here’s a specific example that ignores some syntax and pretends like normal distributions are appropriate for everything. x below is like the moisture measurement, y is like another measurement down the chain. The structure below uses a specific model to propagate uncertainty in the thing y is used to estimate (a) into the estimation of c. So: 1) build the generative model; 2) this will likely involve pretty seriously non-linear calculations (because such is chemistry); and 3) ask for help with sampling efficiency problems once you have something that runs. As a start I would just try to stack two measurements and get those estimates even if it means ignoring the real scientific question. Feel free to ask if you have a specific model you’re trying to code in Stan.

data {
  int N;
  real x[N];

  int K;
  real y[K];

parameters {
  real a;
  real b;
  real c;

model {
 a ~ normal(0,1);
  b ~ normal(0,1);
  c ~ normal(0,1);
 g ~ normal(0,1);

  x ~ normal(a, b);   // a is the first estimated parameter (and its sd);
 y ~ normal(c * x, g);,   //   c  is estimated from data y but also depends on x


Here’s a very old case study of multi-compartment soil carbon ODEs with measurement error:

The errors combine in a usual way and there’s no averaging. That doesn’t necessarily mean the compound to produce even more error. If the noise is random, then it’ll usually largely cancel out and you’ll get tighter and tighter estimates with more and more data.

I should’ve added that if you have predictors and want error in measurement for the predictors, then you can do it in the usual way. There’s a chapter in the manual on various types of measurement error models (noise, rounding) as well as a chapter on censoring (in case there’s an artificial bound on measurements due to the device).

Hi guys,

Thanks for the response. Here’s a more specific example to hopefully make the problem more tangible.

measurement_error_data.csv (2.7 KB)

When we measure microbial biomass, we report respiration–after adding a carbon substrate–per dry weight of soil. We collect replicated measurement on soil moisture (Gravimetric Moisture Content = GMC) as well as replicated measurement on soil respiration (Substrate Induced Respiration = SIR). We then normalize respiration by soil moisture content to get our final microbial biomass. Currently, almost everyone in the field does something that looks like this:


# Read in data on soil moisture content
gmc <- read_csv("measurement_error_data.csv", 
                col_types = cols(`CO2_flux_ug.C.h-1` = col_skip(), 
                                 sample = col_character(), soil_mass_ambient_g = col_skip()))
# Calculate percent moisture
gmc$gmc <- (1-(gmc$oven_dry_g/gmc$ambient_mass_g))*100

# Read in data on microbial biomass (respiration, after adding C substrate)
sir <- read_csv("measurement_error_data.csv", 
                   col_types = cols(ambient_mass_g = col_skip(), 
                                    oven_dry_g = col_skip(), sample = col_character()))

# Calculate average moisture for each sample  
avg.gmc <- aggregate(gmc$gmc~gmc$sample, FUN=mean)
names(avg.gmc) <- c('sample','gmc')

# Merge mean moisture into respiration data set    
merged <- merge(sir,avg.gmc,by='sample')

# Calculate soil moisture per dry weight of soil, based on average moisture value for each sample    
merged$`CO2_flux_ug.C.gsoil-1h-1` <- merged$`CO2_flux_ug.C.h-1`/
  (merged$soil_mass_ambient_g - (merged$soil_mass_ambient_g*(merged$gmc/100)))  

# Average two microbial biomass readings for each sample and report as single value <- aggregate(merged$`CO2_flux_ug.C.gsoil-1h-1`~merged$sample,FUN=mean)
names( <- c('sample','microbial_biomass')

I’m not quite sure of the right Stan syntax, but it seems like I’d want something like:

a ~ normal(0,1);
tau ~ normal(0,1);
rho ~ normal(0,1);
sigma ~ normal(0,1);

gmc ~ normal(a * gmc_measured, tau); 
mu = (sir/(soil.mass - (soil.mass*(gmc/100))));
y_measured ~ normal(b*mu, rho);
y ~ normal(y_measured, sigma);

Am I way off base with logic & syntax? How would I account for the fact that each sample should be getting its own distributions, for both moisture content and respiration, that are defined by the observed data? The way that I understand what I just wrote (and what was written by @sakredja) is that both x and y are defined by some universal distribution from which intermediate calculations are made, not that each distribution will be specific to each sample.

Just looking at the code you wrote, I think you can do this with generated quantities.

It seems like you’ve made measurements of ambient_mass_g, oven_dry_g, soil_mass_ambient_g, and CO2_flux_ug.C.h-1. Each thing you repeated a couple times (I dunno if you get these all together or not).

From the look of this code, if you knew the true values of each of these parameters, you could compute the thing you wanted. But you don’t know them, so you need to estimate them.

I think something like this should work:

ambient_mass_g_measured ~ normal(ambient_mass_g, sigma1);
oven_dry_g_measured ~ normal(oven_dry_g, sigma2);
soil_mass_ambient_g_measured ~ normal(soil_mass_ambient_g, sigma3);
CO2_flux_ug.C.h-1_measured ~ normal(CO2_flux_ug.C.h-1, sigma4);

Then in a generated quantities block you can compute all the things you want (that are a function of these four parameters) and you’ll get the uncertainty on the output you want.

Since you don’t have much data per sample (and each will have different parameters) then you’ll want to model the parameters hierarchically.

Does this make sense? Let me know if I’m missing something.