Estimating measurement error of a device while controlling for group-level effects

I’m trying to estimate the mean measurement error associated with a measurement device for concentration of a chemical in solution. The measurement device uses a disposable component whose performance can vary between different manufacturing lots of the component. Several test solutions were prepared in a lab by adding known quantities of solute to stock solutions whose true concentrations of solute are only known to lie within a large interval (think taking several samples of ocean water from different locations and adding known quantities of salt to each sample).

Known variables are the following:

  • Sample site: “Site” (categorical)
  • Sample spike level: “Level” (categorical)
  • Device component lot ID: “Lot” (categorical)
  • Measured concentration: “y” (continuous)

Each categorical variable has 10 levels. Two to four sites and component lots are used at each spike level, and there may be interaction between site and spike level.

An original analysis of the data simply mean-centered the data by the mean of each “Site” and calculated the variance of all this mean-centered data lumped together. This obviously fails to account for explainable variability due to the aforementioned predictors and, based on my understanding, results in an overly optimistic estimate of the standard error of the measurement.

I’ve done model comparison between several new models using brms :

  1. y ~ Level
  2. y ~ Level + (1|Site)
  3. y ~ Level + (1|Site) + (1|Lot)
  4. y ~ Level + (Level|Site) (1|Lot) (I ran into divergence issues and low ESS here, so omitted model 4 from LOO comparisons).

The loo_compare() output is:

elpd_diff se_diff
fit_list[[2]]  0.0       0.0   
fit_list[[3]] -0.8       0.9   
fit_list[[1]] -8.3       3.2 

and from loo_model_weights():

Method: stacking
fit_list[[1]] 0.000 
fit_list[[2]] 1.000 
fit_list[[3]] 0.000 

Predicted mean responses for models 1 and 3 are shown in the two figures below. The prediction at each level is broader for Model 3 vs Model 1.

Model 1

Model 1

Model 3

Model 3

However, the main quantity of interest is the measurement uncertainty for an arbitrary Level, Site, and Lot. For this, I generated predictions for each model using the following specification:

newdata = data.frame(Level = 'zero__', Site='NewSite', Lot='NewLot')
brms::fitted(mdl, newdata, allow_new_levels = TRUE, sample_new_levels = 'uncertainty')

which yields

Mean predictions for new data

My question is, which model best represents measurement uncertainty of the device? The distributions for Models 2, 3, and 4 are highly skewed (and that for model 4 is oddly bimodal) and much broader than the naive approach and Model 1, which I am sure will raise questions with my stakeholders. Any insight is much appreciated!

It seems model 2 is better when the future Level, Site, and Lot are randomly drawn from the sampling population. Do you expect a targeted prediction that might be different from your training sample?

I suppose my issue is really one of figuring out how to convince stakeholders than Model 2 is better than the others, which means I need to better understand how the LOO metrics are actually calculated.

The LOO metrics clearly place Model 2 at the top, but it’s prediction uncertainty is much larger than Model 1, which (from a naive perspective) seems counter-intuitive at first glance. Are the LOO scores telling us that Model 2 is better at estimating the entire distribution of the data (given some sort of distance metric), or just the expected value? I know of several papers and long youtube lectures that discuss LOO in a Bayesian context, but is there a more accessible definition I could use to get the point across more easily to lay people?

Yes, I think the default used LOO log score examines the whole predictive distribution rather than a point estimation. A wider predictive interval suggests the data has larger uncertainty by nature.

Here’s a vignette for stacking which I like:

Aki has a lot of stuff on his website: (also a shorter thing on loo here:

This might be the place to start, though there’s a lot of stuff to dig through.

If you fit a line to something that isn’t a line, the predictions from that line might be too certain because model mispecification. In this case you’d prefer the wider predictions, cause those are closer to the reality.

1 Like

Have a look at the stacking paper:

I think once you get through section 3.1 you have the definitions.

If you need to communicate this is sort of a handwavy way, I think start with the source. If I tried to write out a handwavy explanation, I’d probably say something wrong and make it more confusing than it needs to be.

1 Like