Mismatched sampling rates between predictors & response plus measurement errors on categories


I’m unsure how to best model data from a widget manufacturing process with measurement “uncertainties” on categorical variables (relative to an ordered indexing variable) and an overall sparsity of measurements.

The widgets comprise several subcomponents that are put together on an assembly line. Subcomponents are produced elsewhere in batches, and the batches are fed into the assembly line more or less sequentially (i.e., as one batch of subcomponents is consumed, the next batch is added, but for some subcomponents there will be mixing of batches at this transition point).

The measurement “uncertainties” stem from the fact that we don’t know exactly which widgets contain which batch of each subcomponent - the best we can say is, e.g., that widgets 1 to ~100 contain subcomponent batch A, while ~101 to 200 contain batch B.

The sparse measurement aspect stems from the fact that dimensional measurements are made on all widgets by vision systems as they are assembled, but only every ~n^{th} widget is destructively tested for quality once the entire batch of widgets is completed. The test is simply pass/fail. A problem here is that it’s currently impossible to know exactly what the dimensional measurements are for a given tested widget because there is no one-to-one alignment of data - we just know approximately where the tested widget falls in the assembly sequence (i.e., the approximate index).

We know that widget quality can vary with differences between subcomponent batches, as well as with build order due to process drift. The dimensional variables measured on each widget are also known to impact quality.


The objective is to predict widget failure probability as a function of build order, subcomponent batches, and measurement data across the entire widget lot. My approach so far has been to model this as a Bayesian logistic mixed effects model in R using brms:

failrate ~ s(BuildOrder) + s(DimensionA) + s(DimensionB) + (1|SubcomponentA) + (1|SubcomponentB)

Here I’m using a dataset that is just the size of the destructively-tested sample and assumes complete knowledge of the properties of each tested widget. My question is how to best model the uncertainty around where the tested widgets fall in the build order, and by extension how to estimate whether a widget contains subcomponent lot A or B and what its dimensional measurements are. How would one define priors around these uncertainties?

So this is the case where, if you have 1000 widgets built in order, you know that the failed widget is roughly the 175th?

As opposed to the case where you have one widget with 10 build steps, and you know the widget failed a test at the 2nd or 3rd step?

Correct, it’s the first case (1000 widgets built in order). The best we can say is that the widget is somewhere around the 175th with some probability.

Sorry for the delay getting back. This sounds like a measurement error model.

If you were dealing with the regression:

$$ y \sim N(a x + b, \sigma) $$

Now we’re saying we didn’t measure x exactly but we have a vague idea what it would be, so now we can do something like:

$$
\begin{aligned}
x &\sim N(\mu, \tau) \quad \text{(or some other sort of prior)} \\
x_\text{measured} &\sim N(x, \sigma_\text{measured}) \\
y &\sim N(a x + b, \sigma)
\end{aligned}
$$

Where x is now a parameter that we’re gonna try to infer and $x_\text{measured}$ is the value we actually measured.

There’s a section in the manual here on it: https://mc-stan.org/docs/2_21/stan-users-guide/bayesian-measurement-error-model.html
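For the widget problem, the same structure might look roughly like the following. This is only a sketch under assumptions: $t_i$ (the true build-order position of tested widget $i$), $t_i^\text{recorded}$ (the approximate position you wrote down), $\sigma_t$ (how far off the recorded position could be), $N$ (the lot size), $\alpha$ and $f$ (an intercept and a smooth drift term) are all names I’m making up here, and the uniform prior is just one possible choice.

$$
\begin{aligned}
t_i &\sim \text{Uniform}(1, N) \\
t_i^\text{recorded} &\sim N(t_i, \sigma_t) \\
y_i &\sim \text{Bernoulli}\left(\text{logit}^{-1}\left(\alpha + f(t_i)\right)\right)
\end{aligned}
$$

Since the dimensional measurements are recorded for every widget in build order, the dimensions and the subcomponent batch of a tested widget would then follow from $t_i$, so in principle the only prior you need to think hard about is the one on position. Note this treats $t_i$ as continuous. Stan can’t sample discrete parameters, so a truly integer index would have to be marginalized out instead.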

Now the question is how to do this in brms I guess.

There’s a mechanism for measurement errors (?brms::me), but it doesn’t look like you can feed those into splines (the term s(me(x)) doesn’t work).

I think the issue is that the knots or whatnot in the splines depend on the x values (similarly with Gaussian processes). I think you’d need to switch to a basis function representation for either the splines or the GPs to make this technically possible.

So doing this would probably be pretty difficult technically.
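To show what I mean by a basis representation, here is a tiny Python sketch (not brms, and not what `s()` does internally). It assumes a truncated-linear basis with knots fixed in advance; the knot locations, coefficients, and function names are all hypothetical. The point is only that with fixed knots the smooth is a plain function of x, so it can be re-evaluated whenever a latent x changes.

```python
def linear_spline_basis(x, knots):
    """Truncated-linear basis: [x, max(x - k1, 0), max(x - k2, 0), ...]."""
    return [x] + [max(x - k, 0.0) for k in knots]


def smooth_value(x, knots, coefs, intercept=0.0):
    """Evaluate the smooth at x as intercept + sum(coef * basis function)."""
    basis = linear_spline_basis(x, knots)
    return intercept + sum(c * b for c, b in zip(coefs, basis))


# Hypothetical values: knots fixed on a 0-1 build-order scale, coefficients made up.
KNOTS = [0.25, 0.5, 0.75]
COEFS = [1.0, -2.0, 0.5, 3.0]

if __name__ == "__main__":
    # The same knots and coefficients work for the recorded x and for any
    # candidate value of the latent (true) x.
    for x in (0.17, 0.175, 0.18):
        print(x, smooth_value(x, KNOTS, COEFS))
```

In a Stan model the equivalent function would be applied to the latent x inside the model block, with the coefficients as parameters.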

Maybe before you figure out how to do that, try to use simulated data to figure out if it’s really necessary to do this more complicated model.

Like, simulate data from the big complicated thing (where your measurements of where the devices are failing aren’t so precise), then fit the simple model assuming that your measurements aren’t actually noisy, and see how big the bias is compared to the uncertainty in your model.

Maybe you can use a preliminary fit to real data to generate your fake data so you’re kinda operating in the parameter space you want to operate in.
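Here’s a rough sketch of that fake-data check, in plain Python so it runs anywhere (you’d do the real thing in R with brms). Everything in it is an assumption for illustration: build order is rescaled to 0-1, failure depends only on a linear drift with made-up coefficients, the recorded position is the true one plus Gaussian noise, and the “simple model” is an ordinary maximum-likelihood logistic regression rather than your full Bayesian model.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, iters=25):
    """Maximum-likelihood fit of y ~ Bernoulli(sigmoid(b0 + b1 * x)) by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p          # gradient wrt intercept
            g1 += (yi - p) * xi   # gradient wrt slope
            h00 += w              # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def simulate_and_compare(n_tested=5000, index_sd=0.2, seed=1):
    """Simulate failures from true positions, then fit with true vs. noisy positions."""
    rng = random.Random(seed)
    true_b0, true_b1 = -3.0, 3.0  # made-up drift: failure gets more likely late in the lot
    x_true = [rng.random() for _ in range(n_tested)]
    y = [1 if rng.random() < sigmoid(true_b0 + true_b1 * x) else 0 for x in x_true]
    # What you actually have: an approximate position for each tested widget.
    x_recorded = [x + rng.gauss(0.0, index_sd) for x in x_true]
    return fit_logistic(x_true, y), fit_logistic(x_recorded, y)


if __name__ == "__main__":
    fit_exact, fit_noisy = simulate_and_compare()
    print("slope using true positions:    ", fit_exact[1])
    print("slope using recorded positions:", fit_noisy[1])
```

If the two slopes differ by much less than the posterior uncertainty you get from the real fit, the simple model is probably fine. Try it with `index_sd` set to whatever you honestly believe the position uncertainty is; the 0.2 default is deliberately large so the effect is visible.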

The thing you don’t want to do is pretend that this is the case:

$$ x \sim N(x_\text{measured}, \sigma_\text{measured}) $$

And impute values of x from this and then fit your model:

$$ y \sim N(a x + b, \sigma) $$

a bunch of times and try to mix together the posteriors or anything. That seems like the simple thing to try to see if the true measurement error model is worth pursuing, but it’s not clear what it gets you. The way to figure out if you need the measurement error model is the fake data thing. (I initially thought imputing here would be good, but I asked Gelman and he said no, that’s wrong: do the full model if necessary, and use fake data to figure out if you need the full model.)

If it turns out that you need the measurement error model, then maybe make a new question and ask specifically about how to do measurement error models with splines in brms. And if that’s not possible, you’ll have to write your own Stan code.

I like the simulation idea - thanks!