I am trying to solve a problem in which there is only a fuzzy correspondence between observations of an independent variable x and observations of the dependent variable y. The scenario is from a manufacturing process where continuous measurements x are made on a component of a widget, whose performance is then tested (the test outcome is the continuous y). The objective is to develop a model that explains how much variation in the widget test value y is attributable to variation in the component measurement x.
The component is manufactured in lots, where each lot contains six to eight sublots. Each component lot is consumed in the production of one widget lot, but it is unknown which individual component goes into which widget, or even which component sublot goes into which portion of the final widget lot.
A crude approach would be to summarize x on a lot-by-lot basis (e.g., means, SDs), then use these summary statistics as the “observations” that have a one-to-one correspondence with values of y. In the design matrix, this means x takes a constant value for all y within each widget lot. With a distributional model, the widget response mean could be a function of the component lot mean, something like y \sim \mu_x + (1|Lot), and the response variance a function of the component lot variance, Var(y) \sim \sigma_x^2 + (1|Lot).
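As a minimal sketch of this lot-summary approach, something like the following could be used. It runs on simulated data: the number of lots, the lot sizes, and every parameter value are hypothetical and chosen only for illustration, and plain `lm()` stands in for a proper distributional model.

```r
# Minimal sketch: collapse each lot to summary statistics, then regress.
# All values below are hypothetical and chosen only for illustration.
set.seed(1)
n_lots  <- 20    # number of lots (assumed)
n_chips <- 240   # components per lot (assumed)

lot_mu <- rnorm(n_lots, mean = 10, sd = 3)   # lot-level component means
lot_sd <- runif(n_lots, min = 3, max = 7)    # lot-level component SDs

lot_summary <- do.call(rbind, lapply(seq_len(n_lots), function(j) {
  x <- rnorm(n_chips, mean = lot_mu[j], sd = lot_sd[j])  # component measurements
  y <- 10 + 3 * x + rnorm(n_chips, sd = 5)               # widget test values
  data.frame(lot = j,
             x_mean = mean(x), x_sd = sd(x),
             y_mean = mean(y), y_sd = sd(y))
}))

# One row per lot: x is constant within a lot, so only the lot-level
# summaries enter the two regressions.
fit_mean <- lm(y_mean ~ x_mean, data = lot_summary)  # response mean vs. component mean
fit_sd   <- lm(y_sd ~ x_sd, data = lot_summary)      # response spread vs. component spread
coef(fit_mean)
coef(fit_sd)
```

None of these summaries use the pairing of individual x and y values, so they can be computed from the manufacturing data as it stands.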
I would like to incorporate additional information into the analysis, however: from previous designed experiments we have obtained a causal model relating x to y, though it is marred by large uncertainties due to small sample size. First-principles theory additionally indicates that x and y are positively correlated, although unobserved factors interact with x such that the correlation with y is not 1.
How would I go about building a proper model for this?
I had a little bit of trouble following your description, but the first step in the workflow is to try to write code to generate data akin to how your real data are structured. There are probably parameters on which you ultimately want to perform inference, but for this first step you can just choose reasonable values that generate reasonable looking data. Then post back here and I’ll advise on translating to inference.
Thanks for reminding me about the principled workflow! Here is a representation of the data generating process in R code for a single component lot with six sublots, and the corresponding widget test responses:
# Simulate component lot
process_mu = 10       # overall process mean
sublot_sd = 5         # within-sublot SD
btwn_sublot_sd = 3    # between-sublot SD
# Get sublot means
n_sublots = 6
sublot_mu = rnorm(n_sublots, mean = process_mu, sd = btwn_sublot_sd)
# Get sublot samples (n_chips x n_sublots matrix, one column per sublot)
# Sigma is a covariance matrix, so the SD is squared
n_chips = 40
chip_rvs = MASS::mvrnorm(n_chips, mu = sublot_mu, Sigma = diag(n_sublots) * sublot_sd^2)
# Generate widget test value
b0 = 10
b1 = 3
b2 = 0.1
error_sd = 5          # residual SD
test_rvs = b0 + b1 * chip_rvs + b2 * chip_rvs^2 +
  MASS::mvrnorm(n_chips, mu = rep(0, n_sublots), Sigma = diag(n_sublots) * error_sd^2)
This yields data similar to the following figures, which show two lots, each comprising six sublots. “Thickness” is the independent variable x and “Response” is the dependent variable y:
The problem is that in practice we do not know which Response value y corresponds to which Thickness value x. That is, the only information we have from manufacturing are the marginal distributions of x and y for each lot (where a lot comprises six sublots).
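To make the loss of correspondence concrete, here is a small sketch that simulates paired values for one lot and then shuffles y within the lot. The parameter values are the illustrative ones used above, pooled over sublots for brevity, and the names `y_true` and `y_obs` are just placeholders.

```r
# Sketch: what the manufacturing data look like once the pairing is lost.
# Parameter values are illustrative only.
set.seed(42)
n <- 240                                   # components in one lot (assumed)
x <- rnorm(n, mean = 10, sd = 5)           # Thickness
y_true <- 10 + 3 * x + 0.1 * x^2 + rnorm(n, sd = 5)  # Response, still paired with x

# In practice we only see y in arbitrary order within the lot
y_obs <- sample(y_true)

# The marginal distribution of y is untouched by the shuffle...
same_marginal <- isTRUE(all.equal(sort(y_true), sort(y_obs)))

# ...but the unit-level association with x is destroyed
cor_paired   <- cor(x, y_true)
cor_shuffled <- cor(x, y_obs)
c(paired = cor_paired, shuffled = cor_shuffled)
```

Any model fit to the manufacturing data therefore has to work from the lot-level marginals of x and y rather than from matched pairs.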
From independent experimental data, we have an estimate of the equation for the widget test value, so we can probabilistically estimate the Response given Thickness. However, the experiment was only conducted using one lot of widgets and so does not capture the lot-to-lot variability due to unmeasured interacting factors, resulting in a biased and overly precise estimate.
My question is how to best integrate the manufacturing and experimental data. Is this even feasible? Conducting additional experiments will be very costly and is likely not an option.
If I understand correctly, I would try to define the problem as a Bayesian error model. For each unique value of y, I would estimate a predictor based on the lot mean and standard deviation. Let i be an observation from lot j; it would look like:
Thanks @ldeschamps. I would like to model the component sublot and lot as group-level effects as well, although each Response lot is only traceable to the Component lot level. But would modeling these as group-level terms add unnecessary complexity?
Is it possible to define something like this in the brms formula syntax, @paul.buerkner? My Stan language skills are not great.