Incorporating "latent" variable data into multiple models

I have data and functional forms for the following relationships (using brms syntax):

Model 1: y ~ x + (1 + x|Group)
Model 2: z ~ x + (1|Group)
Model 3: y ~ z + I(z^2) + (1 + z|Group)

My objective is to predict y from x; however, I have very few paired (x, y) observations and need to predict outside the range of the available x values. I do have more (x, z) and (z, y) pairs, and these cover the required range. So far I have been using Model 2 to predict z over a range of x, then plugging zhat into Model 3 to get the corresponding range of yhat. However, this introduces a lot of model error (far more than would be expected from the true y = f(x) relationship).

How can I define a brms or Stan model that incorporates all the models and all the data at once?

Model 1 prediction (y vs. x):
image

Model 2 prediction (z vs x):
image

Model 3 prediction (y vs z):
image

Model 2 → 3 prediction:
image

Thank you!

Hi :)

I think missing-value imputation is the way to go:

Handle Missing Values with brms • brms (paul-buerkner.github.io)

It would lead to something looking like

bf(y ~ mi(x)) + 
bf(x | mi() ~ z)

I am not sure, however, whether the random-effect terms will accept mi().

Hope that helps!
Lucas

I suggest a bit of caution here, because Model 1 says that y should be linear in x, but Models 2 and 3 together imply that y should be quadratic in x. I think your inference will be prone to doing weird things if you insist on a posterior that incorporates all of these models at once.
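
To make that concrete, here is a rough sketch that ignores the group-level terms. The coefficient names (a_0, a_1, b_0, b_1, b_2) and the residual variance sigma_z^2 are labels introduced purely for illustration; they are not taken from the fitted models. Suppose the population-level parts of Models 2 and 3 are

$$
z = a_0 + a_1 x + \varepsilon_z, \qquad y = b_0 + b_1 z + b_2 z^2 + \varepsilon_y,
$$

where both error terms have mean zero, are independent of x and of each other, and \operatorname{Var}(\varepsilon_z) = \sigma_z^2. Substituting the first equation into the second and taking the expectation given x yields

$$
\begin{aligned}
E[y \mid x] &= b_0 + b_1 (a_0 + a_1 x) + b_2 \left[ (a_0 + a_1 x)^2 + \sigma_z^2 \right] \\
            &= \left( b_0 + b_1 a_0 + b_2 a_0^2 + b_2 \sigma_z^2 \right) + \left( b_1 a_1 + 2 b_2 a_0 a_1 \right) x + b_2 a_1^2 \, x^2 .
\end{aligned}
$$

Under these assumptions the implied relationship is linear in x, as Model 1 requires, only if b_2 a_1^2 = 0. The constant b_2 \sigma_z^2 is also worth noting: it is the piece that gets dropped when a point prediction zhat is plugged into Model 3 instead of propagating the uncertainty in z.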

If z is noisy or less tightly causally linked to at least one of {x, y} than the causal link between x and y, then it might well be the case that predicting x → z → y yields noisy predictions, and if you need this pathway to inform the location of the x → y relationship over big parts of the domain, then a noisy/uncertain answer might be the right one.

Thanks for the advice so far. The causal DAG is z → x and z → y. For some more context, x and y are just two different ways of measuring the value of z, which will be unknown in practice (outside of the experimental data I have here). Basically, we need a calibration equation that predicts y from x so we can adjust historical data for which we only have x and did not measure y.