Hi everyone,
I would highly appreciate feedback for my model. Do you think there are better ways to model this in stan?
We measured our predictor hc_obs, which we want to use to predict y. hc_obs is an attempt to measure hc_true, which unfortunately does not work without measurement error. Subjects who have higher hw will have reduced reduced hc_obs as a technical artifact while when hw == 0, there is no measurement error (no person has hw == 0). See the DAG corresponding to our data (grey variables are unobserved):
I thought just including hw in a regression like this will be sufficient:
y \sim bmi + hc_obs + hw
This way the model would answer: Keeping hw and bmi constant, what effect would increasing hc_obs have on y. Therefore, the error is controlled for. I tried to simulate this in R:
library(tidyverse)
library(brms)
N <- 1e4
bmi <- rnorm(n = N, mean = 0, sd = 2)
hc <- rnorm(mean = 0, sd = 2, n = N)
hw <- rnorm(n = N, mean = 0, sd = 1)
bHW <- -0.5
bHC <- -1
bBMI <- 0.5
hc_obs <- rnorm(mean = hc + bHW * hw, sd = 1, n = N)
y <- rnorm(n = N, mean = bHC * hc + bBMI * bmi, sd = 1)
tb <- tibble(bmi, hc, hw, hc_obs, y)
m1 <- brm(
y ~ bmi + hc_obs + hw,
data = tb
)
summary(m1)
I have an unrealistically high sample size here but it illustrates that this model cannot restore the correct effect estimates in the simulation example. It does not really make a difference whether I include hw or not. The estimates are off in a similar way.
How would you model this differently (I would also welcome stan code). Given my DAG, I am thinking of some kind of measurement error model but all my attempts so far did not work out. For example I tried the following stan model:
data{
int<lower=1> N_subj;
vector[N_subj] y;
vector[N_subj] bmi;
vector[N_subj] hw;
vector[N_subj] hc_obs;
}
parameters{
real a_y;
real bBMI;
real<lower=0> sigma_y;
real bHC;
real a_hc_obs;
real bHW;
real<lower=0> sigma_hc_obs;
}
model{
vector[N_subj] mu_hc_obs;
sigma_hc_obs ~ exponential(1);
bHW ~ normal(0, 1);
a_hc_obs ~ normal(0, 1);
for (i in 1:N_subj) {
mu_hc_obs[i] = a_hc_obs + hw[i] * bHW;
}
hc_obs ~ normal(mu_hc_obs ,sigma_hc_obs);
vector[N_subj] mu_y;
sigma_y ~ exponential(1);
bBMI ~ normal(0, 1);
a_y ~ normal(0, 1);
for (i in 1:N_subj) {
mu_y[i] = a_y + bmi[i] * bBMI + (hc_obs[i] + bHW * hw[i]) * bHC;
}
y ~ normal(mu_y ,sigma_y);
}
generated quantities{
vector[N_subj] log_lik;
vector[N_subj] mu_y;
for (i in 1:N_subj) {
mu_y[i] = a_y + bmi[i] * bBMI + (hc_obs[i] + bHW * hw[i]) * bHC;
}
for (i in 1:N_subj) log_lik[i] = normal_lpdf(y[i] | mu_y[i], sigma_y);
}
That model is similarly off regarding my main predictor of interest. The estimate for bHW is completely off there. I tried also a model where I specificy
for (i in 1:N_subj) {
sigma_hc_obs[i] = hw[i] * bHW;
}
hc_obs ~ normal(mu_hc_true, sigma_hc_obs);
My attempt here was that for each individual I get an estimate for hc_true.
Any help how to model this would be greatly appreciated. Also feel free to point me to educational resources where I can learn to model this myself. In Statistical Rethinking, measurement error was covered but not for my specific case where I actually have variable(s) available that can explain this error.
Thanks,
Carlos