I’m trying to understand the impact of measurement error on inferred parameters (reproducible example below).
In a study we are building a multilevel model of the performance of parents, and a separate one of the performance of their children. The two models are kept separate because the tasks are not exactly the same, so they need slightly different structures.
We then extract the individual-level estimates of performance on the log-odds scale (the models have a binomial likelihood) and produce summary stats (mean and sd) for each participant.
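For concreteness, this is roughly how the summaries are produced (a minimal sketch; the model, data, and variable names m_parent, parent_data, Correct, nTrials, ID are placeholders, not our actual code):

# Hypothetical parent model; the child model is analogous but slightly different
m_parent <- brm(
  Correct | trials(nTrials) ~ 1 + (1 | ID),
  data = parent_data,
  family = binomial
)
# Per-participant summaries (posterior mean and sd) on the log-odds scale
re <- coef(m_parent)$ID[, , "Intercept"]
ParentMean <- re[, "Estimate"]
ParentSD <- re[, "Est.Error"]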
We are interested in how the performance of child and parent relate, so we build a gaussian model where the performance of the child is predicted by that of the parent, and we add measurement error (the sd of the individual-level posterior samples) to both outcome and predictor. With measurement error only in the outcome there are no issues and we see reasonable coefficients (about 0.3). With measurement error only in the predictor there are also no issues and we see roughly the same coefficients. However, when we include both, things go crazy: the model infers coefficients of about 2 (vs. 0.3 before) and is quite confident about that.
Digging into the data, I can see that a possible issue is that the sd of the posterior samples is quadratically related to the mean (a beautiful parabola!). The further away a participant is from the population mean, the bigger the uncertainty in that participant's estimate. This is not due to the scale, as it also happens if I convert the samples to the probability scale (0-1) before summarizing them.
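A quick way to see the pattern (a sketch, using ParentMean and ParentSD as in the simulation below):

# Regress the posterior sd on the posterior mean; a strong quadratic term reproduces the parabola
quad_fit <- lm(ParentSD ~ ParentMean + I(ParentMean^2))
summary(quad_fit)
plot(ParentMean, ParentSD)
lines(sort(ParentMean), fitted(quad_fit)[order(ParentMean)])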
So:
- What might be causing the strong quadratic correlation between mean and sd? Is it the partial pooling? Participants further from the population mean are dragged towards it, but their posterior estimates get stretched by the dragging? The priors on the by-participant variance are broad enough that regularization does not seem huge.
- How should I deal with that sort of measurement error in the model?
Reproducible example (of the issues emerging from using measurement error):
pacman::p_load(tidyverse, MASS, brms)
set.seed(42)  # arbitrary seed, fixed so the simulation is reproducible
# Simulating the input variables: parent and child means correlated at .3
mu <- rep(0, 2)
Sigma <- matrix(.3, nrow = 2, ncol = 2) + diag(2) * .7
rawvars <- mvrnorm(n = 500, mu = mu, Sigma = Sigma)
ParentMean <- rawvars[, 1]
ChildMean <- rawvars[, 2]
# SDs grow quadratically with distance from the population mean, mimicking the real posterior summaries
ParentSD <- 0.79 + 0.06 * ParentMean + 0.04 * ParentMean^2 + rnorm(length(ParentMean), 0, 0.1)
ChildSD <- 0.54 + 0.04 * ChildMean + 0.03 * ChildMean^2 + rnorm(length(ChildMean), 0, 0.05)
# Sanity check plots
plot(ParentMean, ChildMean)
plot(ParentMean, ParentSD)
plot(ChildMean, ChildSD)
d <- tibble(
  ParentMean, ParentSD, ChildMean, ChildSD
)
f0 <- bf(ChildMean ~ ParentMean)                              # no measurement error
f1_out <- bf(ChildMean | se(ChildSD) ~ ParentMean)            # error in the outcome only
f1_pred <- bf(ChildMean ~ me(ParentMean, ParentSD))           # error in the predictor only
f2 <- bf(ChildMean | se(ChildSD) ~ me(ParentMean, ParentSD))  # error in both
p0 <- c(
  prior(normal(0, 1), class = Intercept),
  prior(normal(0, 1), class = b),
  prior(normal(1, 0.5), class = sigma)
)
# No sigma prior for the se() models: se() fixes the residual sd to the known SEs
p1_out <- c(
  prior(normal(0, 1), class = Intercept),
  prior(normal(0, 1), class = b)
)
p1_pred <- c(
  prior(normal(0, 1), class = Intercept),
  prior(normal(0, 1), class = b),
  prior(normal(1, 0.5), class = sigma)
)
p2 <- c(
  prior(normal(0, 1), class = Intercept),
  prior(normal(0, 1), class = b)
)
m0 <- brm(
f0,
d,
family=gaussian,
prior=p0,
sample_prior=T,
chains = 2,
cores = 2,
save_mevars = T
)
m1_out <- brm(
f1_out,
d,
family=gaussian,
prior=p1_out,
sample_prior=T,
chains = 2,
cores = 2,
save_mevars = T
)
m1_pred <- brm(
f1_pred,
d,
family=gaussian,
prior=p1_pred,
sample_prior=T,
chains = 2,
cores = 2,
save_mevars = T
)
m2 <- brm(
f2,
d,
family=gaussian,
prior=p2,
sample_prior=T,
chains = 2,
cores = 2,
save_mevars = T
)
This gives the following betas (estimate, est. error) for the relation between child and parent:
- m0: 0.29, 0.05
- m1_out: 0.25, 0.03
- m1_pred: 1.67, 0.38
- m2: 2.20, 0.27
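(In case it helps to reproduce the comparison, the betas can be read off the fitted objects with fixef():)

# Population-level estimates; the ParentMean / me() slope is the beta of interest
fixef(m0)
fixef(m1_out)
fixef(m1_pred)
fixef(m2)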