Understanding measurement error models with fake data (when the error is quadratically related to the measure)

I’m trying to understand the impact of measurement error on inferred parameters (reproducible example below).
In a study we are building a multilevel model of the performance of parents, and another of the performance of their children. The two models are separate because the tasks are not exactly the same and we need slightly different model structures.
We then extract the individual-level estimates of performance on a log-odds scale (the models have a binomial likelihood) and produce summary stats (mean and sd) per participant.
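
For concreteness, here is a minimal sketch of that extraction step (the fit m_parent, the grouping variable ID, and the column names are assumptions, not the actual study code):

# Hypothetical sketch: per-participant posterior mean and sd on the
# log-odds scale from a multilevel binomial brms fit with a (1 | ID) term.
# coef() returns posterior summaries of the participant-level intercepts.
est <- coef(m_parent)$ID[, , "Intercept"]
d_parent <- tibble(
  ID = rownames(est),
  ParentMean = est[, "Estimate"],  # posterior mean per participant
  ParentSD = est[, "Est.Error"]    # posterior sd per participant
)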

We are interested in how the performance of child and parent relate, so we build a Gaussian model where the performance of the child is predicted by that of the parent, adding measurement error (the sd of the individual-level posterior samples) to both outcome and predictor. With measurement error only in the outcome there are no issues and we see reasonable coefficients (about 0.3). With measurement error only in the predictor there are no issues either and we see roughly the same coefficients. However, when we include both, things go crazy: the model infers coefficients of about 2 (vs. 0.3 before) and is quite confident about that.

Digging into the data, I can see that a possible issue is that the sd of the posterior samples is quadratically related to the mean (a beautiful parabola!). The further a participant is from the population mean, the bigger the uncertainty in that participant's estimate. This is not due to the scale: it also happens if I convert the samples to probabilities (0-1) before summarizing them.
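
A quick way to quantify that relation, reusing the ParentMean/ParentSD summaries (just a sketch of the check, not part of the model):

# Fit a quadratic curve to the per-participant sd as a function of the mean
quad_fit <- lm(ParentSD ~ ParentMean + I(ParentMean^2))
summary(quad_fit)  # check the coefficient on the squared term
plot(ParentMean, ParentSD)
curve(predict(quad_fit, data.frame(ParentMean = x)), add = TRUE)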

So:

  • what might be causing the strong quadratic relation between mean and sd? Is it the partial pooling? Participants further away are dragged towards the population mean, but their posterior estimates are stretched by the dragging? The priors on the by-participant variance are broad enough that regularization does not seem huge.

  • how should I deal with that sort of measurement error in the model?

Reproducible example (of the issues emerging from using measurement error):

pacman::p_load(tidyverse, MASS, brms)

# Simulating the input variables 
mu <- rep(0, 2)
Sigma <- matrix(.3, nrow = 2, ncol = 2) + diag(2) * .7  # unit variances, correlation .3
rawvars <- mvrnorm(n = 500, mu = mu, Sigma = Sigma)

ParentMean <- rawvars[,1]
ChildMean <- rawvars[,2]

# SDs depend quadratically on the means, mimicking the pattern in the real data
ParentSD <- 0.79 + 0.06 * ParentMean + 0.04 * ParentMean^2 + rnorm(length(ParentMean), 0, 0.1)
ChildSD <- 0.54 + 0.04 * ChildMean + 0.03 * ChildMean^2 + rnorm(length(ChildMean), 0, 0.05)

# Sanity check plots
plot(ParentMean, ChildMean)
plot(ParentMean, ParentSD)
plot(ChildMean, ChildSD)

d <- tibble(
  ParentMean, ParentSD, ChildMean, ChildSD
)


f0 <- bf(ChildMean ~ ParentMean)                             # no measurement error
f1_out <- bf(ChildMean | se(ChildSD) ~ ParentMean)           # error in the outcome only
f1_pred <- bf(ChildMean ~ me(ParentMean, ParentSD))          # error in the predictor only
f2 <- bf(ChildMean | se(ChildSD) ~ me(ParentMean, ParentSD)) # error in both
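
One detail worth flagging here: by default se() fixes the residual standard deviation sigma to zero, which is why p1_out and p2 below contain no sigma prior. brms also offers a variant that keeps a free residual sd on top of the known measurement error, if you want it:

# Variant of f2 that estimates a residual sigma in addition to the
# known per-observation error sds
f2_sigma <- bf(ChildMean | se(ChildSD, sigma = TRUE) ~ me(ParentMean, ParentSD))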

p0 <- c(
  prior(normal(0, 1), class=Intercept),
  prior(normal(0, 1), class=b),
  prior(normal(1, 0.5), class=sigma)
)

p1_out <- c(
  prior(normal(0, 1), class=Intercept),
  prior(normal(0, 1), class=b)
)

p1_pred <- c(
  prior(normal(0, 1), class=Intercept),
  prior(normal(0, 1), class=b),
  prior(normal(1, 0.5), class=sigma)
)

p2 <- c(
  prior(normal(0, 1), class=Intercept),
  prior(normal(0, 1), class=b)
)

m0 <- brm(
  f0,
  data = d,
  family = gaussian,
  prior = p0,
  sample_prior = TRUE,
  chains = 2,
  cores = 2,
  save_mevars = TRUE
)

m1_out <- brm(
  f1_out,
  data = d,
  family = gaussian,
  prior = p1_out,
  sample_prior = TRUE,
  chains = 2,
  cores = 2,
  save_mevars = TRUE
)

m1_pred <- brm(
  f1_pred,
  data = d,
  family = gaussian,
  prior = p1_pred,
  sample_prior = TRUE,
  chains = 2,
  cores = 2,
  save_mevars = TRUE
)

m2 <- brm(
  f2,
  data = d,
  family = gaussian,
  prior = p2,
  sample_prior = TRUE,
  chains = 2,
  cores = 2,
  save_mevars = TRUE
)

This gives the following betas (estimate, est. error) for the relation between child and parent:

  • m0: 0.29, 0.05
  • m1_out: 0.25, 0.03
  • m1_pred: 1.67, 0.38
  • m2: 2.20, 0.27
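
For reference, a quick way to pull these population-level summaries from the fits:

# Estimate, Est.Error and credible intervals of the regression coefficients
fixef(m0)
fixef(m1_out)
fixef(m1_pred)
fixef(m2)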

You should be able to do inference on the correlation of interest using all the raw data in a single model (rather than the three you describe), even when the data for the children and the parents require different model structures. See here, in particular the “(… | p | id)” syntax, which lets you link quantities that you want to correlate.
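
For example, a rough sketch with made-up column names (assuming both responses can be stacked into one data frame): the shared p in the two grouping terms tells brms to estimate the correlation between the parent and child person-level intercepts.

# Hypothetical sketch: one joint binomial model for parents and children,
# with correlated person-level intercepts linked via the shared |p| id
f_parent <- bf(ParentCorrect | trials(NTrials) ~ 1 + (1 | p | FamilyID))
f_child <- bf(ChildCorrect | trials(NTrials) ~ 1 + (1 | p | FamilyID))
m_joint <- brm(f_parent + f_child, data = d_raw, family = binomial)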

Thanks. I have some modeling in progress along those lines, but as far as I can see, to use two different datasets I would have to go straight to Stan, right?

@paul.buerkner can one use multiple datasets as input to brms to achieve this inference scenario?

I am not sure this is possible in this way. One could use multiple data subsets via the resp_subset addition term, and one could pool multiple datasets via brm_multiple, but I think neither of these is what you want, as far as I understand.


Yes, exactly :-)


So, resp_subset() seems to be the solution. But can I combine two addition terms, | trials() and | resp_subset(), in the same formula?

Well, apparently I can. Now diagnosing to see whether that messes anything up.

Yes, as y | trials() + subset() ~ …
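
i.e., something like this sketch (all column names made up; IsParent/IsChild are logical indicator columns in a stacked data frame d_all holding both datasets):

# Hypothetical sketch: each submodel only uses its own rows via subset(),
# while the shared |p| id still links the person-level intercepts
f_parent <- bf(ParentCorrect | trials(NTrials) + subset(IsParent) ~ 1 + (1 | p | FamilyID))
f_child <- bf(ChildCorrect | trials(NTrials) + subset(IsChild) ~ 1 + (1 | p | FamilyID))
m_joint <- brm(f_parent + f_child, data = d_all, family = binomial)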

awesome, thanks!