# Estimating measurement error model mean and variance from data

My question is an extension of section 6.1 in the Stan User’s Guide.

I have a model of the form y ~ x, where x is defined as x = a / b. The variable a is a lab-based measurement and b is the mass of the sample on which it’s measured. Both a and b are measured in the lab with analytical repeats. Typically, the repeats are averaged, and a is divided by b to produce x, yielding a ‘final’ data set that is used for modeling. This obviously ignores any measurement error in a and b.

What I would like to do is pass the full raw data for both a and b to Stan (a_meas and b_meas), each 3x the length of y because every sample is measured with 3 repeats. I’d like to estimate a separate parameter for each sample of a, where the parameter’s mean is informed by the three repeats and the sigma by the spread of those repeats, and do the same for b. I’d then divide the two to generate x, a vector of parameters representing the “true value” of x, accounting for this measurement error.
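In symbols (my own summary of the setup above, with i indexing the N samples and r the three repeats; the subscripted names are just notation, not from the Guide):

```latex
a_{ir} \sim \mathrm{Normal}(a_i,\ \sigma_{a,i}), \qquad
b_{ir} \sim \mathrm{Normal}(b_i,\ \sigma_{b,i}), \qquad r = 1, 2, 3
\\
x_i = a_i / b_i, \qquad
y_i \sim \mathrm{Normal}(\alpha + \beta\, x_i,\ \sigma)
```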

In the example in the User’s Guide, the following syntax was used:

``````stan
data {
  int<lower=0> N;       // number of cases
  vector[N] y;          // outcome (variate)
  real x_meas[N];       // measurement of x (co-variate)
  real<lower=0> tau;    // measurement noise
}
parameters {
  real x[N];            // unknown true value
  real mu_x;            // prior location
  real sigma_x;         // prior scale

  real alpha;           // intercept
  real beta;            // slope
  real<lower=0> sigma;  // outcome noise
}
model {
  x ~ normal(mu_x, sigma_x);  // prior
  x_meas ~ normal(x, tau);    // measurement model
  y ~ normal(alpha + beta * x, sigma);
}
``````

This example doesn’t quite fit my case, because the measurement noise tau is a single fixed value shared by all samples, and each true x is informed by only one observation, the single measured value.

I’m having trouble figuring out how to modify this to solve my problem. I’m imagining something like:

``````stan
data {
  int<lower=0> N;  // number of cases in y
  int<lower=0> J;  // number of cases in a and b, including replicates
  int<lower=0> R;  // number of replicates per sample
  vector[N] y;     // outcome (variate)
  real a_meas[J];  // measurements of a (co-variate)
  real b_meas[J];  // measurements of b (co-variate)
}
parameters {
  real a[N];                 // unknown true value for a
  real mu_a[N];              // prior location
  real<lower=0> sigma_a[N];  // prior scale

  real b[N];                 // unknown true value for b
  real mu_b[N];              // prior location
  real<lower=0> sigma_b[N];  // prior scale

  real alpha;                // intercept
  real beta;                 // slope
  real<lower=0> sigma;       // outcome noise
}
model {
  // I'm not sure how to properly specify this section so that the
  // parameters for a and b are informed by the three
  // measurements of a and b for every sample
  for (i in 1:N) {
    for (r in 1:R) {
      a ~ normal(mu_a, sigma_a);
    }
  }
  for (i in 1:N) {
    for (r in 1:R) {
      b ~ normal(mu_b, sigma_b);
    }
  }

  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  // Is this the right place to generate x,
  // which would be the parameters a[N] and b[N] divided element-wise?
  real x[N] = a ./ b;
}
``````

Thanks for any help and direction you can offer.

Hi Stephen! Could you post a small representative example data set?

Here’s some R code to generate a sample data set.

``````r
# Generate y data which are n=10
y.data <- tibble::tibble(
y=rnorm(10,5,1),
sample.id = c('sample1','sample2','sample3','sample4','sample5','sample6','sample7','sample8','sample9','sample10')
)

# Generate x data which are n=30 (10 samples x 3 repeats)
x.data <- tibble::tibble(
a=rnorm(30,20,3),
b=rnorm(30,10,2),
sample.id=c(rep('sample1',3),rep('sample2',3),rep('sample3',3),rep('sample4',3),rep('sample5',3),
rep('sample6',3),rep('sample7',3),rep('sample8',3),rep('sample9',3),rep('sample10',3))
)
``````

Hey!

I re-arranged the data a bit.

``````r
# Generate y data which are n=10
y.data <- tibble::tibble(
  y = rnorm(10, 5, 1),
  sample.id = c('sample1','sample2','sample3','sample4','sample5','sample6','sample7','sample8','sample9','sample10')
)

# Generate x data which are n=30 (10 samples x 3 repeats)
x.data <- tibble::tibble(
  a = rnorm(30, 20, 3),
  b = rnorm(30, 10, 2),
  sample.id = c(rep('sample1',3),rep('sample2',3),rep('sample3',3),rep('sample4',3),rep('sample5',3),
                rep('sample6',3),rep('sample7',3),rep('sample8',3),rep('sample9',3),rep('sample10',3))
)

dat <- list(
  N = 10,
  R = 3,
  y = y.data$y,
  a_meas = matrix(x.data$a, nrow = 10, ncol = 3, byrow = TRUE),
  b_meas = matrix(x.data$b, nrow = 10, ncol = 3, byrow = TRUE)
)

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

p <- stan(file = "test.stan", data = dat)
``````

…and this model kind of works:

``````stan
data {
  int<lower=0> N;       // number of cases in y
  int<lower=0> R;       // number of replicates per sample
  vector[N] y;          // outcome (variate)
  vector[R] a_meas[N];  // measurements of a (co-variate)
  vector[R] b_meas[N];  // measurements of b (co-variate)
}
parameters {
  vector[N] z_a;
  vector[N] mu_a;              // prior location
  vector<lower=0>[N] sigma_a;  // prior scale

  vector[N] z_b;
  vector[N] mu_b;              // prior location
  vector<lower=0>[N] sigma_b;  // prior scale

  real mm_a;
  real<lower=0> ms_a;

  real mm_b;
  real<lower=0> ms_b;

  real alpha;           // intercept
  real beta;            // slope
  real<lower=0> sigma;  // outcome noise
}
transformed parameters {
  // non-centered parameterization for a and b
  vector[N] a = mu_a + sigma_a .* z_a;  // unknown true value for a
  vector[N] b = mu_b + sigma_b .* z_b;  // unknown true value for b
  vector[N] x = a ./ b;
}
model {
  z_a ~ std_normal();
  z_b ~ std_normal();

  y ~ normal(alpha + beta * x, sigma);

  for (n in 1:N) {
    a_meas[n] ~ normal(mu_a[n], sigma_a[n]);
    b_meas[n] ~ normal(mu_b[n], sigma_b[n]);
  }

  mu_a ~ normal(0, 10);
  mu_b ~ normal(0, 10);

  sigma_a ~ exponential(1);
  sigma_b ~ exponential(1);
}
``````

But estimating a mean and standard deviation from 3 observations is a bit optimistic… so you’ll probably have to work a bit more to make this model reliable.
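One option (a sketch of mine, not something tested in this thread) is to partially pool the per-sample scales: give sigma_a and sigma_b a shared hyperprior, so each sample’s scale is shrunk toward an assay-level value instead of being estimated from only three replicates. The names sigma_a_scale and sigma_b_scale are hypothetical:

```stan
parameters {
  // ...everything from the model above, plus:
  real<lower=0> sigma_a_scale;  // assay-level scale for a (hypothetical name)
  real<lower=0> sigma_b_scale;  // assay-level scale for b (hypothetical name)
}
model {
  // replace the independent exponential(1) priors with hierarchical ones:
  sigma_a ~ normal(0, sigma_a_scale);  // half-normal, since sigma_a has <lower=0>
  sigma_b ~ normal(0, sigma_b_scale);
  sigma_a_scale ~ exponential(1);
  sigma_b_scale ~ exponential(1);
  // ...rest of the model unchanged
}
```

With only R = 3 replicates per sample, this kind of pooling mostly determines how much each sigma_a[n] can deviate from the assay-level scale.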

Hope this helps!

Cheers!
Max


Thanks so much, Max.

I re-did the sample-data generation so that y and x are actually correlated. That lets me run the regression and compare the Stan model’s output against the ‘traditional’ approach of averaging the repeats and fitting with lm.

Basically, the two approaches generate very different results. I can’t tell if that’s because of some error in the model or because of the problems you mentioned with estimating a mean/sd from three repeats.

Here’s the new way to generate the data:

``````r
# mvrnorm() comes from the MASS package
rand <- MASS::mvrnorm(n=10, mu=c(10, 20), Sigma=matrix(c(1, 0.8, 0.8, 1), nrow=2), empirical=TRUE)

# Generate y data which are n=10
y.data <- tibble::tibble(
  y = rand[,1],
  sample.id = c('sample1','sample2','sample3','sample4','sample5','sample6','sample7','sample8','sample9','sample10')
)

# Generate x data which are n=30 (10 samples x 3 repeats)
x.data <- tibble::tibble(
  b = rnorm(30, 2, 0.2),
  a_temp = rnorm(30, 0, 1),
  row = 1:30,
  sample.id = rep(c('sample1','sample2','sample3','sample4','sample5','sample6','sample7',
                    'sample8','sample9','sample10'), 3)
) %>%
  mutate(
    a = ifelse(row %in% 1:10, rep(rand[,2]*b[1:10], 3),
        ifelse(row %in% 11:20, rep(rand[,2]*b[11:20], 3),
        ifelse(row %in% 21:30, rep(rand[,2]*b[21:30], 3), 'NA')))
  ) %>%
  dplyr::select(a, b, sample.id)

# the 'NA' string above coerces a to character, so convert back
x.data$a <- as.numeric(x.data$a)
x.data$b <- as.numeric(x.data$b)
``````

It’s not pretty but it works.

I used your model code to run the Stan model in the example you gave above. Here’s the R code for running the model again just for completeness:

``````r
#### RUN STAN MODEL ####
# Format data
dat <- list(
  N = 10,
  R = 3,
  y = y.data$y,
  a_meas = matrix(x.data$a, nrow = 10, ncol = 3, byrow = TRUE),
  b_meas = matrix(x.data$b, nrow = 10, ncol = 3, byrow = TRUE)
)

# Set global model conditions
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# Run model
p <- stan(file = "test.stan", data = dat)
``````

I then used the data to run the ‘traditional’ modeling approach, which is to take the mean of all of the repeats and run the model on the mean data. That establishes a pretty clear relationship between y and x (as we’d expect from how I generated the data using mvrnorm).

``````r
#### Typical model ####
# Aggregate data
data.typical <- aggregate(. ~ sample.id, data = x.data, FUN = mean) %>%
  mutate(x = a/b) %>%
  left_join(y.data)

# Run model
lm(y ~ x, data = data.typical) %>%
  summary()
``````

Here’s the model output:

``````
Residuals:
Min       1Q   Median       3Q      Max
-0.70060 -0.54329 -0.00009  0.37858  0.99183

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  -6.0000     4.2474  -1.413  0.19547
x             0.8000     0.2121   3.771  0.00546 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6364 on 8 degrees of freedom
Multiple R-squared:   0.64,	Adjusted R-squared:  0.595
F-statistic: 14.22 on 1 and 8 DF,  p-value: 0.005456
``````

When I compare this to the alpha and beta parameters from the Stan model they’re totally different:

``````
print(p,pars=c('alpha','beta'))

Inference for Stan model: test.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha 10.17    0.32 0.51  9.44  9.83  9.96 10.86 10.98     3 2.45
beta  -0.02    0.01 0.02 -0.06 -0.03 -0.01  0.00  0.01    13 1.38

Samples were drawn using NUTS(diag_e) at Wed Feb 19 11:36:13 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
``````

Clearly the model had convergence issues, given the Rhat values, and the parameter estimates are far from the lm results. The same thing happens when you compare the averaged x data with the Stan-estimated x parameters:

``````
# Compare estimated x data with averaged
print(p,pars=c('x'))

Inference for Stan model: test.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

mean se_mean     sd    2.5%    25%    50%   75% 97.5% n_eff Rhat
x[1]   18.90    4.88   7.34    6.94  11.63  20.35 25.04 29.08     2 3.27
x[2]  -20.06   23.28  46.22 -117.59 -33.18 -12.95 15.06 15.72     4 1.50
x[3]   17.49    3.67   7.48    8.33  13.75  18.56 19.95 25.32     4 1.45
x[4]  -73.01   70.76 177.53 -535.33 -59.99 -24.27 -1.32 18.17     6 1.33
x[5]   18.04    1.03   1.56   15.00  16.98  18.34 19.23 20.22     2 3.70
x[6]   15.95    2.22   4.51    7.52  13.66  17.26 18.75 23.07     4 1.53
x[7]   19.41    2.41   3.93   13.66  14.25  20.08 21.87 27.21     3 2.26
x[8]  -41.17   43.58  88.70 -246.54 -85.63   4.51 18.65 23.47     4 1.51
x[9]  -13.88   38.04  91.75 -182.09   7.84  17.23 18.08 21.88     6 1.34
x[10] -12.72   47.19  80.33 -203.28 -12.25   9.26 15.05 87.51     3 2.23

Samples were drawn using NUTS(diag_e) at Wed Feb 19 11:36:13 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

data.typical %>%
dplyr::select(x,sample.id)

x sample.id
1  19.96092   sample1
2  20.17132  sample10
3  21.40227   sample2
4  19.21485   sample3
5  20.88467   sample4
6  20.19328   sample5
7  19.57398   sample6
8  18.36020   sample7
9  21.27412   sample8
10 18.96438   sample9
``````

Again, Rhat is a big issue there.

Do you have any thoughts on what might be driving this discrepancy? Are there tweaks to the parameterization that could improve convergence and hopefully lead the two approaches to be more similar?

I really appreciate your point that estimating a mean and sd from three repeats is problematic. Should I be thinking about this problem differently? Do you think I should focus on characterizing a ‘fixed’ uncertainty for each assay and then use the approach from the Stan User’s Guide, where each sample observation gets the same fixed error? That feels like ‘throwing away’ data: I’d be averaging the three repeats and assigning a uniform error to all samples. I know that doesn’t reflect ‘reality’, because I’d expect each sample to have a different error. My data are soil measurements, and different soil types have inherent properties that should lead to different errors for different samples.
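For concreteness, the pooled-error variant I’m describing, where each assay gets a single estimated noise scale informed by all replicates, might look something like this (just a sketch; tau_a and tau_b are names I’m making up):

```stan
// Sketch only: one measurement-noise scale per assay,
// estimated from all N*R replicates.
data {
  int<lower=0> N;
  int<lower=0> R;
  vector[N] y;
  vector[R] a_meas[N];
  vector[R] b_meas[N];
}
parameters {
  vector[N] a;          // latent true values per sample
  vector[N] b;
  real<lower=0> tau_a;  // single noise scale for the a assay
  real<lower=0> tau_b;  // single noise scale for the b assay
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  for (n in 1:N) {
    a_meas[n] ~ normal(a[n], tau_a);  // all replicates inform tau_a
    b_meas[n] ~ normal(b[n], tau_b);
  }
  tau_a ~ exponential(1);
  tau_b ~ exponential(1);
  y ~ normal(alpha + beta * (a ./ b), sigma);
}
```

But as noted above, this assumes the measurement error is the same for every sample, which may not hold for soils.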

Thanks for all of the feedback!