Propagation of Measurement Error on the outcome within mi()

Dear Stan community,

This story concerns beavers and their influence on dissolved organic carbon (DOC). The outcome is defined as: \Delta DOC. That is: the stream carbon defined as:

\Delta DOC[mg/L] = DOC_{downstream}\ \ - DOC_{upstream}

where DOC_{downstream} & DOC_{upstream} are the observed carbon concentration measurements expressed in milligrams/Liter [mg/L] , and upstream and downstream represent the sampling location with respect of the beaver pond. These two measurements have an instrument error measurement of 0.2 mg/L.
The model I want to fit is defined as:


bform <-  bf(delta_DOC | mi(se_response)~ 0 + season,
             sigma ~ 0 + season,
             nu = 3
            ) + student()

My question is: How do I estimate the se_response ? Since the outcome is \Delta DOC, defined by difference of two measurements with 0.2 mg/L error mesuraments each.

Open to listen from you! Thanks !

1 Like

I would let the model estimate the difference down- to upstream:

library(brms)
library(emmeans)

d = data.frame(DOC = rnorm(40, 10, 2),
               beaver = c("up","down"),
               site = c(rep(letters[1:10], each = 2),
                        rep(letters[1:10], each = 2)),
               season = rep(c("sp","au"), each = 20))

m = brm(bf(DOC | mi(.2) ~ 0 + season:beaver + (0 + season:beaver|site)),
          backend = "cmdstanr",
          chains = 4,
          iter = 4000,
          warmup = 2000,
          cores = 4,
          d)
summary(m)
em = emmeans(m,  ~ beaver | season)
summary(em, point = "mean", level = .95)
p = pairs(em)
summary(p, point = "mean", level = .95)
3 Likes

Dear Angelos,
Thank you for the advise. Cleaver and elegant. I have some questions:

  1. Why do you suggest to include as random effect the “site”? Are you proposing to include “site” level because the model needs to know how to pair the DOC_{upstream} and DOC_{downstream} observations?
    In fact, we sampled one in summer and one in winter for the same site. I am attaching part of the data. Please see the plot too.

  2. Following your advice, I am modelling the DOC_{upstream} and DOC_{downstream} observations using the log-normal family. This is because DOC is continuous positive outcome. So I consider this model:

\begin{align*}\color{red}{\text{DOC}_{\text{OBS}, i}} & \color{red}\sim \color{red}{\operatorname{Log-Normal}(\text{DOC}_{\text{TRUE}, i}, \text{DOC}_{\text{SE}, i})} \\\color{red}{\text{DOC}_{\text{TRUE}, i}} & \sim \operatorname{Log-Normal}(\mu_i, \sigma) \\\mu_i & = e^{\alpha_{\text{location}[season]} } \\\alpha_{\text{upstream}, [summer]} & \sim \operatorname{Normal}(0.5, 1.3) \\\alpha_{\text{upstream}[winter]} & \sim \operatorname{Normal}(0, 1.2) \\\alpha_{\text{downstream}, [summer]} & \sim \operatorname{Normal}(0.5, 1.3) \\\alpha_{\text{downstream}, [winter]} & \sim \operatorname{Normal}(0, 1.2) \\\sigma & \sim \operatorname{Student-t}(3, 0, 0.5).\end{align*}

And the code is

bf_mi <-  bf(DOC | mi(0.2)  ~ 0 + season:location) + lognormal(link = "identity")

priors <- c(  
#~~~~~~~~~~~ the average summer DOC concentration downstream ~~~~~~~~~
set_prior("normal(0.5, 1.3)", class = "b", coef = "seasonsummer:locationout" ), 
#~~~~~~~~~~~ the average summer DOC concentration upstream ~~~~~~~~~
set_prior("normal(0.5, 1.3)", class = "b", coef = "seasonsummer:locationin"), 
#~~~~~~~~~~~ the average winter DOC concentration upstream ~~~~~~~~~
set_prior("normal(0, 1.2)", class = "b", coef = "seasonwinter:locationin"), 
#~~~~~~~~~~~ the average winter DOC concentration downstream ~~~~~~~~~
set_prior("normal(0, 1.2)", class = "b", coef = "seasonwinter:locationout") ,
#~~~~~~~~~~~ the  residual standard deviation ~~~~~~~~~
set_prior("student_t(3, 0, 0.5)", class = "sigma" ))

# fit the model 

b1 <- brm(formula = bf_mi,
        data = d4, # observed data 
        prior = priors,
        iter = 4000, 
        warmup = 2000,  # number of warmup iterations
        chains = 4, 
        cores = 4,
        control = list(adapt_delta = 0.99, max_treedepth = 15), # facilitate convergence
        backend = "cmdstanr",
        save_pars = save_pars(latent = TRUE),
        seed = 7, # allow exact replication of results 
        file = here("results","Leonardo","models_fit","DOC","b1"),
        sample_prior = TRUE, # sample also the priors 
         )

The data:
d4.csv (21.5 KB)

Yes, at least this is how I think it should be specified.

Regarding you second point, log-normal makes sense. I say this with very little confidence: I think that your first line in your model description might be wrong. I believe that the observed DOC should be normally distributed around the true value.

Also, I am generally not the right person to help with formal model description. Perhaps @Solomon could have a look?

So, then, I would do it like this:

b2 <- brm(bf(DOC | mi(0.2) ~ 0 + season:location + (0 + season:location|site)),
          family = lognormal(),
          data = d4,
          prior = priors,
          iter = 4000, 
          warmup = 2000,
          chains = 4, 
          cores = 4,
          control = list(adapt_delta = 0.99),
          backend = "cmdstanr",
          save_pars = save_pars(latent = TRUE),
          seed = 7
)
pp_check(b2, ndraws = 100)
summary(b2)
plot(b2)
em = emmeans(b2, ~ location | season)
summary(em, point = "mean", level = .95)
p = pairs(em)
summary(p, point = "mean", level = .95)

I have thought about what you have stated. I think you are right. The DOC_{OBS} \sim Normal(\mu_{true}, \sigma) with \sigma = 0.2.
And then, I model
\mu_{true} \sim LogNormal(\mu , \sigma)
Therefore I assume that I am modelling the mean of DOC_{OBS} if I follow your suggestion.
I read the 3.7.1 chapter from @bnicenboim et al. 2024. And I am not sure. Please @Solomon or anyone else give us some insights.

1 Like
brms::stancode(formula, family = lognormal(), data = d4)

The relevant part of the output:

model {
  // likelihood including constants
[...]
  // priors including constants
  target += lprior;
  target += normal_lpdf(Y[Jme] | Yl[Jme], noise[Jme]);
}

;)

Thanks Angelos. May you please explain in details this line of code:

  target += normal_lpdf(Y[Jme] | Yl[Jme], noise[Jme]);

since I see :

  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  // data for measurement-error in the response
  vector<lower=0>[N] noise;
  // information about non-missings
  int<lower=0> Nme;
  array[Nme] int<lower=1> Jme;
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  vector[N] Z_1_3;
  vector[N] Z_1_4;
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}

Question: what is Y[Jme] ?

I have fitted the same model with and without error measurement. What I got is a not different posterior predictins of the outcome by season and location. Isn’t it quite surprising, Angelos? I am trying to figure out , why.


This is the code:

# brms formula with error measurement 
bf_mi <-  bf(DOC | mi(0.2)  ~ 0 + season:location + (0 + season:location| site)) + lognormal(link = "identity")

# brms formula without error measurement

bf <-  bf(DOC   ~ 0 + season:location + (0 + season:location| site)) + lognormal(link = "identity")

# Set priors 
priors_informative <- c(  
  #~~~~~~~~~~~ the average summer DOC concentration downstream ~~~~~~~~~
  set_prior("normal(0.5, 1.3)", class = "b", coef = "seasonsummer:locationout" ), 
  #~~~~~~~~~~~ the average summer DOC concentration upstream ~~~~~~~~~
  set_prior("normal(0.5, 1.3)", class = "b", coef = "seasonsummer:locationin"), 
  #~~~~~~~~~~~ the average winter DOC concentration upstream ~~~~~~~~~
  set_prior("normal(0, 1.2)", class = "b", coef = "seasonwinter:locationin"), 
  #~~~~~~~~~~~ the average winter DOC concentration downstream ~~~~~~~~~
  set_prior("normal(0, 1.2)", class = "b", coef = "seasonwinter:locationout") ,
  # between-group (here, between-site) variability. 
  set_prior("normal(0, 0.2)", class = "sd", group= "site", coef = "seasonsummer:locationin") ,
  set_prior("normal(0, 0.2)", class = "sd", group= "site", coef = "seasonsummer:locationout") ,
  set_prior("normal(0, 0.1)", class = "sd", group= "site", coef = "seasonwinter:locationin") ,
  set_prior("normal(0, 0.1)", class = "sd", group= "site", coef = "seasonwinter:locationout") ,
  
  #~~~~~~~~~~~ the  residual standard deviation ~~~~~~~~~
  set_prior("student_t(3, 0, 0.5)", class = "sigma" )
  
)

# then fit the models 

b1 <- brm(formula = bf_mi,
        data = d4, # observed data 
        prior = priors_informative,
        iter = 4000, 
        warmup = 2000,  # number of warmup iterations
        chains = 4, 
        cores = 4,
        control = list(adapt_delta = 0.99, max_treedepth = 15), # facilitate convergence
        backend = "cmdstanr",
        save_pars = save_pars(latent = TRUE),
        seed = 7, # allow exact replication of results 
        sample_prior = TRUE, # sample also the priors 
         )

b1_no_error <- brm(formula = bf,
        data = d4, # observed data 
        prior = priors_informative,
        iter = 4000, 
        warmup = 2000,  # number of warmup iterations
        chains = 4, 
        cores = 4,
        control = list(adapt_delta = 0.99, max_treedepth = 15), # facilitate convergence
        backend = "cmdstanr",
        save_pars = save_pars(latent = TRUE),
        seed = 7, # allow exact replication of results 
        sample_prior = TRUE, # sample also the priors 
         )
# generate new data frame 
conditions <- expand.grid(season = c("summer","winter"), location = c("out","in"), site = unique(d4$site))

# predicted draws with error measurament
f_error <- 
  add_predicted_draws(b1, ndraws = 1500 , newdata = conditions, scale = "response") %>%
  as_tibble() |> 
  mutate(model = "with error mi(0.2)")

# with no error measurament
f_no_error <- 
  add_predicted_draws(b1_no_error, ndraws = 1500, newdata = conditions,  scale = "response") %>%
  as_tibble() |> 
  mutate(model = "no error")
# bind df 
dmerg <- rbind(f_no_error, f_error) 

# Plot 

ggplot(dmerg, aes(x = .prediction)) +
  stat_slab(aes(fill = model), alpha = .53) + 
  stat_pointinterval(aes(color = model), position = position_dodge(width = .4, preserve = "single"), .width = c(0.66, 0.95)) +
  facet_wrap(season ~ location, scales = "free_x", axes = "all", axis.labels = "all_x") +
  labs(x = "DOC (mg/L)", y = NULL)+
  theme_bw()+
  
  facetted_pos_scales(x = list(
      location == "in" & season == "summer" ~ scale_x_continuous(breaks = seq(from = 0, to = 10, by = 1), limits= c(0, 10)),
      location == "in" & season == "winter" ~ scale_x_continuous(breaks = seq(from = 0, to = 7, by = 1), limits = c(0, 7)),
      
    location == "out" & season == "summer" ~ scale_x_continuous(breaks = seq(from = 0, to = 12, by = 1), limits = c(0, 12)),
    location == "out" & season == "winter" ~ scale_x_continuous(breaks = seq(from = 0, to = 10, by = 1), limits = c(0 , 10))
    ))

Here’s a quote from McElreath’s Rethinking:

Probably the majority of regressions are really using proxies like Dobs, because most variables are measurements with some error. […], using a proxy can introduce systematic bias, distorting the estimates. Since the extent of measurement error varies […] in a way that is associated with variables of interest, that is likely in this example.

My understanding of the above is that when measurement error is unrelated to your response or predictors, it is unlikely to have an influence on estimates. If your error is the manufacturer’s reported instrument precision, it is unaffected by conditions. Contrast that to, say, making more imprecise measurements in winter vs summer or making more imprecise measurements when DOC is lower. In that case, if you had an estimate of error by taking several measurements on the spot, including that in the model might have some influence.