Measurement error models on lognormal distributed data

Hi
I want to model plant mass data that I recorded with a good idea of the measurement error on my data. Because plant mass can only be positive, I tried to model this with a lognormal error distribution in BRMS, making use of the “mi(sdy = [measurement error])” capabilities to incorporate measurement error. I include a toy example below. In this simulation, as is the case in my plant mass data, the Gaussian measurement error results in negative observations of mass.

#known measurement error of 0.5
m_err <- 0.5
#simulate lognormal real plant and observed plant mass
sim1_data <- tibble(n = 1:1000)%>%mutate(m_err = m_err,
                                         real_mass = exp(rnorm(n(), -2, 1)), 
                                         obs_mass = rnorm(n(), real_mass, m_err))

bf_sim1 <- bf(obs_mass | mi(sdy = m_err) ~ 1, family="lognormal")

m_sim1 <- brm(data = sim1_data, 
             bf_sim1,
             backend = "cmdstanr",
             prior = c(
               set_prior("normal(0,1)", class = "Intercept"),
               set_prior("cauchy(0,1)", class = "sigma")),
             iter = 500, warmup = 100)

Brms does not let me model this with negative observed outcomes: “Error: Family ‘lognormal’ requires response greater than 0.” However, when inspecting the stan code generated by BRMS using the make_stancode() function, it seems that it models the outcome variable as the result of a gaussian draw with mean the positive latent “real” plant mass with sd the given measurement error.

  1. Does this not mean that negative outcomes should be allowed. After all, negative outcomes have a non-zero probability under any guassian distribution no matter the mean.
  2. Is there a way to define the such a model that circumvents this restriction by using the latent variable imputation of BRMS? I tried fitting a multivariate formula with the formulas below and fixing the coefficients for mi(real_mass) and m_err to 1 so that the first formula models the measurement error process. This model has a very hard time converging.
bf_sim1_me <- bf(obs_mass ~ 0+mi(real_mass), sigma ~ 0 + m_err, family = 'normal')
bf_sim1_dwr <-bf(real_mass | mi() ~ 1, family = brmsfamily("lognormal", link_sigma = "identity"))

I am returning to the Stan/BRMS forums with a similar question to a previous post which did not yield a working solution (Model lognormal distributed real values with measurement error that yield negative observed values in brms). With the data set in question, I am always returning to this same problem. So, I thought to give it another shot.

Thank you for any help.

From your description, it looks like the mean of your outcome variable is what should be log-normally distributed, right? That is, if I’m understanding it right, your observed variable is a compound distribution: Normal with a log-normal mean and a constant SD. If so, I wonder whether this is something you can address with a non-linear formula; something like:

obs_mass ~ exp(true_mass),
true_mass ~ 1 + (1 | index),
sigma ~ 1,
nl = TRUE,
family = "gaussian"

You would need to add an index variable to allow true_mass to follow a log-normal distribution across observations; the random effect of index would then be the SD of the log-normal component (and the intercept would be the mean).

  1. Does this not mean that negative outcomes should be allowed. After all, negative outcomes have a non-zero probability under any gaussian distribution no matter the mean.

I certainly think you are right, this seems like a bug.

Are you open to building a Stan model for it (outside brms)? That might be the best way forward.

I would love to be wrong but I don’t think there is a way around the fact that you lack the weight of the envelope before drying…

Your goal is to relate dry mass to some predictors, presumably. Even if you get around the "Error: Family ‘lognormal’ requires response greater than 0.” your error is not measured per observation; it is a distribution of weights of the used envelopes, but it is disconnected from any given observation. See also this discussion.

Thank you for the response. This seems to work.

1 Like

I am open to build a model in Stan, but have no experience. I believe the model implementation suggested above is a working solution inside brms.

Thank you for the response. However, would you care to clarify a bit further because I don’t seem why a measure of measurement error per observation would be necessary? I assume that the source of measurement error is from not knowing the exact weight of the envelope the plants were weighed in; envelopes that can be characterized by a normal distribution with a weight and sd. In this case the measurement error does not seem to be dependent on the the weight of the plants nor on predictors that are presumably affecting the plant’s weight.
How would it look like to measure the error caused by an envelope per observation? Measure the same dried material in different envelopes?
Also, in the linked post you mention that measurement error does not seem to affect the model predictions if it is independent of the predictors. I interpret this as follows, that added variation by measurement error is absorbed in the variation parameter of the modelled error distribution. However, I think that in my case it does affect the model’s predictions because, even though the measurement error (envelope) is not affected by the predictor, the inference of the predictor’s effect is affected.

Before any of that, now that I run your code, I can’t figure out how you get negative observed mass (in your real data).

Isn’t this a better representation of what you are dealing with? (You have tot_mass but not real_mass. You know the distribution of “error”)

sim2_data <- tibble(n = 1:1000) %>% mutate(real_mass = exp(rnorm(n(), -2, 1)), 
                                           env_mass = rnorm(n(), 2, .5),
                                           tot_mass = real_mass + env_mass)

bf_sim2 <- bf(tot_mass | mi(.5) ~ 1, family="lognormal")

m_sim2 <- brm(data = sim2_data, 
              bf_sim2,
              backend = "cmdstanr",
              prior = c(
                set_prior("normal(0,1)", class = "Intercept"),
                set_prior("cauchy(0,1)", class = "sigma")),
              iter = 5000, warmup = 1000)

About my comment on observation level error:
Let’s stick to the example at hand (but leave the envelope out for simplicity). You measure dry weight of plants (out of the envelope, into a tray that has been tared). Let’s say that you are working with a balance whose precision limit is close to the weights you are working with. For heavy plants you get more precise measurements than for light ones. So you measure every plant 5 times. For heavy plants these measurements will be more consistent than for light ones. So lower measurement error for heavy plants than for light ones. And you have an error estimate for each plant.

Unless I am mistaken, this is the type of measurement error that you can account for using the brms mi() function. What you are dealing with, instead, is random noise from the variation of weight of your envelopes, which is disconnected from your response or your predictors.

Your simulation example is a good representation of how the tot_mass is produced. However, if I infer the effect of my predictors on tot_mass, the effect of the envelopes will be largely absorbed by the intercept, I guess. This way, the model cannot distinguish between envelopes and plant weight.
I thought I could do a bit better because I have a lot of information on the envelopes themselves and could then estimate a plant weight in each sample (with the restriction of a lognormal distribution). Assuming the envelopes have a normally distributed mass with a known mean and standard deviation, tot_mass ~ norm(mean_env, sd_env) + lognorm(exp_plantweight, sigma_weight). This, I think, means that tot_mass - mean_env ~ norm(0, sd_env) + lognorm(exp_plantweight, sigma_weight) ~ norm( lognorm(exp_plantweight, sigma_weight), sd_env). This is what I tried to model and the reason why the response variable (tot_mass-mean_env) has negative values.

I think I understand your point now on measuring measurement error per observation. However, you may have picked up on the fact that it was impossible to separate the plant material from the envelope. So, this was the best we could do. Though, plant+envelope mass was orders of magnitudes above the precision limit of the microbalans.