Machine noise -- do we have to be fancy?

Continuing the discussion from How to left censor with an unknown limit of detection?:

I have been trying to come up with an approach to deal with a problem that may be general enough to affect (to varying degrees) anyone dealing with data from microscopes and spectrophotometers. That makes me suspect it has been dealt with before, though I can't find any information about it online. Sorry if this is actually a well-known problem I'm re-articulating.

Spectrophotometers and microscopes shine light at samples and report something like absorbance, fluorescence intensity, or light scattering, and these values correlate with many quantities of interest, such as bacterial density, chemical concentration, etc. Importantly, these machines are designed not to report negative values, so they always inject a little extra voltage, adding a baseline level of machine noise \epsilon to any measurement. Suppose this noise is normally distributed, i.e. \epsilon \sim \mathcal{N}(L,\tau), where L is a lower Limit of Detection (LOD) and \tau is the spread of this machine-added background noise.

The reason I call L, the location parameter of \epsilon, an LOD is that if someone tries to measure the signal Y_{measured} from a sample whose expected signal is very low in magnitude, i.e. \mathbb{E}[Y_{measured}] \ll L, they will get a signal around L \pm \tau instead. Because of this, I believe this is a censored-data-type problem, but not one that matches the censored models described in the Stan manual, or online for that matter. I am sure I am not the first to think of this problem, however, and would really appreciate pointers. I should mention I am a biologist and not a theoretician by any stretch.

What's motivating is that, even though \epsilon is unknown, it may actually be quite easy for models to estimate it from data. In fact, people regularly estimate a similar quantity themselves just by looking at data, without the aid of a model: a heuristic and well-known way to define an LOD in spectrophotometry is to estimate \epsilon using the mean and standard deviation of blank samples (where the signal should be zero). People often include blank samples in their data collection to get an idea of machine noise levels. Sometimes the standard deviation gets multiplied by a scaling factor F (it seems to be a safety margin), so that the final definition of the LOD for your experiment is L + F\tau; I've seen values of F between 1.5 and 3. Experiments may have an upper bound as well (though for different reasons), and you essentially chuck out all data that falls beyond these limits. But since L and \tau are potentially easy to estimate, including that additional error structure in the model may help in some very important ways for me.
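To make the heuristic concrete (the notation here is mine): if \bar{y}_{blank} and s_{blank} are the sample mean and standard deviation of the blank measurements, then

\hat{L} = \bar{y}_{blank}, \qquad \hat{\tau} = s_{blank}, \qquad \widehat{\mathrm{LOD}} = \hat{L} + F\,\hat{\tau}, \qquad F \approx 1.5 \text{ to } 3.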

So far, this is the only approach that has worked for me when coding it in Stan:
\begin{align}
Y_{measured} &\sim \mathrm{Lognormal}(\ln(\mu), \tau)\\
\mu &= L + Y_{obs}\\
Y_{obs} &\sim \mathrm{Lognormal}(\ln(\nu), \sigma)\\
\nu &= f(X_{1}, X_{2}, \ldots)
\end{align}
This nested-lognormal mess is obviously not ideal: it implies that both machine noise and experimental noise are higher for high values of Y_{measured}, and the nested lognormal variances compound on each other, so that the actual variance around the detection limit is no longer really \tau but some function of \tau and \sigma.

I suspect the reason this formulation works when nothing else does is that, according to the literature, the sum of lognormals is sometimes well approximated by another lognormal, so it works without having to perform convolutions of random variables.
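
To make the structure explicit, here is a minimal Stan sketch of this nested-lognormal formulation, assuming a single predictor x and a log-linear f (not my actual code, which is in the linked post; the priors and scales are placeholders):

```stan
// Minimal sketch only: nested-lognormal machine-noise model.
// Assumes a single predictor x and a log-linear f(); the predictors,
// priors, and scales below are placeholders, not the real model.
data {
  int<lower=1> N;                    // number of measurements
  vector[N] x;                       // predictor (collapsed to one for illustration)
  vector<lower=0>[N] y_measured;     // machine readout
}
parameters {
  real alpha;                        // intercept of log(nu)
  real beta;                         // slope of log(nu)
  real<lower=0> L;                   // location of machine noise (LOD)
  real<lower=0> tau;                 // machine-noise scale
  real<lower=0> sigma;               // experimental-noise scale
  vector<lower=0>[N] y_obs;          // latent machine-noise-free signal
}
model {
  vector[N] log_nu = alpha + beta * x;          // log(nu), nu = f(x)
  alpha ~ normal(0, 2);                         // placeholder priors
  beta ~ normal(0, 2);
  L ~ normal(0, 0.1);
  tau ~ normal(0, 0.5);
  sigma ~ normal(0, 0.5);
  y_obs ~ lognormal(log_nu, sigma);             // Y_obs ~ Lognormal(ln(nu), sigma)
  y_measured ~ lognormal(log(L + y_obs), tau);  // Y_measured ~ Lognormal(ln(L + Y_obs), tau)
}
```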

In some cases this machine noise makes no difference. Suppose, for example, you are interested in a slope that relates the response linearly to the amount of an added drug:

(reprex example; plot below)
require(tidyverse)
set.seed(3)
expand_grid(
  drug = 2^(0:-10), # drug concentration, max set to 1 for simplicity
  rep=1:5 # experimental replicate
) %>% mutate(
  response_mu = 3*drug, #expected / mean response
  response = rlnorm(n(),meanlog=log(response_mu),sdlog=0.15), # actual response 
  observed = rlnorm(n(),log(response+0.05),0.1) # response with machine noise
) %>% ggplot(aes(drug,observed)) +  coord_trans(x='log10',y='log10')+
  theme_classic() + geom_point() + scale_x_continuous(breaks = c(0.001,0.01,0.1,1)) +
  labs(y= 'observed response')

[figure: observed response vs. drug concentration on log-log axes]

In this case the easiest thing to do is just to retain the linear portion of the data, and that’s what people do all the time. No need to model machine noise.

However, let me show an example where it can really matter (I have others, and maybe you can think of your own).

Suppose you are using a spectrophotometer to measure the density of bacteria in a sample that is growing according to the logistic growth equation
\frac{d}{dt}N(t) = rN(t)(1-\frac{N(t)}{K})
where r is the exponential growth rate of the bacteria and K is the carrying capacity (the maximum density observed).
Suppose you would like to solve such an equation as an initial value problem and fit it to the data, and that you assume the error to be distributed log-normally. The initial value in this case would be N_{0}, the initial density.
If this value is not known, it seems to me that the extra noise at the lower end of the measurement range will badly affect the estimates not only of N_{0} but also of r and \sigma. To see why, note that r is essentially the slope of the logarithm of the signal during the growing portion of the curve (before the density saturates around K), so variation there makes the slope very uncertain. A model without machine noise expects constant noise throughout the growth curve, so higher noise at the lower end makes the overall \sigma an average over the initial and final growth phases, and the model takes the initial variation too seriously. I believe adding a machine-noise element to the model would reduce the influence of that variation.

Here is a simulation to show what I mean:

Reprex for logistic growth
library(deSolve)
library(tidyverse) # for tibble, map, unnest, and ggplot below
ode_density_logis <- function (time, init, parms, ...) {
  with(as.list(c(parms, init)), {
    dN <- r*N*(1-(N)/K)
    list(dN)
  })
}
grow_density_logis <- function(time, parms, ...) {
  init <- parms["N0"] # initial values
  names(init) <- 'N' # force names of state variables
  odeparms <- parms[c("K","r")] # the parms of the ODE model
  out <- as.data.frame(ode(init, time, ode_density_logis, parms = odeparms))
  out
}


N0 = 0.001; r = 1; K = 1;
set.seed(123)

tibble(rep=1:10) %>% 
  mutate(Hours = list(0:24),
    N_t = map(Hours,~rlnorm(.x,log(grow_density_logis(.x,parms = c(N0=N0,r=r,K=K))$N),sdlog=0.1)),
    N_t_measured = map(N_t, ~ rnorm(25,mean = (.x + 0.001),sd=0.01))) %>% 
  unnest(everything()) %>% ggplot(aes(Hours,N_t_measured)) + geom_point(aes(color = 'observed')) + scale_y_log10() +
  geom_point(aes(y=N_t,color='actual')) + labs(y='Bacterial density', color = NULL) + theme_classic()
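
For reference, here is a minimal Stan sketch of the kind of ODE model I am fitting (not my exact code; the priors, data layout, and solver settings here are placeholder assumptions):

```stan
// Minimal sketch only: logistic growth solved as an IVP with lognormal
// observation error and unknown N0; no machine-noise term yet.
// Priors, data layout, and solver choice are placeholder assumptions.
functions {
  vector logistic_rhs(real t, vector y, real r, real K) {
    vector[1] dydt;
    dydt[1] = r * y[1] * (1 - y[1] / K);
    return dydt;
  }
}
data {
  int<lower=1> T;                 // number of observation times
  array[T] real t_obs;            // observation times, all > 0
  vector<lower=0>[T] y_obs;       // measured densities
}
parameters {
  real<lower=0> r;                // growth rate
  real<lower=0> K;                // carrying capacity
  real<lower=0> N0;               // initial density (unknown)
  real<lower=0> sigma;            // lognormal observation noise
}
model {
  array[T] vector[1] N_hat = ode_rk45(logistic_rhs, rep_vector(N0, 1), 0, t_obs, r, K);
  r ~ lognormal(0, 1);            // placeholder priors
  K ~ lognormal(0, 1);
  N0 ~ lognormal(-5, 2);
  sigma ~ normal(0, 0.5);
  for (t in 1:T)
    y_obs[t] ~ lognormal(log(N_hat[t][1]), sigma);
}
```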

When I try to fit such a model with N_{0} as an unknown, I see the following errors in my posterior predictions:

[figure: average relative error of posterior predictions over time]
I plot the average relative error against time, (y_{rep} - y_{observed}) / y_{observed}, and the errors inflate at the initial timepoints because of machine noise. I also think the estimate of \sigma is inflated and highly sensitive to outliers at those initial timepoints.

I have yet to test whether adding the machine-noise feature described above (the nested-lognormal one) improves the modelling of this example, but I will add that to this post when I have more time. I just wanted to continue the problem from the previous post now that I have more clarity about what I think it really is, and to say that I have gotten some good results with the nested-lognormal approach. The code for the nested lognormals is in the previous post (linked at the top).

Thanks to anyone willing to offer their two cents.


I see why you say it looks like censoring here, as it does have that feel. I think it’s actually closer to a “noisy measurement model” and you’ll have more luck finding answers in that chapter of the User’s Guide and in the literature.

In your case, you should be able to treat your observed measurement as containing two components, a noisy underlying measurement I’ll call Z and the injected noise that I’ll call \epsilon (following your notation), so that

Y^{\textrm{measured}}_n = Z_n + \epsilon_n

where \epsilon_n \sim \textrm{normal}(L, \tau) is the injected noise and I’m assuming Z_n \sim \textrm{normal}(Y^{\textrm{true}}_n, \sigma) is the measurement model before the injected noise, and our goal is to do inference for Y^{\textrm{true}}. Normals are very convenient here, in that we can just build a straight-up measurement error model by noticing that the above implies

Y^{\textrm{measured}}_n \sim \textrm{normal}\left(Y^{\textrm{true}}_n + L,\ \sqrt{\sigma^2 + \tau^2}\right)

With that, you can proceed the same way as suggested in the noisy measurement chapter of the User's Guide. You may not be able to identify \sigma and \tau separately, though, and it may make more sense to combine them into a single scale. You'll need a prior for L, and presumably a lower bound for its value in its declaration.
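
Concretely, a sketch of that measurement-error model in Stan might look like the following (the linear predictor and the priors are just placeholders):

```stan
// Minimal sketch only: additive machine noise folded into a normal
// measurement model. The linear predictor and all priors are placeholders.
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y_measured;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> L;         // location of the injected noise
  real<lower=0> tau;       // scale of the injected noise
  real<lower=0> sigma;     // measurement noise before injection
}
model {
  vector[N] y_true = alpha + beta * x;
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  L ~ normal(0, 0.1);      // prior on L; lower bound enforced in the declaration
  tau ~ normal(0, 0.5);
  sigma ~ normal(0, 0.5);
  // sum of two independent normals: sd is sqrt(sigma^2 + tau^2)
  y_measured ~ normal(y_true + L, sqrt(square(sigma) + square(tau)));
}
```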

The model you describe does not technically enforce positivity because normals can take on any value (in theory—in practice, you’re bounded by scale). If you really know everything’s positive and the errors are multiplicative (depend on scale of value) rather than absolute, then you can convert the whole thing over to lognormal. I’d think of that as just being normal on the log scale, then average again and hop back.


The fact that the noise is engineered into the machine precisely to keep all readouts positive motivates the constraint. Also, for most outputs I can think of in these types of experiments, variables like concentration, density, fluorescence intensity, etc. are thought of as lying in \mathbb{R}^{+}.

What I like most is the idea of having Z_{n} be lognormal and \epsilon be normal, because it makes sense from the variance perspective: machines do not add more noise to high-signal samples than to low-signal ones, so a multiplicative relationship between the L noise and the Y noise doesn't make sense. But I fear that having
\ln(Y^{measured}_{n}) \sim \mathcal{N}(\ln(Y^{true}_{n}) + \ln(L),\sigma + \tau)
will do exactly that once converted back to the natural scale.

I should mention that I could not get a model with Z_{n} lognormal and \epsilon normal to sample in Stan. I wonder if that is because the log-likelihood of such a process would have to be modelled as the convolution of a normal with a lognormal.
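
To make concrete what I mean, here is a minimal sketch of one way such a latent-variable model could be written, keeping Z_{n} as a parameter so that no explicit normal-lognormal convolution is needed (the priors, and treating Y^{true} as known data, are placeholder simplifications; this is roughly the kind of formulation I could not get to sample well):

```stan
// Minimal sketch only: latent lognormal signal plus additive normal machine
// noise, written without an explicit normal-lognormal convolution by keeping
// the latent signal z as a parameter. For simplicity the true signal y_true
// is passed in as data; in the real problem it would come from the regression
// or ODE solution. Priors are placeholders.
data {
  int<lower=1> N;
  vector[N] y_measured;
  vector<lower=0>[N] y_true;
}
parameters {
  real<lower=0> L;
  real<lower=0> tau;
  real<lower=0> sigma;
  vector<lower=0>[N] z;      // latent signal before machine noise
}
model {
  L ~ normal(0, 0.1);
  tau ~ normal(0, 0.5);
  sigma ~ normal(0, 0.5);
  z ~ lognormal(log(y_true), sigma);   // Z_n ~ Lognormal(ln(Y_true_n), sigma)
  y_measured ~ normal(z + L, tau);     // Y_measured_n = Z_n + epsilon_n
}
```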