Continuing the discussion from How to left censor with an unknown limit of detection?:

I have been trying to come up with an approach to deal with a problem which may be general enough to affect (to varying degrees) anyone dealing with data from microscopes and spectrophotometers. That makes me suspect it has been dealt with before, though I can’t find any information about it online. Sorry if this is actually a well-known problem I’m re-articulating.

Spectrophotometers and microscopes shine light at samples and report something like absorbance, fluorescence intensity, or light scattering, and these values correlate with many quantities of interest, like bacterial density, chemical concentrations, etc. Importantly, these machines don’t want to report negative values, so they always inject a little extra voltage, adding a constant level of machine noise \epsilon to any measurement. Suppose this noise is normally distributed, i.e. \epsilon \sim \mathcal{N}(L,\tau), where L is a lower limit of detection (LOD) and \tau is the scale of this machine-added background noise.

The reason I call L, the location parameter of \epsilon, a LOD is that if someone tries to measure the signal Y_{measured} from a sample whose expected signal is very low in magnitude, i.e. \mathbb{E}[Y_{measured}] \ll L, they will get a signal around L \pm \tau instead. Because of this, I believe this is a censored data–type problem, but not one that matches any of the censored-data models described in the Stan manual, or online for that matter. I am sure I am not the first one to think of this problem, however, and would really appreciate pointers. I should mention I am a biologist and not a theoretician by any stretch.

What’s motivating is that, even though \epsilon is unknown, it may actually be quite easy for models to estimate from data. In fact, people regularly estimate a similar quantity themselves just by looking at data, without the aid of a model: for example, a well-known heuristic for defining a LOD in spectrophotometry is to estimate \epsilon using the mean and standard deviation of blank samples (where the signal should be zero). People often include blank samples in their data collection phase to get an idea of machine noise levels. Sometimes the standard deviation gets multiplied by a scaling factor F (apparently as a safety precaution), such that the final definition of the LOD for your experiment is L + F\tau; I’ve seen values of F between 1.5 and 3. Experiments may have an upper bound as well (though for different reasons), and you essentially chuck out all data that falls beyond these limits. But since L and \tau are potentially easy to estimate, including that additional error structure in the model may help in some very important ways for me.
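As an illustration of that heuristic (a sketch with made-up numbers; `blanks`, `L_hat`, `tau_hat`, and `F_safety` are names I invented here, not from any package), estimating L and \tau from blank samples in base R might look like:

```r
# Simulated blank readings: zero true signal, so only machine noise remains.
set.seed(1)
blanks <- rnorm(30, mean = 0.05, sd = 0.01)

L_hat    <- mean(blanks) # estimate of the noise location L
tau_hat  <- sd(blanks)   # estimate of the noise scale tau
F_safety <- 3            # common safety factor, typically 1.5-3
LOD      <- L_hat + F_safety * tau_hat
LOD
```

Anything measured below `LOD` would then be treated as indistinguishable from machine noise.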

This is the only approach that has worked for me so far in terms of coding it in Stan.

\begin{align}
&Y_{measured} \sim \mathrm{Lognormal}(\ln(\mu),\tau)\\
&\mu = L + Y_{obs}\\
&Y_{obs} \sim \mathrm{Lognormal}(\ln(\nu),\sigma)\\
&\nu = f(X_{1},X_{2},...)
\end{align}

This nested lognormal mess is obviously not ideal, because it implies higher machine noise and experimental noise for high values of Y_{measured}, and the nested lognormal variances also compound on each other, such that the actual variance around the detection limit is no longer really \tau but some function of \tau and \sigma.
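For concreteness, here is a minimal Stan sketch of the nested lognormal model above (a sketch only: the single predictor X, the log-linear choice of f, and all priors are placeholders I made up, not the actual model from the previous post):

```stan
data {
  int<lower=1> N;
  vector<lower=0>[N] Y_measured;
  vector[N] X;                    // single predictor, for illustration
}
parameters {
  real<lower=0> L;                // limit of detection (noise location)
  real<lower=0> tau;              // machine-noise scale (log scale)
  real<lower=0> sigma;            // experimental-noise scale (log scale)
  real a;                         // placeholder regression coefficients
  real b;
  vector<lower=0>[N] Y_obs;       // latent machine-noise-free signal
}
model {
  vector[N] nu = exp(a + b * X);  // nu = f(X1, X2, ...), here a log-linear f
  // placeholder priors
  L ~ normal(0, 1);
  tau ~ normal(0, 0.5);
  sigma ~ normal(0, 0.5);
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  // nested lognormals: latent signal, then measurement shifted by L
  Y_obs ~ lognormal(log(nu), sigma);
  Y_measured ~ lognormal(log(L + Y_obs), tau);
}
```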

I think the reason this one works while nothing else does is that, according to the literature, a sum of lognormals is sometimes well approximated by another lognormal, so this formulation works without having to perform convolutions of random variables.
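As a quick sanity check on that claim, here is a small base-R simulation (illustrative only; the moment-matching step in the Fenton–Wilkinson style is my addition, not from the post) showing that the sum of two lognormals is reasonably well approximated by a single lognormal:

```r
set.seed(42)
n <- 1e5
noise  <- rlnorm(n, meanlog = log(0.05), sdlog = 0.2) # machine noise around L = 0.05
signal <- rlnorm(n, meanlog = log(0.2),  sdlog = 0.3) # true signal
S <- noise + signal

# Moment-match a single lognormal to the sum S.
m <- mean(S); v <- var(S)
sdlog_hat   <- sqrt(log(1 + v / m^2))
meanlog_hat <- log(m) - sdlog_hat^2 / 2

# Compare a few quantiles of the sum to the matched lognormal.
probs <- c(0.1, 0.5, 0.9)
round(cbind(empirical = quantile(S, probs),
            matched   = qlnorm(probs, meanlog_hat, sdlog_hat)), 3)
```

The quantiles line up closely at these parameter values, which is consistent with the approximation holding here.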

This machine noise makes no difference in some cases. Suppose, for example, you are interested in a linear slope relating the response to some added drug.

## Reprex for the linear case (plot below)

```
require(tidyverse)
set.seed(3)
expand_grid(
  drug = 2^(0:-10), # drug concentration; max set to 1 for simplicity
  rep  = 1:5        # experimental replicate
) %>%
  mutate(
    response_mu = 3 * drug, # expected / mean response
    response = rlnorm(n(), meanlog = log(response_mu), sdlog = 0.15), # actual response
    observed = rlnorm(n(), log(response + 0.05), 0.1) # response with machine noise
  ) %>%
  ggplot(aes(drug, observed)) +
  coord_trans(x = 'log10', y = 'log10') +
  theme_classic() +
  geom_point() +
  scale_x_continuous(breaks = c(0.001, 0.01, 0.1, 1)) +
  labs(y = 'observed response')
```

In this case the easiest thing to do is just to retain the linear portion of the data, and that’s what people do all the time. No need to model machine noise.

However, let me show an example where it can really matter (I have other examples where it would matter and maybe you can think of your own).

Suppose you are using a spectrophotometer to measure the density of bacteria in a sample that is growing according to the logistic growth equation

\frac{d}{dt}N(t) = rN(t)\left(1-\frac{N(t)}{K}\right)

where r is the exponential growth rate of the bacteria and K is the carrying capacity (the maximum density observed).

Suppose you would like to solve such an equation as an initial value problem and fit it to the data, and that you assume the error to be distributed log-normally. The initial value in this case would be N_{0}, the initial density.
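For reference, and assuming standard logistic growth with N(0) = N_{0}, this initial value problem has the closed-form solution

\begin{align}
N(t) = \frac{K}{1 + \left(\frac{K}{N_{0}} - 1\right)e^{-rt}}
\end{align}

which reduces to N(t) \approx N_{0}e^{rt} while N(t) \ll K.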

If this value is not known, it seems to me that the extra noise at the lower end of measurement will badly affect estimates not only of N_{0} but also of r and \sigma. To see why, note that in the early phase, before density saturates around K, N(t) \approx N_{0}e^{rt}, so r is essentially the slope of the logarithm of the signal; extra variation at those early timepoints makes the slope very uncertain. A model without machine noise expects constant noise throughout the growth curve, so higher noise at the lower end makes the overall \sigma an average across the initial and final growth phases, and the model takes the initial variation too seriously. I believe adding a machine noise element to the model will decrease this variation’s importance.

Here is a simulation to better show what I mean:

## Reprex for logistic growth

```
library(deSolve)
library(tidyverse)

ode_density_logis <- function(time, init, parms, ...) {
  with(as.list(c(parms, init)), {
    dN <- r * N * (1 - N / K)
    list(dN)
  })
}

grow_density_logis <- function(time, parms, ...) {
  init <- parms["N0"]             # initial values
  names(init) <- "N"              # force names of state variables
  odeparms <- parms[c("K", "r")]  # the parms of the ODE model
  as.data.frame(ode(init, time, ode_density_logis, parms = odeparms))
}

N0 <- 0.001; r <- 1; K <- 1
set.seed(123)
tibble(rep = 1:10) %>%
  mutate(
    Hours = list(0:24),
    # true density with lognormal experimental noise
    N_t = map(Hours, ~ rlnorm(length(.x),
                              log(grow_density_logis(.x, parms = c(N0 = N0, r = r, K = K))$N),
                              sdlog = 0.1)),
    # measured density with additive machine noise
    N_t_measured = map(N_t, ~ rnorm(length(.x), mean = .x + 0.001, sd = 0.01))
  ) %>%
  unnest(everything()) %>%
  ggplot(aes(Hours, N_t_measured)) +
  geom_point(aes(color = "observed")) +
  scale_y_log10() +
  geom_point(aes(y = N_t, color = "actual")) +
  labs(y = "Bacterial density", color = NULL) +
  theme_classic()
```

When I try to fit such a model with N_{0} as an unknown, I see the following pattern of errors in my posterior predictions:

I plot the average relative error against time, (y_{rep} - y_{observed}) / y_{observed}, and I see errors inflating at the initial timepoints because of machine noise. I also think that the estimate of \sigma is inflated and highly sensitive to outliers at the initial timepoints.

I have yet to test whether adding the machine noise feature described above (the nested lognormal one) improves the modelling, but I will add that to this post soon when I have more time. I just wanted to continue on the problem from the previous post, now that I have more clarity about what I think it really could be, and to say that I have gotten some good results with the nested lognormal approach. The code for the nested lognormals is in the previous post (linked at the top).

Thanks to anyone willing to offer their two cents.