How to left censor with an unknown limit of detection?

Continuing the discussion from Priors on imputed values for Lognormal censored nonlinear model cause divergence:

I’m working with data that is collected by a spectrophotometer (basically, photons per sample), which can be used to estimate the density of certain bacteria in the sample.

There’s some logical ways to estimate the limit of detection, based on calibration methods etc. But I am also curious about having Stan estimate these limits of detection from data.

The problem I encounter when attempting such an estimation, is that I do not have the data separated into N_obs and N_cens a priori. So the models in the Stan manual do not work.

I think that one of the problem is that I have multiple predictors linked together by some nonlinear function, so that each y has a corresponding expected value X, but if that value falls below the detection limit, the expected value itself should be instead observed as some random noise around the detection limit. That would correspond to the machine picking up random small signals corresponding to background noise but not really being able to trace the actual density as it dips below the detection limit.

In simplest terms this is what I’m trying to do

data {
int N;
vector[N] y;
//predictor information
vector[N] x1;
vector[N] x2;
}
parameters {
// mu is a global parameter related to predictors x1 and x2 by some function
real mu;
real<lower=0> sigma;
real<upper=min(y_obs)> L; 
}
transformed parameters{
// expected value for each y based on mu and predictor info
vector[N] X = function(mu,x1,x2); 
}
model {
  L ~ ..;
  mu ~ ..;
  sigma ~ ..;

  for(i in 1:N){
    if(X[i]> L)
    target +=  normal_lpdf(y[i]| X[i],sigma);
    else
    target += lognormal_lcdf(L | X[i],sigma);
  }
 
}

Of course, this doesn’t work. I found out that the pedantic mode will return this informative warning:

    `Warning in ... A control flow statement depends on parameter(s):`

I really do wonder if I can fit something like this though in Stan. I can obviously simulate data from this generative model. In fact, I would really be interested also to know if the random noise around the detection limit can be separately modelled, something like sigma for the observed and sigma_cens for the noise around the detection limit.

Here is a reprex that is based on the process I explained written for R

require(tidyverse)
## true parameters for simulation
mu = 3; sigma = 0.01; sigma_cens = 0.005;L=1.85;
set.seed(2)
reprex_cens_dat = tibble(
  x1 = seq(0,0.5,length.out=1e2), # first predictor
  x2 = rep(c(2,10),each=50),# second predictor
  X = mu*x2/(1+x2/exp(-x1))) %>% # formula involving mu and predictors
  mutate(
    X_cens = ifelse(X < L, rnorm(n(),L,sigma_cens),X), # expected value with censoring
    y = rnorm(1e2,mean=X_cens,sd=sigma)) # observed value with experimental noise
# visualize: red points are actual expected values, black points are measured value with censoring around L
reprex_cens_dat %>% ggplot(aes(x1,y)) + geom_point() + geom_point(aes(y=X),color='red4',size=0.5)

If I understand correctly, your scenario reflects “truncation”, not “censoring”, and the manual describes how to handle truncation at an unknown value here.

Ah, reading your example I understand better that it’s not that you have either truncated or censored data, but are concerned about properly capturing that at some threshold for X the data transitions from signal-plus-noise to noise-only. But I think you’re missing that a simple mean-and-scale model of signal-and-noise will capture that behaviour automatically; you have to use a lognormal likelihood (else after X hits zero further decreases reflect an increase in SNR; presumably negative values for X are non-sensical, so modelling it on the log-scale is reasonable). The one way I could see this not fitting is if you have an automatic amplifier as part of your measurement process and don’t have access to the amplifcation levels, in which case you’d want to add amplification level as a latent parameter that acts as a multiplier on both X & sigma.

Sigh, I need to do better at reading/code-running before answering. Running your code I see that you’re generating functions that look like they should use a change-point model.

Thanks for your response.

I should have been clear that my reprex contains a totally arbitrary function and not the actual function I am using (which is more detailed in the original post). Below is a reprex containg the function that actualy relates x1 and x2 to y.

I think your second answer is the most illuminating for me, though I am not sure I fully understand it yet. If the measurement is lognormally distributed, the variance affects the measurement at the scale of the measurement, so doesn’t this imply the variance parameter will just be a percentage of the measurement, whatever the measurement is, i.e. absolute magnitude of variance will shrink as the measurement itself appraoches zero? Then I shouldn’t have any problems with measuring low bacterial densities.

But what I really see is that, for example, my spectrophotometer values correspond to actual densities for the range of measurements 1-0.1, but below 0.1 the actual values should, say, continue down to 0.01 (because I diluted the bacteria a further 10 times) but instead it hovers around 0.1. So I am still not sure what to do with my hunch that this is a sort of “detection limit” which implies censoring (or truncation?).

Here is a reprex resembling the actual data I’m working on:

# note now instead of just one mu, there are 3 mu's (mu_1 etc.), 
# and there are also 2 known parameters pi_1 and pi_2 (that will be fed in using data block)
# if you want biological intuition about these parameters and what htey mean (optional)
#( x2 is the dose of the drug, and x1 is different for different bacteria)
# (pi_1 is the amount of bacteria added to test tube and pi_2 is time waited)

pi_1=1e7; pi_2 = 22;
mu_1=1.8e9;mu_2=0.15;mu_3=0.5;
sigma = 0.8; sigma_cens = 0.25; L = 1e7; # as before L is the bacterial count below which I can't detect differences
set.seed(3)
sim_dat = expand_grid(x1=c(0.3,0.5,0.7), ## predictor 1
                      x2=c(0,(50/1.5^14)*1.5^(0:14)), ## predictor 2
                      rep=1:3# rep is experimental replicate 
) %>% 
  mutate(
## calculating true bacterial density using nonlinear formula
    X = pi_1*mu_1/(pi_1 + (mu_1-pi_1)*exp(((x2*mu_2 - mu_3*x1)*pi_2)/(x2+mu_3))), 
## applying noise to measurement
    y = ifelse(X < L, rlnorm(n(),meanlog=log(L),sdlog=sigma_cens),rlnorm(n(),meanlog=log(X),sdlog=sigma)) 
    ) 

## this visualization shows the three different bacteria (different x1 in different panels)
## for each one the actual measurement is in black and the true numbers are in red
sim_dat %>%   ggplot(aes(x2,y)) + geom_point()+
  scale_alpha_manual(values = c(1,0.2))+
  scale_x_continuous(trans=scales::log1p_trans())+
  geom_hline(aes(yintercept=L))+
  scale_y_log10()+
  facet_grid(~x1) + theme_classic() + geom_point(aes(y=X),color='red3')

Update: I think I have an idea how to proceed but not sure if I’m modelling it correctly.

The gist of the idea is that, on top of my lognormally distributed outcome, I have a constant machine error \epsilon \sim \mathcal{N}(LOD,\tau) slapped on top of all observations, which naturally leads to not being able to detect things below L, hence my earlier attempts at pursuing a “Limit of Detection” modelling route.

The reason I suspect both a lognormally distributed outcome (experimental noise) as well as a tacked on normal noise (machine noise) is from the data itself.

I can use the same machine to measure bacterial density in two different ways (two different types of light that correlate with density). when I plot them one against the other on a log-log scale, I get the following plot

In high values for both measurements (high density) the errors have one width (lognormal errors), but as the absolute values of measurements get smaller, the noise width increases.( Something you can also see from the plot is that measurement light 2 has a hard time detection values below ~0.1, so one of the measurements is better than the other. In reality, I’m only using measurement light 1, but I know too at some low values measurement light 1 will stop to correlate with the true bacterial density and become noise).

When I simulate data with the following logic I get curves that are very similar to the dose-response curves in my experiment:

y_{obs} \sim Lognormal(true\,density,\sigma) + normal(L,\tau)

The “true_density” (mean densities without experiment or machine noise) are in red , the lognormal term (actual density in particular experiment) is in turquoise, and observed values in black

(the reprex to make this is below and isvery similar to the code I used in the last comment in this thread)

That being said, I’m not sure at all how to model this in Stan and I am smelling that some re-parametrization and theory will greatly simplify the ugly statement above.

Here’s the reprex for the simulation

require(tidyverse)
#  there are 3 unknown mu parameters (mu_1 etc.), 
# and there are also 2 known parameters pi_1 and pi_2 (that will be fed in using data block)
pi_1=1e7; pi_2 = 22;
mu_1=1.8e9;mu_2=0.15;mu_3=0.5;
sigma = 0.8; sigma_cens = 0.25;
set.seed(3)
sim_dat = expand_grid(x1=c(0.3,0.5,0.7), ## predictor 1
                      x2=c(0,(50/1.5^14)*1.5^(0:14)), ## predictor 2
                      rep=1:3,# rep is experimental replicate 
) %>% 
  mutate(
    L = recode(x1, `0.3` = 1e7, `0.5` = 7e6, `0.7` = 2e6),# L is the unknown level of machine noise for each type of bacterial sample -- each is measured in a different way so it has a different limit of detection
    LOD_ID = as.numeric(as.factor((x1))), # bacteria ID, corresponds to the grouping term for which LOD in the model
    X = pi_1*mu_1/(pi_1 + (mu_1-pi_1)*exp(((x2*mu_2 - mu_3*x1)*pi_2)/(x2+mu_3))), ## calculating true_density (mean of bacterial density for conditions across all samples)
    Xln = rlnorm(n(),meanlog=log(X),sdlog=sigma), # experiment noise leads to lognormal realization of X
    y = Xln + rnorm(n(),mean=L,sd=0.25*L), # finally adding machine noise
  ) %>% as.data.frame

sim_dat %>%   ggplot(aes(x2,y)) + geom_point()+
  scale_alpha_manual(values = c(1,0.2))+
  scale_x_continuous(trans=scales::log1p_trans())+
  geom_hline(aes(yintercept=L))+
  scale_y_log10()+
  facet_grid(~x1) + theme_classic() +
  geom_point(aes(y=Xln),color='darkturquoise',alpha=0.5) + 
  geom_point(aes(y=X),color='red3') 

Any guidance appreciated!

Another update: I seem to be getting good results with the following Stan syntax below. I found non-centered parametrization to be necessary for convergence. Something still feels a little un-principled about the way I am writing this, but I am not smart or experienced enough to know how to simplify this idea/model further.

That being said, I think machines like spectrophotometers do not want to give users negative values (there cannot be negative concentrations), so they are probably programmed to tack on a certain baseline noise to make sure samples near zero don’t dip below. This means that this problem should be quite common. I wonder if I should make a separate post about this to make the question more focused in the future. For now here is the model for posterity (sorry for the likely-overused Bayesian pun)

data{
int N; //number of observations
int N_LOD; //number of detection limits -- > number of types of measurements in my case (different for each bacteria, so in my case, N_LOD is the number of different bacteria studied
array[N] int<upper=N_LOD> ID_LOD; // index for which bacteria is measured in which observation
... //other predictors here
array[N] real<lower=0> yobs;
}
parameters{
   ... //here I have model parameters that define the expected value given predictors x1,x2... --> call it "mu"
   real<lower=0> sigma; // experimental noise for estimating how mu is realized in different replicates (deviation between replicates basically)
   vector<lower=0>[N_LOD] LOD; //limits of detection for the different bacteria
   real<lower=0> sigma_cens; //machine noise scale parameter
   vector[N] realized_density_raw; // non-centered variable for the realized density

}

transformed parameters{
   ...// here I calculate  a vector[N] mu based on predictors
   
  //below: noncentered parametrization of realized density (before adding constant machine noise)
  vector[N] realized_density = exp(realized_density_raw*sigma + mu);

  // finally, adding machine noise with mean LOD (different for different bacteria)
  vector<lower=0>[N] censored_density = realized_density + (LOD[LOD_ID]);
  
}
model{
   ...//other parameter priors not shown
  sigma ~ exponential(1);
  sigma_cens ~ exponential(1);
  LOD ~ lognormal(log(1e7),1); //can't detect below around 10 million in my case, depending on bacteria of course
  realized_density_raw ~ std_normal();

  target += lognormal_lpdf(yobs|log(censored_density),sigma_cens);