Priors on imputed values for Lognormal censored nonlinear model cause divergence

dankehila · April 26, 2023, 4:00am

I am fitting a dose-response model, or exposure-response model, where the response is bacterial density (called CATcfu in the model, density being the continuous approximation of counts based on kinetic/mass action logic) as a function of added antibiotic AMP0.

The nonlinear regression equation I have derived and won’t explain here, but basically it involves the three parameters K (maximum bacterial density), d (death rate due to antibiotic), I_A (median inhibitory dose of the antibiotic), which I am fitting, and also a hyperparameter sigma which represents the (lognormal) experimental error.

I have additional parameters that I know a priori and those go into the data block (those are the initial bacterial density y0, the duration of antibiotic exposure t and the bacteria’s growth rate gr.

My experimental data have lower limits of detection, i.e. below a certain bacterial density I cannot reliably tell how many there are. These limits of detection change based on the bacteria I study, and as a consequence, different bacteria (with different growth rates) have different limits of detection.

Despite these complexities and nonlinearities the model fits the data OK but fails with some parameters as I explain below. I included a reprex below in R.

I think my main problem is putting priors on the imputed data. Basically, when I try to place the priors

y_cens ~ lognormal(mu_cens, sigma);

The model suddenly terribly diverges and nothing works. However, if I comment out that line, the fit performs much better.

One parameter that never converges even when the line is commented out is the lognormal standard deviation sigma. I think this is happening since the likelihood is estimated for the uncensored and censored data using the same sigma, but there are no sensible priors and bounds on the censored data.

I’d appreciate hints on what I am doing wrong and also, as always, if there is anything else you seee that’s a glaring mistake in my method.

Below is the stan code and reprex in R (using tidyverse)

//bacteria_dose_response.stan
functions{ // within-chain parallelization to speed up sampling
  real partial_sum(array[] real y_slice,
                    int start, int end,
                    vector muz, real sigma){
                      
    return lognormal_lpdf(y_slice | muz[start:end],sigma);
  }
}

data{
  int<lower=1> N; // total number of data points
  int<lower=0> N_cens; //number of censored (below detection limit) data points
  array[N_cens] real<lower=0> censor_points; //detection limit
  array[N] real<lower=0> CAT; //observed CAT CFU/mL
  vector<lower=0>[N] gr; // growth rates
  vector<lower=0>[N] AMP0; // antibiotic concentration
  vector<lower=0>[N_cens] gr_cens; // growth rates
  vector<lower=0>[N_cens] AMP0_cens; // antibiotic concentration
  real y0; //initial density of bacteria
  real time; //antibiotic exposure time
  int grainsize; //for reduce sum
}
parameters{
  real<lower=y0> K; //maximum density
  real<lower=0> I_A; //real antibiotic IC50 i.e. median inhibitory concentration 
  real<lower=0> d; //real death rate
  real<lower=0> sigma; //lognormal standard deviation
  array[N_cens] real<lower=0> CAT_cens; //imputed missing data
  

}
transformed parameters{
//the ab_term is part of the nonlinear formula. To see it in an untransformed way, you can check out the reprex
  vector[N] ab_term = ((AMP0*d - I_A*gr)*time)./(AMP0+I_A) + log(K-y0); //antibiotic nonlinear term, part of the nonlinear formula
  vector[N_cens] ab_term_cens = ((AMP0_cens*d - I_A*gr_cens)*time)./(AMP0_cens+I_A) + log(K-y0); //antibiotic nonlinear term for censored datapoints
  
  //defining expected values for all data inc. censored data
  vector[N] mu;
  vector[N_cens] mu_cens;
  
  for(i in 1:N){
    mu[i] = log(y0) + log(K) - log_sum_exp(ab_term[i],log(y0)); //nonlinear formula in a log transform
  }
  
  for(i in 1:N_cens){
    mu_cens[i] = log(y0) + log(K) - log_sum_exp(ab_term_cens[i],log(y0));
  }
}
model{

  K ~ lognormal(log(1e9),1); 
  d ~ normal(0,0.15);
  I_A ~ normal(0.5,0.5);
  sigma ~ exponential(1);
  sigma_cens ~ exponential(1);
  
  //If I uncomment the line below, the model no longer converges!
  //CAT_cens ~ lognormal(mu_cens,sigma);


  target+= reduce_sum(partial_sum,CAT,grainsize,mu,sigma);
  target+= reduce_sum(partial_sum,CAT_cens,grainsize,mu_cens,sigma);
}
generated quantities{
  array[N+N_cens] real l_pred = append_array(lognormal_rng(mu,sigma),lognormal_rng(mu_cens,sigma)); //for pp_check
}

Reprex

library(cmdstanr);library(tidyverse)
bacteria_dose_response = cmdstan_model('bacteria_dose_response.stan')
y0=1e7;K_actual=1.8e9;d=0.05;I_A=0.33;time=22; # true parameters of simulation
set.seed(300)
sim_dat = expand_grid(gr=c(0.3,0.5,0.6), ## gr is growth rate
            AMP0=c(0,(50/1.5^14)*1.5^(0:14)), ## AMP0 is antibiotic concentration
            rep=1:5# rep is experimental replicate 
            ) %>% 
  mutate(
    CATcfu = y0*K_actual/(y0 + (K_actual-y0)*exp(((AMP0*d - I_A*gr)*time)/(AMP0+I_A))), ## calculating true bacterial density based on nonlinear formula
    DL = recode(gr, `0.3` = 4e6, `0.5` = 1e7,`0.6`=1.7e7), # detection limits
    CATcfu_cens=ifelse(CATcfu<=DL,DL,CATcfu), # censoring data below detection limit
    CATcfu_werr = rlnorm(n(),meanlog=log(CATcfu_cens),sdlog=1), ## adding experimental noise
    cens = ifelse(CATcfu_werr < DL,1,0)) %>% as.data.frame ## adding censoring column

## visualization of data, panels are the different bacteria (with different growth rates and different detection limits
sim_dat %>%   ggplot(aes(AMP0,CATcfu_werr)) + geom_point(aes(alpha=as.factor(cens)))+
  scale_alpha_manual(values = c(1,0.2))+
  scale_x_continuous(trans=amp_breaks,breaks=round(ampconcs[1:7*2-1],2)) +
  scale_y_log10()+ 
  facet_grid(~gr) + theme_classic()

## dataset in list format for Stan
sim_list = list(
  N= nrow(sim_dat[sim_dat$cens==0,]),
  N_cens= nrow(sim_dat[sim_dat$cens==1,]),
  censor_points=sim_dat[sim_dat$cens==1,'DL'],
  #N_G = length(unique(cat_dat$Genotype)),
  #ID = cat_dat$Genotype,
  CAT = sim_dat[sim_dat$cens==0,'CATcfu'],
  gr = sim_dat[sim_dat$cens==0,'gr'],
  AMP0= sim_dat[sim_dat$cens==0,'AMP0'],
  gr_cens = sim_dat[sim_dat$cens==1,'gr'],
  AMP0_cens= sim_dat[sim_dat$cens==1,'AMP0'],
  y0 = y0, # this is known
  time = 22, # this is known
  grainsize=1
)
## initial values, not necessary
init_sim = function() list(K = 1e9, I_A = 0.3, d = 0.21 sigma = 1)

# fit
fit_sim = bacteria_dose_response$sample(
  data = sim_list, 
  init = init_sim, 
  iter_warmup = 2000,iter_sampling = 2000
)

# note sigma and lp__ don't get properly sampled at all, but other parameters are better
`

Bob_Carpenter · May 1, 2023, 9:48pm

Are you sure that y_cens here only has positive values?

One thing you can try to do is print the values going in and see what they look like. If this line is causing numerical errors, you should be able to see that. I’d recommend pulling out into a loop to test:

print("mu_cens = ", mu_cens, ";  sigma = ", sigma);
for (n in 1:N_cens) {
  Cat_cens[n] ~ lognormal(mu_cens, sigma);
  print("target() = ", target(), ";  Cat_cens[n] = ", Cat_cens[n]);
}

If this winds up setting the target to something that will cause a divergence, this should let you see what entries cause the problem.

This should be OK. The uncensored data and censored data should both contribute to its fit. So you definitely want to diagnose this problem, too.

dankehila · May 2, 2023, 6:26pm

@Bob_Carpenter Thank you for your response.

y_cens corresponds to bacterial counts as so indeed must always be positive. Since I am drawing from a lognormal and I’m using both the censored and uncensored data to calculate sigma, I assumed I have to use lognormal prior for y_cens so that I can use the same sigma as the uncensored data. Is this so?

I have tried removing the lower bound and also printing out the values as you suggested. The former did not work I believe because it proposed negative values into the lognormal statement.

The latter showed the following behaviour:

without the y_cens~lognormal... statement, mu_cens values match the missing data well, and CAT_cens were incredibly variable and correspondingly the sigma value was between 6-20
with the y_cens~lognormal... statement, mu_cens values starts by proposing logical values and a normal sigma, then sigma explodes to infinity, and as sigma drops back down mu_cens values become extremely negative (basically 0). Correspondingly, CAT_cens go to 0.

I should have also mentioned, perhaps you saw it if you ran the model as well, that when I uncomment the y_cens~lognormal... statement, the estimate of sigma jumps into the 100s, and the censored data is all zeroes. Is there possibly an over/underflowing problem?

I have also tried to :

Switch all bacteria to the same detection limit → doesn’t change anything
Separately estimate separate prior of y_cens_test ~ lognormal(mu_cens,sigma) but not use those values in likelihood, just so I can separately print them alongside the true y_cens (which will be used in the likelihood statement) using the code you provided. But when I add the lines

transformed parameters{
...

vector<lower=0>[N_cens]  y_cens_test;
...
}
model{
...
y_cens_test ~ lognormal(mu_cens,sigma);
...
}

I get the error : log_prob: y_cens_test[1] is nan, but must be greater than or equal to 0.000000

I’m sorry if I am missing something obvious. I’m a microbiologist trying to learn Stan because I really enjoy it but I am a total amateur. Any further advice is appreciated.

dankehila · May 8, 2023, 9:53pm

So I changed the model to integrate out censored values instead of impute. I think I was thinking of the limit of detection as a kind of measurement error, whereas in the Stan manual it fits much better under the discussion of censored/truncated data (doh!).

Also, made the decision to have Stan estimate the detection limits of the bacteria (and further validate our detection limits, which we estimate crudely and don’t actually know to that great of an accuracy).

I believe I am doing everything correctly w/r/t/ integrating out the data. Since every bacteria is measured differently, as before, it has a different limit of detection. However, this model provides a very poor estimate of parameters, especially sigma and limits of detection.

Any advice appreciated. I will keep posting updates here in case anyone in the future finds my struggles useful.

data{
  int<lower=1> N; // total number of data points
  int<lower=0> N_LOD; //number of limits of detection -- different for each counting system
  array[N] int<lower=0,upper=N_LOD> LOD_ID; // which counting system is being used 
  array[N] real<lower=0> CAT; //observed CAT CFU/mL
  vector<lower=0>[N] gr; // growth rates
  vector<lower=0>[N] AMP0; // antibiotic concentration
  real y0; //initial density of bacteria
  real time; //antibiotic exposure time
}
parameters{
  real<lower=y0> K; //maximum density
  real<lower=0> I_A; //real antibiotic IC50 i.e. median inhibitory concentration 
  real<lower=0> d; //real death rate
  real<lower=0> sigma; //lognormal standard deviation
  real<lower=0> sigma_cens;//lognormal sd around LOD
  vector<lower=0>[N_LOD] LOD; //limits of detection for the different counting systems
  

}
transformed parameters{
//the ab_term is part of the nonlinear formula. To see it in an untransformed way, you can check out the reprex
  vector[N] ab_term = ((AMP0*d - I_A*gr)*time)./(AMP0+I_A) + log(K-y0); //antibiotic nonlinear term, part of the nonlinear formula
  
  //defining expected values for all data
  vector[N] mu;
  
  for(i in 1:N){
    mu[i] = log(y0) + log(K) - log_sum_exp(ab_term[i],log(y0)); //nonlinear formula in a log transform
  }

}
model{

  K ~ lognormal(log(1e9),1); 
  d ~ normal(0,0.15);
  I_A ~ normal(0.5,0.5);
  sigma ~ exponential(1);
  sigma_cens ~ exponential(1);
  LOD ~ lognormal(log(1e6),1.5);
  
//likelihood 

  
  for(i in 1:N){
//mu is the mean bacterial count predicted by antibiotic concentration as well as growth rate etc.
// so we have to check if that expectation is below or above the detection limit for that specific bacterium
// I use the indexing LOD[LOD_ID[i]] to call out which Limit Of Detection is in effect
    if(exp(mu[i]) > LOD[LOD_ID[i]]) 
    target +=  lognormal_lpdf(CAT[i] | mu[i],sigma); //if mu is above the LOD, that is the good part of the data
    else
    target += lognormal_lcdf(LOD[LOD_ID[i]] | mu[i],sigma_cens); //if it is below the LOD we must integrate it out
  }

}

Topic		Replies	Views
Uncentered exponential priors and hyperpriors Modeling prior-choice , hierarchical-model	7	1792	March 24, 2023
Divergence in fitting a mixture of censored and truncated data Modeling specification , divergences , diagnostics	0	545	October 18, 2021
Jeffreys prior for a vector variable Modeling specification	6	1402	September 12, 2022
Gaussian mixture model with interval censored data -- where do the divergences come from? Modeling	9	1739	July 11, 2018
Dose Response Model with partial pooling on maximum value Modeling fitting-issues	14	1847	September 1, 2020

Priors on imputed values for Lognormal censored nonlinear model cause divergence

Related Topics