Nonlinear negbinomial for exponentially growing count data

This is a followup to a question I detailed under the brms section (Nonlinear negbinomial distributional model). Since I couldn’t figure out how to do the model in brms I switched to Stan directly.

The full description of the problem is in the link above, but I will briefly explain the setup. I have samples contaminated with bacteria, where the initial amount of contaminating bacteria is assumed to be negative-binomially distributed. Some samples come from trained professionals and others from laypeople, and my regression question is the effect of training on reducing contamination rates (basically the mean rate \mu of the negbinomial; I suspect the shape parameter \phi may be affected as well, though I haven't included that in the current code).

However, the samples arrive at the lab at different times, and at different temperatures, which means that they have grown from their initial amounts, and this distorts the data. Fortunately, I know the time and temperature at which they arrived in the lab, so I can correct for that.

I combine temperature and sample arrival time information into a growth parameter A, which allows me to specify the relationship between actual observed counts N(t) and initial counts N(0), as N(t)=N(0)e^{r \cdot A}. I would like to model

N(0)_{i} \sim \mathrm{negbinomial}(\mu_{i},\phi)
\log(\mu_{i}) = \beta_{Training}
\beta_{Training} \sim \mathcal{N}(0,5)
\log(\phi) \sim \mathcal{N}(-3,3)
r \sim \mathcal{N}(-3,1.5)
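
For clarity, the A above is zero for samples that stayed at or below 6 °C, and otherwise it is the product of the sample age and the squared temperature excess above 6 °C (temperatures in Kelvin), as in the simulation code below:

A_{i} = \mathrm{age}_{i} \cdot \max\big(0,\ \mathrm{temp}_{i} - (6 + 273.15)\big)^{2}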

Below is a reprex to generate a simulated data.frame in R:

set.seed(12)
require(tidyverse)
sim_dat_tempage = 
  # adding predictor column "Training"
  data.frame(Trained=c('Untrained','Trained')) %>% 
  # adding mu and phi parameters for untrained and trained samples
  # Untrained samples are contaminated with mu = 1000, trained samples with mu = 1
  # (phi = 0.1 for both)
  mutate(mu=ifelse(Trained=='Untrained',1000,1),
         phi=ifelse(Trained=='Untrained',0.1,0.1)) %>% 
  rowwise() %>% 
  # Generating random data
  mutate(count=map2(mu,phi,~rnbinom(n=1e3,mu=.x,size=.y))) %>% 
  unnest(count) %>% ungroup() %>% 
  # Adding sample age and sample temp column,
  # Distributed roughly as they are in my real dataset
  mutate(sample_age = rgamma(n(),shape=1.64,rate=0.06),
         sample_temp = rlnorm(n(),meanlog=5.63,sdlog=0.016),
         #Adding the bacterial count data after growth given growth formula above
         count_grown = as.integer(round(ifelse(sample_temp<(6+273.15),
                                    count,count*exp(0.002*sample_age*(sample_temp-(6+273.15))^2)))),
         # Finally, creating the variable A above, which is the product of the sample age
         # and the squared excess temperature above 6 degrees (temperatures in Kelvin)
         A = sample_age*ifelse(sample_temp<(6+273.15),0,(sample_temp-(6+273.15))^2),
         b_Trained = ifelse(Trained=="Untrained",0,1),
         b_Untrained = ifelse(Trained=="Untrained",1,0))

I wrote the Stan program below, which works ONLY IF I specify initial values:
init_fun <- function() list(r = log(0.002), phi = log(0.1), b = log(c(1000, 1)))

If I don't provide initial values I get the "no chains finished successfully" error (on cmdstan).

With initial values it runs, but it repeatedly gives the following warning:

Exception: neg_binomial_2_log_lpmf: Log location parameter[3] is inf, but must be finite!
If this warning occurs often then your model may be either severely ill-conditioned or misspecified.

and indeed the warning occurs often.

I am interested in better ways of coding this problem, since the estimates are not very good even when the chains converge. I should add that for my real data set I may not have good guesses for initial values (I can easily come up with them here because the data are simulated).

Also, conceptually, I am mostly doing this because I am interested in predicting the number of samples that have low amounts of contamination (0-10 bacteria), and a plain negative binomial on the post-growth bacterial counts underestimates these low counts (because it has to accommodate the extra overdispersion introduced by growth, if that makes sense). So I also have a statistical question: would it be appropriate in this case to filter the data to samples that haven't grown "too much" (basically keep only the fresh and cold samples) and fit a right-censored negative binomial with anything beyond 10 bacteria censored, thereby focusing attention on the range I'm interested in?
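
For concreteness, I imagine the censored part of the likelihood would look something like the sketch below (the censoring point of 10, the priors, and the variable names are just placeholders, and I have left out the training effect):

data {
  int<lower=1> N;
  int<lower=0> U;                       // censoring point, e.g. 10
  array[N] int<lower=0> bac;            // counts, capped at U
  array[N] int<lower=0, upper=1> cens;  // 1 if the true count exceeds U
}
parameters {
  real<lower=0> mu;   // mean contamination of the fresh/cold samples
  real<lower=0> phi;  // shape
}
model {
  mu ~ lognormal(0, 2);    // placeholder priors
  phi ~ lognormal(-3, 3);
  for (i in 1:N) {
    if (cens[i] == 1)
      target += neg_binomial_2_lccdf(U | mu, phi);    // Pr(count > U)
    else
      target += neg_binomial_2_lpmf(bac[i] | mu, phi);
  }
}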

Any advice appreciated, thanks!

data {
  int<lower=1> N; //number of data points
  int<lower=1> S; //number of training statuses (2)
  array[N] int bac; // bacteria contamination counts
  matrix[N,S] X; // Training status
  vector[N] A; //age and temp multiplier
}

parameters {
  real r; // unknown growth rate
  vector[S] b; // mean rates for trained/untrained samples
  real phi; // log of the shape parameter phi
}

transformed parameters{
  real exp_phi = exp(phi); // log link helps dealing with small shape parameters
}
model {
    //likelihood
    vector[N] mu; //mean rate of counts
    vector[N] init_count = X*b; // initial counts given training status
    for(i in 1:N){
      mu[i] = init_count[i] * exp(exp(r) * A[i]);
    }
    target += neg_binomial_2_log_lpmf(bac | mu, exp_phi);
    
    //priors
    r ~ normal(-3,1.5);
    phi ~ normal(-3,3);
    b ~ normal(0,5);
}


Hi, if I remember correctly, neg_binomial_2_log expects the log mean as input. From your code, it seems you are providing the expectation on the natural scale. As a result, your NB model currently can't achieve expectations less than 1: the natural-scale mu you pass in is always positive, so the implied mean exp(mu) is always greater than 1.

Define the log expectation for neg_binomial_2_log as:

mu_log =  log(init_count) + rate * A

Edit: sorry, I got confused by the double exponential. For legibility, I would recommend that you define all the parameters on the natural scale with appropriate bounds. You would then have to adjust the definitions of your priors as well. The rate parameter in my example above is defined on the natural scale!
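
For concreteness, here is a minimal sketch of the parameters and model blocks I have in mind, reusing your data block (the bounds and priors are only illustrative, not a drop-in fix):

parameters {
  real<lower=0> r;       // growth rate on the natural scale
  vector[S] b;           // log mean initial count per training status
  real<lower=0> phi;     // NB shape parameter on the natural scale
}
model {
  // with an indicator matrix X, X*b is log(init_count), so the log mean
  // of the observed (grown) counts is log N(0) + r * A
  vector[N] mu_log = X*b + r * A;

  // priors roughly matching your log-scale priors above; adjust as needed
  r ~ lognormal(-3, 1.5);
  phi ~ lognormal(-3, 3);
  b ~ normal(0, 5);

  target += neg_binomial_2_log_lpmf(bac | mu_log, phi);
}

With phi declared directly on the natural scale, the transformed parameters block is no longer needed.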

As for the shape parameter phi: assuming that it is the same for all observations means you assume that each observed number of bacteria is based on the same volume of tested sample. If the volumes vary across observations, both the mean and the shape parameter should be scaled linearly with the volume (i.e. for a twice as big sample, phi should be set twice as big). This follows from the fact that the sum of X NB-distributed variables with the same mean and shape parameters also follows an NB distribution with

mean_{sum} = mean \cdot X
shape_{sum} = shape \cdot X
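
Purely as a hypothetical illustration (V is not in your data, and your volumes may well be constant): if a relative sample volume V were recorded per observation, the likelihood line in the sketch above would change to something like

// V: hypothetical vector[N] of sample volumes relative to a reference volume;
// both the mean and the shape scale linearly with volume
target += neg_binomial_2_log_lpmf(bac | mu_log + log(V), phi * V);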

Just out of curiosity: how convinced are you that an NB distribution is needed instead of a Poisson? Do the bacteria cluster together in the medium while you sample it? If not, I would expect a Poisson to suffice.

Thank you very much for all these pointers.

Yes, I think I meant to model on the log scale, but somehow it completely escaped me that I would have to log-transform the growth equation. My bad!

I took the suggestion of modelling the growth rate r on the log scale (hence the double exponential) from another forum user in my previous post on this problem. The growth rates I expect are very low because of the timescales considered. Sorry about the legibility. Perhaps I can avoid such small numbers by changing the units of the temperature/age parameter A, which would remove the need to exponentiate r.

The point about the shape parameter is very interesting; I had not thought about the biological meaning of phi in this way. However, the tests are done by a third-party lab and are very standardized, including with regard to volume, but it is for this same reason that we have the temperature/age problem (all the samples have to be transported to that same place).

The dataset is over-dispersed, has a lot of zeroes, and spans more than 4 orders of magnitude. I believe the reason for this is that contamination can arise at several steps of sample collection, and each step has its own propensity for contaminating as well as a potentially different number of contaminants introduced. This research is proprietary and I cannot provide more information unfortunately, but for example contamination from the person themselves may be airborne, while contamination when packaging the sample may come from contaminants in the flask, and these will lead to differently sized 'clumps' of contaminants.

Still with regards to dispersion: using the uncorrected contamination counts, negative binomial and Poisson models, as well as their hurdle/zero-inflated variants, all have trouble dealing with the level of overdispersion in the data. Specifically, they all seem to trade off trying to predict the many irregularly high counts in some samples against predictive power in the low-contamination range (0-10). So I think that even after temperature and age correction the data will still be quite dispersed (too much for a Poisson), but it may be a good idea to try ZIP on the temperature/age-corrected counts.
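
By ZIP I mean the standard zero-inflation construction, roughly the sketch below applied to the corrected counts (y, mu_log and theta are placeholders rather than variables from the model above, and the priors are arbitrary):

data {
  int<lower=1> N;
  array[N] int<lower=0> y;        // temperature/age-corrected counts
}
parameters {
  real mu_log;                    // log mean of the Poisson component
  real<lower=0, upper=1> theta;   // zero-inflation probability
}
model {
  mu_log ~ normal(0, 5);
  for (i in 1:N) {
    if (y[i] == 0)
      target += log_sum_exp(bernoulli_lpmf(1 | theta),
                            bernoulli_lpmf(0 | theta)
                              + poisson_log_lpmf(0 | mu_log));
    else
      target += bernoulli_lpmf(0 | theta)
                + poisson_log_lpmf(y[i] | mu_log);
  }
}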

I would ideally like to model phi as well (a so-called distributional model) and introduce a multilevel component, but I presented the simplest case here because I am building the model stepwise. In my real dataset the shape parameter is rather different between the trained and untrained treatments. But I think it is important to note that, even with the same \phi initially, the temperature/age distortion affects both \mu and \phi, so I want to check whether correcting for this distortion also makes the shape parameters more similar between treatments.
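
Roughly, I imagine the distributional extension would look something like the sketch below, building on the natural-scale parameterisation suggested above (names and priors are provisional):

parameters {
  real<lower=0> r;     // growth rate
  vector[S] b;         // log mean initial count per training status
  vector[S] b_phi;     // log shape per training status
}
model {
  vector[N] mu_log = X*b + r * A;    // log mean of the grown counts
  vector[N] phi = exp(X*b_phi);      // training-specific shape
  r ~ lognormal(-3, 1.5);
  b ~ normal(0, 5);
  b_phi ~ normal(-3, 3);
  target += neg_binomial_2_log_lpmf(bac | mu_log, phi);
}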
