Nonlinear negbinomial distributional model


I am analyzing bacterial count data and looking at the difference in contamination rates between samples handled by trained professionals vs untrained workers, trying to estimate the effect of training in reducing contamination.

I have good reason to believe that the the data is negbinomial distributed and that both negbinomial parameters. mu and shape/phi/theta/size (the scale parameter with unfortunately so many names), are both affected by training.

However, the problem is that samples arrive in the lab for bacterial testing in an imperfect manner: specifically, they may arrive too warm or too late, which leads whatever contaminants there are to increase in number.

I created a simulated dataset to illustrate this. We know these bacteria cannot grow below 6 degrees celcius, but above it they grow at approx

N(t) = \begin{cases} N_{0}e^{rt(T-6)^{2}}\text{if } T>6\\ N_{0}\hspace{31pt}\text{ if } T≤6 \end{cases}

where r is an unknown growth rate (which is intrinsic to the specific contaminating bacteria and thus the same for all samples), t is the sample age and T is the sample temperature. This relationship is known as the Ratkowski curve, and it predicts downto the minimum temperature (6). Below 6 degrees growth is zero, not growing again as would be implied if I didn’t use a conditional expression as shown above (the original formulation involves a square root term on the growth rate, thus avoiding growth below minimum temperatures, but I have to square it here to get the correct expression).

I am not so much interested in knowing the growth rate r as much as correcting the effect of sample age and sample temperature on the effect of treatment.

I include a reprex below to show both the effect of sample age and temperature on distorting the negbinomial parameters, and to test if my brm call works. In making the simulated dataset I simplified the expression above by creating a new variable, the excess temperature or \Delta T, which turns all sample temperatures ≤6 → 0 and all temperatures > 6 → T-6. I’m thus working only with the excess temperature above minimum. Note the original sample_temp column of the data.frame is in Kelvins (had to avoid using negative temperatures for another analysis).

This modification effectively reduces my model to N_{t}=N_{0}e^{rA},\text{where }A=t*\Delta T^{2}

Thus I may write my model in the following way.

Final bacterial Count ~ negbinomial(mu,phi)

mu,phi = Training * exp (r * A)

I have so far failed incorporating this nonlinear formula into brms and correctly estimating the parameters r and the effect of Training on simulated data.

# see code below for reprex
sim_dat_tempage = 
# adding predictor column "Training"
  data.frame(Trained=c('Untrained','Trained')) %>% 
# adding mu and phi parameters for untrained and trained samples
# Untrained samples are contaminated with mu = 1000 and phi = 0.01
# Trained samples containated with mu = 1 and phi = 0.1
         phi=ifelse(Trained=='Untrained',0.01,0.1)) %>% 
  rowwise() %>% 
# Generating random data
  mutate(count=map2(mu,phi,~rnbinom(n=1e3,mu=.x,size=.y))) %>% 
  unnest(count) %>% ungroup() %>% 
# Adding sample age and sample temp column,
# Distributed roughly as they are in my real dataset
  mutate(sample_age = rgamma(n(),shape=1.64,rate=0.06),
         sample_temp = rlnorm(n(),meanlog=5.63,sdlog=0.016),
#Adding the bacterial count data after growth given growth formula above
         count_grown = round(ifelse(sample_temp<(6+273.15),
# Finally, creating the variable A above, which is the product of the sample age
# and the excess temperature above 6 degrees
         A = sample_age*ifelse(sample_temp<(6+273.15),0,(sample_temp-(6+273.15))^2))

Optional code that shows how using count_grown underestimates improvement in expected incidence of contamination due to training

# using the formula for variance based on mu and phi for the negative binomial
summary_dat = sim_dat_tempage %>% group_by(Trained) %>% 
            phi_grown=mean(count_grown)^2/(var(count_grown)-mean(count_grown))) %>%

# x axis for contaminating bacteria
contam_grid = seq(1,10)
contam_prob_trained_i = pnbinom(contam_grid,mu=summary_dat[1,2],size=summary_dat[1,3],lower.tail = F)
contam_prob_untrained_i = pnbinom(contam_grid,mu=summary_dat[2,2],size=summary_dat[2,3],lower.tail = F)
contam_prob_trained_grown = pnbinom(contam_grid,mu=summary_dat[1,4],size=summary_dat[1,5],lower.tail = F)
contam_prob_untrained_grown = pnbinom(contam_grid,mu=summary_dat[2,4],size=summary_dat[2,5],lower.tail = F)
# motivation for including temperature and age
# incidence rate ratio is  underestimated
     xlab='Contaminants count in sample',ylab='incidence rate ratio Trained vs. Untrained')
legend(1, 4, legend=c("True count", "Grown counts"),
       col=c("black", "darkred"), lty=1:2, cex=0.8)

Finally, here is (one of) my attempts at a brms call. I have three problems:

  1. There is no modelled coefficient for training status is estimated, only for the growth parameter r (which I don’t really care about except for diagnostics in this case).
  2. I cannot manage to include a separate prediction for the auxiliary parameter shape or phi , otherwise I get the error Error: The following variables can neither be found in 'data' nor in 'data2':'r'
  3. The sampling is very variable between chains, and the posterior for r has a sharp peak around 0. I would ideally like r to be log-linked as well, which I thought brms does since phi and mu parameters are both log linked, but i’m not sure if that’s actually happening.
fit_sim_tempage = brm(
  bf(count_grown ~ Trained*exp(r*A),
     family=negbinomial(link_shape = 'log'), 
      r~1,nl = TRUE),
    data = sim_dat_tempage,
  prior=prior("normal(-5,3)" ,nlpar='r') ,
  chains=4,cores=4,sample_prior = T

Any advice greatly appreciated

1 Like

Hello, what about something like:

brm(bf(count_grown ~ start_count*(exp(exp(r)*A)), 
r ~ 1, 
start_count ~ 1 + trained, 
nl = TRUE), 

This implements the categorical effect of trained on the latent variable start_count, and models r on the log-scale so it will be positively constrained. The logarithm of the response count_grown is predicted as the negbinomial() uses the log link by default.

1 Like

Hi thank you kindly for your suggestion. It seems I am specifying my priors wrong or something but I can’t get the sampling to work well.

I tried a bunch of different ones but in my mind these are logical
prior = c(set_prior("student_t(3,0,5)", class="b",nlpar='startcount'), set_prior("normal(-5,3)",nlpar='r')),

rate r itself is small and when modelled on the logscale should be negative, and the class b ‘startcount’ parameter (had to remove the underscore for brms) should also be modelled on the log scale and I’m using a prior centered at zero which I think is not surprising for a negbinomial.

The error is neg_binomial_2_log_lpmf: Log location parameter[2] is -inf, but must be finite!

In addition, I re-created the simulated dataset above, but using the same shape parameter phi for both trained and untrained groups. It appears that the growth process strongly affects the apparent phi of the data, and for this reason, I think it would be important to fit this as a distibutional model. If the shape parameter appears different in the transformed (count_grown) data, it is good to at least check if back-transforming (correcting for growth, I mean) yields similar or in fact different phi values for both datasets.

I remade sim_dat_tempage but with phi == 0.1 for both groups. The apparent phi for the untrained group and trained group, when using the count_grown data, is 0.001 and 0.09, respectively.

I tried to make a distributional nonlienar model using brms:

fix_grown_formula = brms::brmsformula(family=negbinomial(link = "log", link_shape = "log"),
                       count_grown ~ n0*exp(exp(r)*A),shape~n0*exp(exp(r)*A),
                        nl = TRUE)

fix_grown_model = brm(fix_grown_formula,
                      prior = c(set_prior("student_t(3,0,2)", class="b",nlpar='n0'),
                      chains=8,data=sim_dat_tempage,backend='cmdstanr',sample_prior = T)

This model does not run, however, and gives the following error
Error: The following variables can neither be found in 'data' nor in 'data2': 'n0', 'r'

I know the brms formula syntax can get quite sophisticated. I tried adding several bf() iteratively but to no avail.