Hi
I am analyzing bacterial count data and looking at the difference in contamination rates between samples handled by trained professionals vs untrained workers, trying to estimate the effect of training in reducing contamination.
I have good reason to believe that the the data is negbinomial distributed and that both negbinomial parameters. mu and shape/phi/theta/size (the scale parameter with unfortunately so many names), are both affected by training.
However, the problem is that samples arrive in the lab for bacterial testing in an imperfect manner: specifically, they may arrive too warm or too late, which leads whatever contaminants there are to increase in number.
I created a simulated dataset to illustrate this. We know these bacteria cannot grow below 6 degrees celcius, but above it they grow at approx
N(t) = \begin{cases} N_{0}e^{rt(T-6)^{2}}\text{if } T>6\\ N_{0}\hspace{31pt}\text{ if } T≤6 \end{cases}
where r is an unknown growth rate (which is intrinsic to the specific contaminating bacteria and thus the same for all samples), t is the sample age and T is the sample temperature. This relationship is known as the Ratkowski curve, and it predicts downto the minimum temperature (6). Below 6 degrees growth is zero, not growing again as would be implied if I didn’t use a conditional expression as shown above (the original formulation involves a square root term on the growth rate, thus avoiding growth below minimum temperatures, but I have to square it here to get the correct expression).
I am not so much interested in knowing the growth rate r as much as correcting the effect of sample age and sample temperature on the effect of treatment.
I include a reprex below to show both the effect of sample age and temperature on distorting the negbinomial parameters, and to test if my brm call works. In making the simulated dataset I simplified the expression above by creating a new variable, the excess temperature or \Delta T, which turns all sample temperatures ≤6 → 0 and all temperatures > 6 → T-6. I’m thus working only with the excess temperature above minimum. Note the original sample_temp
column of the data.frame
is in Kelvins (had to avoid using negative temperatures for another analysis).
This modification effectively reduces my model to N_{t}=N_{0}e^{rA},\text{where }A=t*\Delta T^{2}
Thus I may write my model in the following way.
Final bacterial Count ~ negbinomial(mu,phi)
mu,phi = Training * exp (r * A)
I have so far failed incorporating this nonlinear formula into brms and correctly estimating the parameters r and the effect of Training on simulated data.
# see code below for reprex
set.seed(42)
require(tidyverse);require(brms)
set.seed(42)
sim_dat_tempage =
# adding predictor column "Training"
data.frame(Trained=c('Untrained','Trained')) %>%
# adding mu and phi parameters for untrained and trained samples
# Untrained samples are contaminated with mu = 1000 and phi = 0.01
# Trained samples containated with mu = 1 and phi = 0.1
mutate(mu=ifelse(Trained=='Untrained',1000,1),
phi=ifelse(Trained=='Untrained',0.01,0.1)) %>%
rowwise() %>%
# Generating random data
mutate(count=map2(mu,phi,~rnbinom(n=1e3,mu=.x,size=.y))) %>%
unnest(count) %>% ungroup() %>%
# Adding sample age and sample temp column,
# Distributed roughly as they are in my real dataset
mutate(sample_age = rgamma(n(),shape=1.64,rate=0.06),
sample_temp = rlnorm(n(),meanlog=5.63,sdlog=0.016),
#Adding the bacterial count data after growth given growth formula above
count_grown = round(ifelse(sample_temp<(6+273.15),
count,count*exp(0.002*sample_age*(sample_temp-(6+273.15))^2))),
# Finally, creating the variable A above, which is the product of the sample age
# and the excess temperature above 6 degrees
A = sample_age*ifelse(sample_temp<(6+273.15),0,(sample_temp-(6+273.15))^2))
Optional code that shows how using count_grown underestimates improvement in expected incidence of contamination due to training
# using the formula for variance based on mu and phi for the negative binomial
summary_dat = sim_dat_tempage %>% group_by(Trained) %>%
summarize(mu_init=mean(count),
phi_init=mean(count)^2/(var(count)-mean(count)),
mu_grown=mean(count_grown),
phi_grown=mean(count_grown)^2/(var(count_grown)-mean(count_grown))) %>%
as.data.frame()
# x axis for contaminating bacteria
contam_grid = seq(1,10)
contam_prob_trained_i = pnbinom(contam_grid,mu=summary_dat[1,2],size=summary_dat[1,3],lower.tail = F)
contam_prob_untrained_i = pnbinom(contam_grid,mu=summary_dat[2,2],size=summary_dat[2,3],lower.tail = F)
contam_prob_trained_grown = pnbinom(contam_grid,mu=summary_dat[1,4],size=summary_dat[1,5],lower.tail = F)
contam_prob_untrained_grown = pnbinom(contam_grid,mu=summary_dat[2,4],size=summary_dat[2,5],lower.tail = F)
# motivation for including temperature and age
# incidence rate ratio is underestimated
plot(contam_grid,contam_prob_untrained_i/contam_prob_trained_i,
xlab='Contaminants count in sample',ylab='incidence rate ratio Trained vs. Untrained')
points(contam_grid,contam_prob_untrained_grown/contam_prob_trained_grown,col='darkred')
legend(1, 4, legend=c("True count", "Grown counts"),
col=c("black", "darkred"), lty=1:2, cex=0.8)
Finally, here is (one of) my attempts at a brms
call. I have three problems:
- There is no modelled coefficient for training status is estimated, only for the growth parameter
r
(which I don’t really care about except for diagnostics in this case). - I cannot manage to include a separate prediction for the auxiliary parameter
shape
orphi
, otherwise I get the errorError: The following variables can neither be found in 'data' nor in 'data2':'r'
- The sampling is very variable between chains, and the posterior for r has a sharp peak around 0. I would ideally like r to be log-linked as well, which I thought brms does since phi and mu parameters are both log linked, but i’m not sure if that’s actually happening.
fit_sim_tempage = brm(
bf(count_grown ~ Trained*exp(r*A),
family=negbinomial(link_shape = 'log'),
r~1,nl = TRUE),
data = sim_dat_tempage,
prior=prior("normal(-5,3)" ,nlpar='r') ,
chains=4,cores=4,sample_prior = T
)
Any advice greatly appreciated