This is a followup to a question I detailed under the brms
section (Nonlinear negbinomial distributional model). Since I couldn’t figure out how to do the model in brms I switched to Stan directly.
The full description of the problem is in the link above, but I will explain briefly the setup. I have samples contaminated with bacteria, where the initial amount of contaminating bacteria is assumed to be negbinomially distributed. Some samples arrived from trained professionals and others from laypeople, and my regression question is the effect of training on reducing contamination rates (basically the mean rate \mu of the negbinomial, although I suspect the shape parameter \phi may be affected as well, though I haven’t included that in this current code).
However, the samples arrive at the lab at different times, and at different temperatures, which means that they have grown from their initial amounts, and this distorts the data. Fortunately, I know the time and temperature at which they arrived in the lab, so I can correct for that.
I combine temperature and sample arrival time information into a growth parameter A, which allows me to specify the relationship between actual observed counts N(t) and initial counts N(0), as N(t)=N(0)e^{r \cdot A}. I would like to model
N(0)_{i} \sim negbinomial(\mu_{i},\phi)
log(\mu_{i} )= \beta_{Training}
\beta_{Training} \sim \mathcal{N}(0,5)
log(\phi) \sim \mathcal{N}(-3,3)
r \sim \mathcal{N}(-3,1.5)
Below is a reprex data.frame
to generate simulated data in R
# see code below for reprex
set.seed(12)
require(tidyverse)
sim_dat_tempage =
# adding predictor column "Training"
data.frame(Trained=c('Untrained','Trained')) %>%
# adding mu and phi parameters for untrained and trained samples
# Untrained samples are contaminated with mu = 1000 and phi = 0.01
# Trained samples containated with mu = 1 and phi = 0.1
mutate(mu=ifelse(Trained=='Untrained',1000,1),
phi=ifelse(Trained=='Untrained',0.1,0.1)) %>%
rowwise() %>%
# Generating random data
mutate(count=map2(mu,phi,~rnbinom(n=1e3,mu=.x,size=.y))) %>%
unnest(count) %>% ungroup() %>%
# Adding sample age and sample temp column,
# Distributed roughly as they are in my real dataset
mutate(sample_age = rgamma(n(),shape=1.64,rate=0.06),
sample_temp = rlnorm(n(),meanlog=5.63,sdlog=0.016),
#Adding the bacterial count data after growth given growth formula above
count_grown = as.integer(round(ifelse(sample_temp<(6+273.15),
count,count*exp(0.002*sample_age*(sample_temp-(6+273.15))^2)))),
# Finally, creating the variable A above, which is the product of the sample age
# and the excess temperature above 6 degrees
A = sample_age*ifelse(sample_temp<(6+273.15),0,(sample_temp-(6+273.15))^2),
b_Trained = ifelse(Trained=="Untrained",0,1),
b_Untrained = ifelse(Trained=="Untrained",1,0))
I wrote the following program below which works ONLY IF I specify initial values:
init_fun <- function() list(r = log( phi = log(0.1) ,b = log(c(1000,1)))
If no initial values I get the (no chains finished successfully
error (on cmdstan)
Then it works but gives the following error
Exception: neg_binomial_2_log_lpmf: Log location parameter[3] is inf, but must be finite!
and if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
indeed the error occurs often
I am interested in better ways of coding this problem since the estimates even when the chains converge are not very good. I should add that in my real data set I may not have good guesses for initial values (I can easily do here because it’s simulated data).
Also, conceptually, I am mostly doing this because I am interested in predicting the amount of samples that have low amounts of contamination (0-10 bacteria), and a normal negative binomial on bacterial counts after growth underestimates these low counts (because it has to handle this sort of over-overdispersion due to growth, if that makes sense). So I also have this statistical question, if it would be appropriate in this case to filter the data to samples that haven’t grown “too much” (keep only the fresh and cold samples basically) and perform a right-censored negative binomial with anything beyond 10 bacteria getting censored, therefore focussing the attention to that range I’m interested in.
Any advice appreciated, thanks!
data {
int<lower=1> N; //number of data points
int<lower=1> S; //number of training statuses (2)
array[N] int bac; // bacteria contamination counts
matrix[N,S] X; // Training status
vector[N] A; //age and temp multiplier
}
parameters {
real r; // unknown growth rate
vector[S] b; // mean rates for trained/untrained samples
real phi; // a.k.a shape parameter or phi
}
transformed parameters{
real exp_phi = exp(phi); // log link helps dealing with small shape parameters
}
model {
//likelihood
vector[N] mu; //mean rate of counts
vector[N] init_count = rep_vector(0.0,N); // latent variable for initial counts
init_count += X*b;
for(i in 1:N){
mu[i] = init_count[i] * (exp(exp(r) * A[i]));
}
target += neg_binomial_2_log_lpmf(bac | mu, exp_phi);
//priors
r ~ normal(-3,1.5);
phi ~ normal(-3,3);
b ~ normal(0,5);
}