I am trying to model blood biomarker data collected longitudinally in a sample of patients vs. controls. There are 4 sampling timepoints, with patient data collected either at timepoint 1 or 2 (I have considered combining them into one timepoint, but biologically speaking it is better to keep them separate). The biomarker concentrations are heavily left-censored, with a large proportion of values below assay detection limits (< LOD, or non-detects).
I believe the best modelling strategy is a two-part (hurdle) mixed-effects mixture model: a Bernoulli component for predicting detects vs. non-detects, and a lognormal component truncated at the LOD for the detected values. I originally planned to fit a mixed-effects censored (Tobit) regression, but I found that the parameter estimates were strongly affected by the proportion of non-detects. Please advise if this reasoning is wrong.
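To be explicit about what I mean by the two-part model, here is a sketch of the likelihood I have in mind. The notation is my own ($d_{ij}$ is the detect indicator for subject $i$ at timepoint $j$, $y_{ij}$ the concentration, $L$ the LOD, $x_{ij}$ the fixed-effect covariates), and it is the standard hurdle form as I understand it, not necessarily exactly what brms builds from my code:

```latex
\begin{aligned}
d_{ij} &\sim \mathrm{Bernoulli}(\pi_{ij}), &
\operatorname{logit}(\pi_{ij}) &= x_{ij}^{\top}\beta^{(d)} + u_i^{(d)}, \\
y_{ij} \mid d_{ij} = 1 &\sim \mathrm{Lognormal}(\mu_{ij}, \sigma)\ \text{truncated to } y_{ij} > L, &
\mu_{ij} &= x_{ij}^{\top}\beta^{(y)} + u_i^{(y)}, \\
p(y_{ij} \mid d_{ij} = 1) &=
\frac{\mathrm{Lognormal}(y_{ij} \mid \mu_{ij}, \sigma)}
     {1 - \Phi\!\left(\dfrac{\log L - \mu_{ij}}{\sigma}\right)}, &
y_{ij} &> L, \\
\bigl(u_i^{(d)}, u_i^{(y)}\bigr) &\sim \mathcal{N}(0, \Sigma).
\end{aligned}
```

Here $\Phi$ is the standard normal CDF, and $\Sigma$ lets the subject-level intercepts of the two parts be correlated.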
Example code for one biomarker is below:
library(brms)

cyt <- "IL_7"
resp_censobs <- paste0(cyt, "_censobs")  # all non-detects recoded to the LOD
resp_detect <- paste0(cyt, "_detect")    # detected? TRUE/FALSE
lod <- LOD_vals[[cyt]]

# Shared predictors, kept on one line so the formula string has no line breaks;
# the "p" ID correlates the subject intercepts across the two parts
rhs <- "Group_Mixed * timepoint + Sex + Age_centred + (1 | p | Subj_ID)"

# Part 1: Bernoulli model for detect vs. non-detect
bf_det <- bf(
  as.formula(paste0(resp_detect, " ~ ", rhs)),
  family = "bernoulli"
)

# Part 2: lognormal model truncated below at the LOD
bf_pos <- bf(
  as.formula(paste0(resp_censobs, " | trunc(lb = ", lod, ") ~ ", rhs)),
  family = "lognormal"
)

fit_hurdle <- brm(
  bf_det + bf_pos + set_rescor(FALSE),
  data = df,
  backend = "cmdstanr",
  chains = 4, cores = 8,
  prior = priors,
  iter = 4000,
  warmup = 1000,
  save_pars = save_pars(all = TRUE),
  control = list(adapt_delta = 0.999, max_treedepth = 15),
  init = 0
)
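For context, the two response columns are derived roughly as follows. This is only a sketch on toy data: the data frame `df_toy`, the LOD value of 0.5, and the coding of non-detects as `NA` in the raw column are assumptions for illustration, not my real dataset.

```r
# Toy example (hypothetical values): non-detects arrive as NA in the raw column
cyt <- "IL_7"
LOD_toy <- list(IL_7 = 0.5)
df_toy <- data.frame(
  Subj_ID = c("s1", "s1", "s2", "s2"),
  IL_7    = c(NA, 1.2, 0.8, NA)
)

lod_toy <- LOD_toy[[cyt]]
raw <- df_toy[[cyt]]

# Detected = a reported value strictly above the LOD
detected <- !is.na(raw) & raw > lod_toy
df_toy[[paste0(cyt, "_detect")]] <- detected

# Non-detects are recoded to the LOD; detected values are kept as measured
df_toy[[paste0(cyt, "_censobs")]] <- ifelse(detected, raw, lod_toy)
```

Note that with this coding every non-detect sits exactly at the truncation bound in the `_censobs` column.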
Please let me know if this is the correct modelling strategy and if there are any fixes you would recommend.