Hi all,
I’m currently working on an R package that models the mean (\mu) and overdispersion (\theta) of single cell gene expression (a strictly non-negative, overdispersed count) using a Negative-binomial hierarchical model with random effects per-gene. I’m using variational inference through the meanfield algorithm via brms
with the cmdstanr
backend to fit the model. The model worked well on simulated data i.e., the posterior estimates were relatively close to Frequentist estimates of the same statistics, and I was able to generate credible intervals for the mean and variance of each gene’s expression after fitting the model. However, when applying it to a real dataset I’ve observed wildly over-inflated estimates of the mean expression for each gene on the order of e.g., 5x10^55 as opposed to the “true” mean of ~4. The basic model and priors I’m using are as follows; y_{ig} is the expression of gene g in cell i, and the subscripts 0 and g indicate a global intercept and a gene-specific random effect, respectively:
Assuming we have a dataframe called expr_df
containing 2 columns, one for gene expression called count
and another for gene name called gene
, this is the code I’m using to fit the distributional regression model:
model_priors <- c(brms::set_prior("normal(0, 2)", class = "Intercept", dpar = "mu"),
brms::set_prior("student_t(3, 0, 2)", class = "sd", dpar = "mu"),
brms::set_prior("normal(0, 2)", class = "Intercept", dpar = "shape"),
brms::set_prior("student_t(3, 0, 2)", class = "sd", dpar = "shape"))
model_formula <- brms::bf(count ~ 1 + (1 | gene), shape ~ 1 + (1 | gene))
brms_fit <- brms::brm(model_formula,
prior = model_priors,
data = expr_df,
family = brms::negbinomial(link = "log", link_shape = "log"),
chains = 3L,
iter = 1000L,
warmup = 250L,
thin = 5L,
cores = 3L,
threads = 2L,
normalize = FALSE,
silent = 2,
backend = "cmdstanr",
algorithm = "meanfield",
stan_model_args = list(stanc_options = list("O1")),
seed = 312)
Some things I have tried:
- Switching the spread of the priors to be heavier-tailed (did not work)
- Using the pathfinder and fullrank algorithms instead of meanfield (both took way, way longer and so aren’t really useable)
- Different random seeds (this was a bit of a hail mary I suppose)
- Increasing the iterations per-chain (only made things slower)
Any help and / or ideas would be very well-appreciated - this is the first time I’ve gone deeper into Bayesian modeling and so I assume there’s something I’m missing.
Thanks!
-Jack