Wildly inflated mean values in negative-binomial hierarchical model

Hi all,

I’m currently working on an R package that models the mean (\mu) and overdispersion (\theta) of single-cell gene expression (strictly non-negative, overdispersed counts) using a negative-binomial hierarchical model with per-gene random effects. I’m fitting the model with meanfield variational inference via brms and the cmdstanr backend. The model worked well on simulated data, i.e., the posterior estimates were reasonably close to frequentist estimates of the same statistics, and I was able to generate credible intervals for the mean and variance of each gene’s expression after fitting. However, when applying it to a real dataset I’m seeing wildly inflated estimates of the mean expression for each gene, on the order of 5x10^55 as opposed to the “true” mean of ~4. The basic model and priors I’m using are as follows, where y_{ig} is the expression of gene g in cell i and the subscripts 0 and g indicate the global intercept and a gene-specific random effect, respectively:

\begin{aligned}
y_{ig} &\sim \text{NB}(\mu_g, \theta_g) \\
\log(\mu_g) &= \beta_0 + \gamma_g \\
\beta_0 &\sim \text{Gaussian}(0, 2) \\
\gamma_g &\sim \text{Gaussian}(0, \sigma_\gamma), \quad \sigma_\gamma \sim t(3, 0, 2) \\
\log(\theta_g) &= \alpha_0 + \kappa_g \\
\alpha_0 &\sim \text{Gaussian}(0, 2) \\
\kappa_g &\sim \text{Gaussian}(0, \sigma_\kappa), \quad \sigma_\kappa \sim t(3, 0, 2)
\end{aligned}

Assuming we have a data frame called expr_df with two columns, count (the gene expression) and gene (the gene name), this is the code I’m using to fit the distributional regression model:

# priors on the global intercepts and on the SDs of the per-gene random effects,
# for both the mean (mu) and the overdispersion (shape) sub-models
model_priors <- c(brms::set_prior("normal(0, 2)", class = "Intercept", dpar = "mu"),
                  brms::set_prior("student_t(3, 0, 2)", class = "sd", dpar = "mu"),
                  brms::set_prior("normal(0, 2)", class = "Intercept", dpar = "shape"),
                  brms::set_prior("student_t(3, 0, 2)", class = "sd", dpar = "shape"))

# distributional regression: gene-level random intercepts for both mu and shape
model_formula <- brms::bf(count ~ 1 + (1 | gene), shape ~ 1 + (1 | gene))

# fit with meanfield variational inference via the cmdstanr backend
brms_fit <- brms::brm(model_formula,
                      prior = model_priors,
                      data = expr_df,
                      family = brms::negbinomial(link = "log", link_shape = "log"),
                      chains = 3L,
                      iter = 1000L,
                      warmup = 250L,
                      thin = 5L,
                      cores = 3L,
                      threads = 2L,
                      normalize = FALSE,
                      silent = 2,
                      backend = "cmdstanr",
                      algorithm = "meanfield",
                      stan_model_args = list(stanc_options = list("O1")),
                      seed = 312)
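
For reference, this is roughly how I’m seeing the inflated means after fitting (a minimal sketch rather than the exact code from the package):

# per-gene intercepts on the log scale (global intercept + gene-level offset);
# exponentiating gives each gene's estimated mean expression
gene_coefs <- coef(brms_fit)$gene
est_means  <- exp(gene_coefs[, "Estimate", "Intercept"])

summary(est_means)  # on the real data this is where the ~5x10^55 values show up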

Some things I have tried:

  1. Making the priors wider and heavier-tailed (did not help)
  2. Using the pathfinder and fullrank algorithms instead of meanfield (both took way, way longer and so aren’t really usable)
  3. Different random seeds (a bit of a hail mary, I suppose)
  4. Increasing the number of iterations per chain (only made things slower)

Any help and/or ideas would be much appreciated - this is the first time I’ve gone this deep into Bayesian modeling, so I assume there’s something I’m missing.

Thanks!
-Jack

Have you looked into / visualized how your real data differ from the simulated data? If the model recovers parameters well in simulation but not on the real data, that suggests the simulated data don’t match the real data well, which could mean the model is misspecified or that some preprocessing is needed before putting the real data through the model. I think this should be your first step, and it can offer clues about which features of the real data are breaking the model.
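
For example, you could start by comparing simple per-gene summaries between the real and simulated data, something like the sketch below (it assumes your simulated data are in a data frame sim_df with the same count and gene columns as expr_df):

# per-gene mean, variance, and fraction of zeros, computed the same way for the
# real data (expr_df) and the simulated data (sim_df)
summarize_genes <- function(df) {
  do.call(rbind, lapply(split(df$count, df$gene), function(y) {
    data.frame(mean = mean(y), var = var(y), frac_zero = mean(y == 0))
  }))
}

real_summary <- summarize_genes(expr_df)
sim_summary  <- summarize_genes(sim_df)

# quick visual comparison, e.g. of the per-gene zero fraction
hist(real_summary$frac_zero, breaks = 50, main = "Real data: per-gene zero fraction")
hist(sim_summary$frac_zero,  breaks = 50, main = "Simulated data: per-gene zero fraction")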

Something else to consider: once parameters are far enough out in the tails of certain distributions, the differences between, say, p ~ 5, 10, 100, 1000, etc. become negligible in terms of their probabilities, so without millions or billions of data points the model cannot distinguish them meaningfully.
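
As a toy illustration of that flavor of flatness (made-up numbers, nothing to do with your actual data): with a very small shape parameter, the negative-binomial log-likelihood of a handful of low counts barely changes as the mean moves from 5 to 1000 to 10^6, so a small sample carries almost no information about it.

# toy example: a few low counts and a very small (i.e. highly overdispersed) shape
y     <- c(0, 0, 0, 1, 2)
shape <- 0.01

# log-likelihood of the same data under wildly different means
sum(dnbinom(y, mu = 5,    size = shape, log = TRUE))
sum(dnbinom(y, mu = 1000, size = shape, log = TRUE))
sum(dnbinom(y, mu = 1e6,  size = shape, log = TRUE))
# the three totals differ by less than one log-likelihood unit, so these five
# observations give the model essentially no way to tell the means apart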

So I believe I figured out the real issue - I was subsampling per gene (i.e., selecting around 500 cells per gene to model on), which left some genes with all or nearly all values equal to 0, hence the modeling problems. I fixed it by performing the subsampling by quintiles of gene expression, so that at least some non-zero values are selected for every gene.
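
In case it’s useful to anyone else, the idea of the fix looks roughly like this (a simplified sketch rather than the exact package code; it assumes the full, un-subsampled data are in full_df with the same count and gene columns as expr_df):

set.seed(312)
n_cells <- 500L  # number of cells retained per gene

# sketch of the stratified subsampling: within each gene, split the cells into
# quintiles of their expression rank and sample from every quintile, so the
# subsample always covers the top of the gene's expression distribution instead
# of being drawn uniformly at random
subsample_gene <- function(df, n = n_cells) {
  quintile <- ceiling(5 * rank(df$count, ties.method = "first") / nrow(df))
  idx <- unlist(lapply(split(seq_len(nrow(df)), quintile), function(i) {
    i[sample.int(length(i), size = min(length(i), ceiling(n / 5)))]
  }))
  df[idx, , drop = FALSE]
}

expr_df <- do.call(rbind, lapply(split(full_df, full_df$gene), subsample_gene))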

Also, in this case there are millions of data points, so despite the sparsity of the data I believe the model is now performing well (or, at least, the estimates I’m now getting are much more reasonable).