Wildly inflated mean values in negative-binomial hierarchical model

Hi all,

I’m currently working on an R package that models the mean (\mu) and overdispersion (\theta) of single-cell gene expression (strictly non-negative, overdispersed counts) using a negative-binomial hierarchical model with per-gene random effects. I’m fitting the model with meanfield variational inference via brms and the cmdstanr backend. The model worked well on simulated data, i.e., the posterior estimates were reasonably close to frequentist estimates of the same statistics, and I was able to generate credible intervals for the mean and variance of each gene’s expression after fitting. However, when applying it to a real dataset I’m seeing wildly inflated estimates of the mean expression for each gene, on the order of 5x10^55 as opposed to the “true” mean of ~4. The basic model and priors I’m using are as follows, where y_{ig} is the expression of gene g in cell i and the subscripts 0 and g indicate the global intercept and a gene-specific random effect, respectively:

\begin{aligned}
y_{ig} &\sim \text{NB}(\mu_g, \theta_g) \\
\log(\mu_g) &= \beta_0 + \gamma_g \\
\beta_0 &\sim \text{Gaussian}(0, 2) \\
\gamma_g &\sim \text{Gaussian}(0, \sigma_\gamma), \quad \sigma_\gamma \sim t(3, 0, 2) \\
\log(\theta_g) &= \alpha_0 + \kappa_g \\
\alpha_0 &\sim \text{Gaussian}(0, 2) \\
\kappa_g &\sim \text{Gaussian}(0, \sigma_\kappa), \quad \sigma_\kappa \sim t(3, 0, 2)
\end{aligned}

Assuming we have a data frame called expr_df with two columns, count (the gene expression) and gene (the gene name), this is the code I’m using to fit the distributional regression model:

# priors on the global intercepts and on the SDs of the per-gene random effects,
# for both the mean (mu) and the overdispersion (shape) sub-models
model_priors <- c(brms::set_prior("normal(0, 2)", class = "Intercept", dpar = "mu"),
                  brms::set_prior("student_t(3, 0, 2)", class = "sd", dpar = "mu"),
                  brms::set_prior("normal(0, 2)", class = "Intercept", dpar = "shape"),
                  brms::set_prior("student_t(3, 0, 2)", class = "sd", dpar = "shape"))

# distributional regression: gene-level random intercepts for both mu and shape
model_formula <- brms::bf(count ~ 1 + (1 | gene), shape ~ 1 + (1 | gene))

# fit with meanfield variational inference via the cmdstanr backend
brms_fit <- brms::brm(model_formula,
                      prior = model_priors,
                      data = expr_df,
                      family = brms::negbinomial(link = "log", link_shape = "log"),
                      chains = 3L,
                      iter = 1000L,
                      warmup = 250L,
                      thin = 5L,
                      cores = 3L,
                      threads = 2L,
                      normalize = FALSE,
                      silent = 2,
                      backend = "cmdstanr",
                      algorithm = "meanfield",
                      stan_model_args = list(stanc_options = list("O1")),
                      seed = 312)
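
For reference, this is roughly how I’m seeing the inflated means after fitting (a minimal sketch rather than the exact code from the package):

# per-gene intercepts on the log scale (global intercept + gene-level offset);
# exponentiating gives each gene's estimated mean expression
gene_coefs <- coef(brms_fit)$gene
est_means  <- exp(gene_coefs[, "Estimate", "Intercept"])

summary(est_means)  # on the real data this is where the ~5x10^55 values show up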

Some things I have tried:

  1. Making the priors wider and heavier-tailed (did not help)
  2. Using the pathfinder and fullrank algorithms instead of meanfield (both took way, way longer and so aren’t really usable)
  3. Different random seeds (a bit of a hail mary, I suppose)
  4. Increasing the number of iterations per chain (only made things slower)

Any help and/or ideas would be much appreciated - this is the first time I’ve gone this deep into Bayesian modeling, so I assume there’s something I’m missing.

Thanks!
-Jack

Have you looked into / visualized how your real data differ from the simulated data? If the model recovers parameters well in simulation but not on the real data, that suggests the simulated data don’t match the real data well, which could mean the model is misspecified or that some preprocessing is needed before putting the real data through the model. I think this should be your first step, and it can offer clues about which features of the real data are breaking the model.
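
For example, you could start by comparing simple per-gene summaries between the real and simulated data, something like the sketch below (it assumes your simulated data are in a data frame sim_df with the same count and gene columns as expr_df):

# per-gene mean, variance, and fraction of zeros, computed the same way for the
# real data (expr_df) and the simulated data (sim_df)
summarize_genes <- function(df) {
  do.call(rbind, lapply(split(df$count, df$gene), function(y) {
    data.frame(mean = mean(y), var = var(y), frac_zero = mean(y == 0))
  }))
}

real_summary <- summarize_genes(expr_df)
sim_summary  <- summarize_genes(sim_df)

# quick visual comparison, e.g. of the per-gene zero fraction
hist(real_summary$frac_zero, breaks = 50, main = "Real data: per-gene zero fraction")
hist(sim_summary$frac_zero,  breaks = 50, main = "Simulated data: per-gene zero fraction")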

Something else to consider: once parameters are far enough out in the tails of certain distributions, the differences between, say, p ~ 5, 10, 100, 1000, etc. become negligible in terms of their probabilities, so without millions or billions of data points the model cannot distinguish them meaningfully.
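
As a toy illustration of that flavor of flatness (made-up numbers, nothing to do with your actual data): with a very small shape parameter, the negative-binomial log-likelihood of a handful of low counts barely changes as the mean moves from 5 to 1000 to 10^6, so a small sample carries almost no information about it.

# toy example: a few low counts and a very small (i.e. highly overdispersed) shape
y     <- c(0, 0, 0, 1, 2)
shape <- 0.01

# log-likelihood of the same data under wildly different means
sum(dnbinom(y, mu = 5,    size = shape, log = TRUE))
sum(dnbinom(y, mu = 1000, size = shape, log = TRUE))
sum(dnbinom(y, mu = 1e6,  size = shape, log = TRUE))
# the three totals differ by less than one log-likelihood unit, so these five
# observations give the model essentially no way to tell the means apart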

So I believe I figured out the real issue - I was subsampling per gene (i.e., selecting around 500 cells per gene to model on), which left some genes with all or nearly all values equal to 0, hence the modeling problems. I fixed it by performing the subsampling by quintiles of gene expression, so that at least some non-zero values are selected for every gene.
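
In case it’s useful to anyone else, the idea of the fix looks roughly like this (a simplified sketch rather than the exact package code; it assumes the full, un-subsampled data are in full_df with the same count and gene columns as expr_df):

set.seed(312)
n_cells <- 500L  # number of cells retained per gene

# sketch of the stratified subsampling: within each gene, split the cells into
# quintiles of their expression rank and sample from every quintile, so the
# subsample always covers the top of the gene's expression distribution instead
# of being drawn uniformly at random
subsample_gene <- function(df, n = n_cells) {
  quintile <- ceiling(5 * rank(df$count, ties.method = "first") / nrow(df))
  idx <- unlist(lapply(split(seq_len(nrow(df)), quintile), function(i) {
    i[sample.int(length(i), size = min(length(i), ceiling(n / 5)))]
  }))
  df[idx, , drop = FALSE]
}

expr_df <- do.call(rbind, lapply(split(full_df, full_df$gene), subsample_gene))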

Also, in this case there are millions of data points, so despite the sparsity of the data I believe the model is now performing well (or, at least, the estimates I’m now getting are much more reasonable).