More flexible than Dirichlet-multinomial: gamma-multinomial (?)



In implementing a dirichlet_multinomial model (with 1K - 20K categories)

   y[n] ~ dirichlet_multinomial( precision * softmax( X[n] * beta ) )

I have noticed that there is not a single value for precision that fits well all the data. That is, the abundant categories have much variability that explained by the model, and viceversa (the rare categories have less variability). It seems that the gene wise overdispersion is quite more complex than a single number (this is the reason why people model sequencing data with NB rather than multinomial, even though multinomial is the real final process that gives us the data)

This leads to:

  • if precision is a parameter, the most abundant categories push the precision down, so regression on rare categories with positive slopes are clearly missed. (red dots observed, lines are the generated quantities)

  • if I set a higher precision, the abundant categories include false positive regressions, and the variability is obviously underestimated. (red dots observed, lines are the generated quantities)

I think I need a more flexible implementation where I have to set either:

  • a precition/overdispersion proportional to the expected value, or
  • a precition/overdispersio independent for every category

Something like

~ dirichlet_like_multinomial( vector[N] precision ,  vector[N] simplex);

Where precision is either a function of the proportion (of the simplex), or independent values, that are simply correlated with proportion values.

(I hope it is not too crazy)

The generative model I am thinking

biology ->
gene_expression_expected_value(gene 1…G, person 1) ->
NB_2(…, overdispersion) ->
softmax( all genes ) ->
multinomial() ->

and a regression is built over many persons each caracterised by a value for a covariate of interest

with overdispersion = (or ~) function(gene_expression_expected_value)

In which overdispersion depends on the value of gene_expression_expected_value

Non-centred parameterisation for Dirichlet distribution

Analogously to what dirichlet-multinomial is (I am using the conjugate more efficient implementation, the following implementation is just for discussion)


paramters { 
  simplex[K] theta[N];  // addition of many intermediate parameters
  matrix[K, R] beta;
  real<lower=0> xi; // precision drichlet
model { 
   alpha = X *  beta;
   theta[n] ~ dirichlet( xi * softmax( alpha[n] ) ); 
   y[n] ~ multinomial(theta[n]); 

I am trying to get my head around a gamma-multinomial alternative. First having a unique overdispersion parameter for replicating the dirichlet-multinomial first


paramters { 
  vector[K] theta[N];  // addition of many intermediate parameters
  matrix[K, R] beta;
  real<lower=0> overdispersion;
model { 
   alpha = X *  beta;
   theta[n] ~ gamma_mean_overdispersion( exp ( alpha[n] ), overdispersion);  // eventually overdspersion will be a function of the mean
   y[n] ~ multinomial( softmax( log( theta[n] ) ) );  // I will try to avoid all this exp-lop back and forth -> aaronjg, post:7, topic:4358


My questions are:

  • Is this in principle a viable implementation?
  • Is the huge matrix[K, N] theta necessarily, or we can express theta as tranformed parameter, of a parameter of size [K, R] (like beta is). How is possible that theta is much bigger than beta, which really regulates all the regressions? For a full study K = 20000 and N = 200.

Non-centred parameterisation for Dirichlet distribution

I don’t really follow what are the mechanisms/features of your data that you are trying to model with either the Dirichlet-multinomial, Negative-bionimial-softmax-multinomial or Gamma-multinomial models. I think it would be hard to make progress without sorting this out. In particular, I remain unconvinced that multinomial likelihood represents well what is going on in RNA-seq, but I am open to be proven wrong.

Here is how I understand the generative process behind RNA-seq data (please point out mistakes if you see them, I am not a biologist):

Processes in cells:

  1. When a cell is in an ideal steady state, the mRNA count of any gene is Poisson-distributed (Poisson follows immediately from the assumption that the probability of degrading an mRNA molecule is the same as probability of synthesising one).
  2. At single-cell level, the above assumption breaks and more detailed models that account for transcriptional bursting are needed (may result in multimodal distributions). Some nice discussion here:
  3. In bulk RNA-seq we average over many cells, which reduces the effect of single-cell fluctuations (and thus should make the distribution appear close to Poisson), but the cells in question a) may not be in steady state and b) may not be in the same state -> this creates over dispersion and Gamma-Poission (Negative binomial) fits the data better.
  4. When using a linear model to predict expression, we are likely to not capture all the predictors that are actually important so even if the cell population was ideal (and thus mRNA counts were Poisson distributed for hypothethical replicates of each gene and sample individually) we would get some residual error -> overdispersion -> negative binomial

So negative binomial seems to be a good model for mRNA counts in cells (before we sequence). There are a bunch of processes that I ignored (e.g. alternative splicing) but I don’t think they change the picture greatly, it just gets more messy.


  1. The cells are lysed and mRNA is captured, fragmented and the fragments are reverse transcribed to cDNA - this is messy and lossy, but if we assume that given mRNA breaks into a fixed number of fragments and a fixed proportion of fragments is turned into cDNA (the numbers may differ across genes), the distribution remains neg. binom. Note that in general we cannot model the fates of individual fragments as there are many ways the mRNA can be fragmented and the length and GC content of the individual fragments will influence their behavior in following steps. AFAIK everyone assumes that all the fragments are exchangeable and I will too.
  2. The cDNA is amplified in PCR (roughly: in each PCR cycle, most cDNA molecules give raise to a single additional copy, a few dozen cycles are run) - this is extremely hard to model for small counts of molecules - e.g. if there is only a single molecule for a given gene, it becomes vital whether it is duplicated in the first PCR cycle or not, resulting in multimodal distribution after PCR. Luckily, as the number of PCR cycles and initial molecule counts grows, the PCR step becomes more predictable and Gaussian-ish. AFAIK everyone ignores this when modelling RNA-seq and just hopes the distribution remains neg. binom and all of PCR flukes are covered with overdispersion.
  3. The cDNA is loaded into the sequencer where some of it is sequenced, depending on sequencing depth and other things. Of the sequenced fragments some are succesfully mapped to the genes of origin, some not. Once again it does not seem unreasonable to assume that for each gene a fixed proportion of fragments survives this process, keeping the distribution neg. binom.

Throughout this discussion neg. binom seems to be a reasonable choice - in particular because of the natural overdispersion it provides. Could you explain why you assume the proposed alternatives would be a better model and which particular steps of the process are you trying to account for with them? Or are there hints in the data the neg. binom is not a good fit?

IMHO the multinomial model is appropriate for step 7 at best, but then it would need to be multinomial model over neg. binom initial counts. I know that neg.binom -> binomial is still neg. binom, though I am not sure what multinomial distribution will do here. I would however guess that the negative correlations induced by multinomial (e.g. that sequencing a lot of a single gene means less copies of other genes are sequenced) is negligible, since the proportion of each gene among the total reads is small (unless you made some mistake and have 90% of your reads from the ribosome or something). But maybe there is an analytic solution to neg. binom -> multinomial?

Also the multinomial behavior holds only when each sample is in its own lane, but you may have multiple samples in the same lane (distinguished by barcode sequences) resulting in a more complex correlations across gene counts.


Yes I agree this is probably the best simple model on what is going on at the tissue/cell level.

Exactly, I think you got the point in the logics I am using. the step 7 is analogous to fish 1 million balls from a urn that contains 1 billion balls of 20K colors. Each sample depending if was sequenced deep or shallow (or depending on the molecule degradation) will extract a different number. And across many urns (patients) the number of balls of each color has been produced with a negative binomial process. So here it comes negative binomial -> multinomial

Yes I am trying to do gamma -> multinomial (stan does not have integer parameters).

That’s a good point (@Bob_Carpenter ) however the multinomial will still model the uncertainty of tubes that have been sequenced deeply (more extractions from the urn) versus tubes that have little number of reads (carry less information, and the data have less weight). I see however that this does at a cost of complicating the model (from a simple dirichlet-multinomial).

I wish, however I don’t think so. At the moment I am implementing

This should be negligible, depending on the case in mixing the lanes. etc. However, better to start from the main processes, and build up.


But this can also be modelled in the neg. binom. framework by scaling the mean, e.g.:

Y[i,j] ~ neg_binomial_2(c[i,j] * beta[i,] * X[,j], phi)

where c[i,j] is the sample and gene specific normalization factor that is passed as data. There are tools to calculate those from RNA-seq data (some only have sample specific, others also have gene specific taking into account GC-bias or similar stuff). I have never really used those, but the DESeq2 paper has some discussion of this approach. The scaling also models the different information content of each sample as lowering the mean of a neg. binom results in lower signal to noise ratio. Why would that be inappropriate in your case?


It might model indirectly the same two NB -> multinomial generative process. I would like to involve @Bob_Carpenter in this conversation as he can have a better idea.


I don’t follow this. We take a finite sample and then keep chopping it up. Never do we have an open-ended count distribution like Poisson or negative binomial. I can swee why people like independent Poissons or negative binomial because it’s simpler. But it doesn’t really normalize properly.

The negative binomial is usually applied as a hack when the explantory covariates aren’t sufficient to describe the data. In general, this results because there’s unexplained overdispersion w.r.t. some kind of model. But that’s not particularly generative in the biological process sense as there’s nothing to account for the overdispersion (such as latent states of cells).

What’s a lane and what are bar codes?


This is actually a quote from @martinmodrak:

This is why you can often get by with a Poisson approximation to multinomials without much difference to the model.


When you sequence, some samples/tubes can be put in external lanes of the sequencing chip, and this can affect sequencing thus be a confounding factor (I guess this technical artefact depends on the technology used). This however depends on the sequencing strategy (e.g., some technician might decide to mix all samples and distribute them equally across the chip(s)). In this sense there are is an amount of variation that is not easily explicable (with other 20+ steps/reaction that from the tissue bring to sequencing). Of course nobody has done I think a proper analysis of this variation in Bayesian terms.

(That’s why is so much easier to include all unwanted variation within the same bucket)

Bar code is a unique artificial sequence (one kind per sample/tube) that you attach to every molecule of a sample/tube so you can mix all the samples/tubes together for sequencing and still link sequence-tube uniquely, this lowering the cost of sequencing (and lowering the confounding factors due to multiple runs of sequencing).

P.S., I am now thinking whether it is better generally speaking:

  • randomising, or
  • keeping the objects of the analysis separate and add covariates in the analysis accounting for possible biases


I think you are overestimating the amount of order to be found in biology :-) But I’ll be happy to be proven wrong. Please keep in mind though that what follows is a computer scientists’s understanding of complex biology, so there is grain of salt to be taken.

The claim I was trying to make is that if you had a magical device that exactly identifies all mRNA molecules in your sample and you did exact replications of the same experiment (e.g. by taking multiple samples at the same time from a single homogenous cell culture grown in exactly controlled conditions, synchronized at the same phase of the cell cycle etc.) the marginal distribution of mRNA counts for most genes across replicates will be overdispersed Poisson. This is a consequence of the points 1-3 in my first post in this thread.

I am not sure if you are disputing any of this or not. I should however note that there is still some discussion on the actual shapes of the distributions, also coming down to whether you should analyze the raw read counts or normalized expression levels (e.g. this recent paper using 20-27 replicates per condition to claim Gaussian - Power law mixture for normalized expression levels).

I agree that there is some appeal to model the sample preparation and sequencing steps as multinomial, but the substantial pre-sequencing variability has to be taken into account. In addition, the PCR step (see step 6 in my post above) may substantially inflate variability of lowly expressed genes. Also the mixing of samples across lanes (physical compartments of the sequencing machine) - which @stemangiola explained better than I would - means that your reads might not be draws from a single bucket of multicolored balls but a sum of draws from multiple buckets where also other samples were “competing” for sequencing capacity. The details depend on the actual protocol the data in question was collected with, but multinomial is likely an over-simplification.

I am by no means knowledgeable about normalization procedures for RNA-seq, but AFAIK the normalization constants usually take into account more information than just the total number of molecules sequenced (which is what the multinomial can handle). I’ve heard of per-sample normalization via median of ratios and per-gene normalization based on GC/AT ratio in the transcript sequence and transcript length. This suggests (but does not prove) that there are reasons to avoid normalizing simply by total read count.

If I understand your points correctly, the case for multinomial against separate binomial for each gene is that multinomial introduces global correlation structure. Note however that the underlying biology also provides a global correlation structure, which is largely unknown (uncovering it would IMHO be Nobel prize worthy), and its effects are IMHO larger than what I would expect from the multinomial.


I’m sure that’s just an approximation, too. My point was just that when you look at your steps 1–3, they’re giving you a mechanism with which to model the overdispersion directly. As soon as you do something like a Poisson GLM, the results are “overdispersed” compared to a naive intercept-only model.

The nice part about Bayesian modeling with a flexible tool like Stan is that we don’t need to make these Gaussian approximations.

I’m not at all trying to say sample prep and sequencing are multinomial end-to-end. I’m just saying that the sequencing is multinomial conditioned on the sample you put in the sequencer. What I’d like to see is a model of how you get to that sample, which is what I think your steps are outlining. Then we don’t just have to say “overdispersed”, we can point to things like amplifcation steps and differential binding and censoring in the alignments due to copies, etc. etc.

All the ones I’ve ever seen are terribly ad hoc. That’s why something that accounted for the actual process should be better. I think there’s a big opportunity here to start modeling.

Isn’t regulation the way people are trying to study “global correlation structure”? In addition to the more local correlation structure you get from chromatin folding (and which caused Mendel to fudge his data).


This would be something different. You’d have a large bucket of sequences and some kind of selection procedure for getting through. But again my point is to model that competition if possible rather than just throwing in extra dispersion.


@Bob_Carpenter I took that opportunity and built a logNormal-multinomal model that works. Now (differently from the less-generative NB model) it has bad mixing with default control parameters.

  • logNormal-multinomial, good inference, stringent control - 2000 seconds
  • NB, good inference, default control - 160 seconds

Now I need a bit of help in making the multinomial model more stable.

A first question is: is there a way to decrease the dimensionality of the intermediate parameter? In

    int<lower = 0> y[T, G];             // RNA-seq counts

    matrix[R, G] beta;
    vector[G] alpha[T];

	// Likelihood
	for(t in 1:T) alpha[t] ~ normal(X[t] * beta , overdispersion);
	for(t in 1:T) y[t] ~ multinomial( softmax( alpha[t] ) );

if beta has dimension [R,G], why I have to use an intermediate parameter alpha_gamma that has dimension [T, G]?

  • R = 2 covariate
  • T = 20 tubes

P.S. I tried to put the upstream normal prior only on beta0 intercept

transformed parameters{
alpha = X * beta_upstream;
	// Likelihood

	beta_upstream[1] ~ normal( beta[1] , overdispersion);
	for(t in 1:T) y[t] ~ multinomial( softmax( alpha[t] ) );

but it did not work, as overdispersion stays at 0


There isn’t an alpha_gamma parameter. There is a T x G parameter alpha.

Yes, you need to draw that if you want to handle the model this way.

Is overdispersion a vector?

What kind of prior are you putting on beta? Because otherwise alpha is not going to be identified.

Something like the Dirichlet-multinomial has an analytic marginalization.

theta ~ dirichleta(alpha);
y ~ mulitnomial(theta);

reduces to

y ~ dirichlet_multinomial(alpha);

Stan only has the beta-binomial, but the Dirichlet-mulitnomial is similar.


Sorry I meant “alpha”

No, a real at the moment. But this setup in theory allows more dynamic noise models

to the slope I am putting reg_horseshoe, the intercept I am using

~ normal(0,5);
sum(..) ~ normal(0, 0.01 * G)

Does it sound right? I was thnking of constraining alpha to sum to zero (per tube) as well, but it doesn’t sounds right to constrain both the sum of alpha and intercept to 0. (should I?)

This is how I open this thread with :) . The noise model of Dirichlet seem to be too stringent for high value components (highly transcribed genes). Moreover would be nice to be able to expand this in future (if I can make it work) with gene wise “overdispersion” and maybe correlated genes.

But the use of alpha intermediate seem to complicate everything. My best bet is that I am use sum-to-zero constrain in the wrong way on the wrong parameters (on alpha or on beta?)


There is usually a non-zero intercept along with sum-to-zero random effects.

Just implement it as a user-defined distribution. The pmf only involves gamma functions (and factorials, which can be implemented with gamma):


Thanks, that’s correct surely for sequenced read counts, that have a left tail. Because of this I implemented from the code of @aaronjg a gamma_log density function that in my opinion suits my “sum-to-zero random effects”

real gamma_log_lpdf(vector x_log, real a, real b){

				vector[rows(x_log)] jacob = x_log; //jacobian
				real norm_constant = a * log(b) -lgamma(a);
				real a_minus_1 = a-1;
				return sum( jacob ) + norm_constant * rows(x_log) + sum(  x_log * a_minus_1 - exp(x_log) * b ) ;


I hope will be of use to someone.


I spent a bit of energy trying to understand the connection between negative binomial, multinomial and Dirichlet multinomial. Here is what I learned in hope it will be useful to somebody in the distant future. I am mainly building on sections 2 to 3.1 of GuimarĂŁes & Lindrooth (2007): Controlling for overdispersion in grouped conditional logit models: A computationally simple application of Dirichlet-multinomial regression - it is unfortunately paywalled, but I was not able to find a non-paywalled reference. I hope I got all the math right.

GuimarĂŁes & Lindrooth show that the connection between Dirichlet-Multinomial (DM) and Negative Binomial (NB) is the same as between Multinomial and Poisson. In particular, DM can be derived starting with a base NB regression and conditioning on the sum of counts observed per group. This however only works for a certain parametrization of the NB.

More formally, let us assume g \in {1, ... , G} correspond to genes and s \in {1,...S} are samples and we have latent NB-distributed variables z_{g,s} and predictor values \lambda_{g,s}:

z_{g,s} \sim NB(\lambda_{g,s}, \theta_s)

But, importantly the NB is parametrized so that:

E(z_{g,s}) = \lambda_{g,s}
Var(z_{g,s}) = \lambda_{g,s} ( 1 + \theta_s)

Which is different from the NB_2 parametrization in Stan, which would assume Var(z_{g,s}) = \lambda_{g,s} ( 1 + \frac{\lambda_{g,s}}{\phi_s}) This difference will become important later on. Also note that \theta is allowed to vary by sample, but not allowed to vary by gene. For biological applications it makes IMHO little sense to vary \theta by sample, but I keep this for completeness.

Here \lambda_{g,s} is your usual predictor. For a linear model, you’d have \lambda_{g,s} = exp(\alpha_s + \beta_g X_s) where \alpha_s represents all influences that don’t change within sample.

The next step in the reasoning is one I am least sure I understand correctly - the papers says “conditioning on the sum of counts for each group”, which I think can be translated as follows:

When we observe n_s transcripts randomly sampled from all the z_{g,s}, i.e.

(x_{1,s},...,x_{G,s}) \sim Multinomial(n_s, \frac{1}{\bar{z}_s} (z_{1,s}, ... , z_{G,s}) )

where \bar{z}_s = \sum_{g=1}^{G}{z_{g,s}}

the distribution of x_{g,s} becomes:

(x_{1,s}, ..., x_{G,s}) \sim DM(\delta_s^{-1} \tilde{\lambda}_{1,s}, ...., \delta_s^{-1} \tilde{\lambda}_{G,s})

where \tilde{\lambda}_{g,s} = \frac{\lambda_{g,s}}{exp(\alpha_s)} = \beta_g X_s

and \delta_s^{-1} = \theta_s^{-1} exp (\alpha_s).

This means that we cannot distinguish \theta_s from any constant per-sample influence. This is why we need softmax in the DM formulation in the original post of this topic - it avoids this non-identifiability.

The reason this model either overestimates the variance of lowly expressed genes or underestimates the variance of highly expressed genes is derived from the NB-parametrization used. When you hold \phi constant (as is easily done with the NB_2 parametrization in Stan), the variance of the NB variables is assumed to increase with \lambda_{g,s}^2 which tends to fit RNA-seq data nicely. While when you hold \theta constant (which is required by the DM model), the variance increases only with \lambda_{g,s}.

If you want to read the GuimarĂŁes & Lindrooth paper, it uses different terminology, but the mapping is as follows:

  • individuals (also called decision makers), indexed by i - those would be individual transcripts
  • groups, indexed by g - those would be samples/tubes
  • decisions, indexed by j - those would be genes