More flexible than Dirichlet-multinomial: gamma-multinomial (?)


#1

While implementing a dirichlet_multinomial model (with 1K to 20K categories),

model {
   ...
   y[n] ~ dirichlet_multinomial( precision * softmax( X[n] * beta ) );
}
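(dirichlet_multinomial is not a built-in Stan distribution; here it stands for a user-defined marginalised lpmf along these lines, shown only as a sketch so the notation is concrete:)

functions {
  // Dirichlet-multinomial pmf, marginalising out the latent simplex
  real dirichlet_multinomial_lpmf(int[] y, vector alpha) {
    real alpha_0 = sum(alpha);
    int n_total = sum(y);
    return lgamma(alpha_0) - lgamma(n_total + alpha_0)
           + sum(lgamma(to_vector(y) + alpha)) - sum(lgamma(alpha))
           + lgamma(n_total + 1) - sum(lgamma(to_vector(y) + 1));
  }
}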

I have noticed that there is no single value of precision that fits all the data well. That is, the abundant categories have much more variability than the model explains, and vice versa (the rare categories have less variability than the model predicts). It seems that the gene-wise overdispersion is rather more complex than a single number (this is the reason why people model sequencing data with a negative binomial rather than a multinomial, even though the multinomial is the real final process that gives us the data).

This leads to:

  • if precision is a parameter, the most abundant categories push the precision down, so regressions on rare categories with positive slopes are clearly missed (red dots are the observed data; lines are the generated quantities);

  • if I set a higher precision, the abundant categories include false-positive regressions, and their variability is obviously underestimated (red dots are the observed data; lines are the generated quantities).

I think I need a more flexible implementation where I set either:

  • a precision/overdispersion proportional to the expected value, or
  • a precision/overdispersion that is independent for every category.

Something like

~ dirichlet_like_multinomial( vector[N] precision ,  vector[N] simplex);

Where the precision is either a function of the proportions (of the simplex) or a set of independent values that just happen to be correlated with the proportions.

(I hope it is not too crazy)
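Concretely, a minimal sketch of the first option (the precision as a log-linear function of the expected proportion); the parameters a and b, the link, and the priors are placeholders for discussion, not a worked-out model:

data {
  int<lower=1> N;            // samples
  int<lower=2> K;            // categories (genes)
  int<lower=1> R;            // covariates
  matrix[N, R] X;
  int<lower=0> y[N, K];      // observed counts
}
parameters {
  matrix[K, R] beta;
  real a;                    // baseline log-precision
  real b;                    // how precision scales with expected abundance
  simplex[K] theta[N];
}
model {
  for (n in 1:N) {
    vector[K] mu = softmax(beta * X[n]');        // expected proportions
    vector[K] precision = exp(a + b * log(mu));  // category-specific precision
    theta[n] ~ dirichlet(precision .* mu);       // "dirichlet_like_multinomial"
    y[n] ~ multinomial(theta[n]);
  }
  to_vector(beta) ~ normal(0, 2);
  a ~ normal(0, 2);
  b ~ normal(0, 1);
}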

The generative model I am thinking of is:

biology ->
gene_expression_expected_value(gene 1…G, person 1) ->
NB_2(…, overdispersion) ->
softmax( all genes ) ->
multinomial() ->
observed_counts_person_1

and a regression is built over many persons, each characterised by a value of a covariate of interest,

with overdispersion = (or ~) function(gene_expression_expected_value)

in which the overdispersion depends on the value of gene_expression_expected_value.
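As a toy version of this chain for a single person, a generated-quantities-only Stan program could simulate it; the variable names and the "normalise the latent counts to a simplex" reading of softmax( all genes ) are my assumptions:

data {
  int<lower=2> G;                          // genes
  int<lower=1> total_reads;                // sequencing depth for this person
  vector<lower=0>[G] expected_expression;  // gene_expression_expected_value
  vector<lower=0>[G] overdispersion;       // per-gene NB overdispersion
}
generated quantities {
  vector[G] latent_counts;
  int observed_counts[G];
  for (g in 1:G)
    latent_counts[g] = neg_binomial_2_rng(expected_expression[g], overdispersion[g]);
  // softmax(log(counts + 1)) normalises the latent counts to a simplex
  // (the + 1 only avoids log(0) for genes with zero latent counts)
  observed_counts = multinomial_rng(softmax(log(latent_counts + 1)), total_reads);
}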


#2

Analogously to the Dirichlet-multinomial (I am using the more efficient conjugate implementation; the following is just for discussion):

dirichlet-multinomial


parameters { 
  simplex[K] theta[N];  // addition of many intermediate parameters
  matrix[K, R] beta;
  real<lower=0> xi;     // Dirichlet precision
  ... 
}
model { 
   matrix[N, K] alpha = X * beta';  // assuming X is the N x R design matrix
   for (n in 1:N) {
     theta[n] ~ dirichlet( xi * softmax( alpha[n]' ) ); 
     y[n] ~ multinomial( theta[n] ); 
   }
   ... 
}

I am trying to get my head around a gamma-multinomial alternative, first with a single overdispersion parameter, to replicate the Dirichlet-multinomial:

gamma-multinomial


parameters { 
  vector<lower=0>[K] theta[N];  // addition of many intermediate parameters
  matrix[K, R] beta;
  real<lower=0> overdispersion;
  ... 
}
model { 
   matrix[N, K] alpha = X * beta';  // assuming X is the N x R design matrix
   for (n in 1:N) {
     // eventually overdispersion will be a function of the mean
     theta[n] ~ gamma_mean_overdispersion( exp( alpha[n]' ), overdispersion );
     // softmax(log(theta)) just normalises theta to a simplex; I will try to avoid
     // all this exp-log back and forth -> aaronjg, post:7, topic:4358
     y[n] ~ multinomial( softmax( log( theta[n] ) ) );
   }
   ... 
}
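For reference, gamma_mean_overdispersion is not a built-in Stan distribution. A minimal sketch of how it could be defined, assuming "overdispersion" is meant in the neg_binomial_2 sense (the gamma is then the mixing distribution that turns a Poisson into neg_binomial_2(mu, phi)), with shape = phi and rate = phi / mu:

functions {
  // gamma with E[theta] = mu and Var[theta] = mu^2 / phi, elementwise over theta
  real gamma_mean_overdispersion_lpdf(vector theta, vector mu, real phi) {
    return gamma_lpdf(theta | phi, phi * inv(mu));
  }
}

With such a definition, the statement theta[n] ~ gamma_mean_overdispersion( exp( alpha[n]' ), overdispersion ); above works as written.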

My questions are:

  • Is this in principle a viable implementation?
  • Is the huge array theta (a vector of size K for each of the N samples) necessary, or can theta be expressed as a transformed parameter of a parameter of size [K, R] (like beta)? How is it possible that theta is much bigger than beta, which is what really drives all the regressions? For a full study, K = 20000 and N = 200.

#3

I don’t really follow which mechanisms/features of your data you are trying to model with the Dirichlet-multinomial, negative-binomial-softmax-multinomial, or gamma-multinomial models. I think it would be hard to make progress without sorting this out. In particular, I remain unconvinced that a multinomial likelihood represents well what is going on in RNA-seq, but I am open to being proven wrong.

Here is how I understand the generative process behind RNA-seq data (please point out mistakes if you see them, I am not a biologist):

Processes in cells:

  1. When a cell is in an ideal steady state, the mRNA count of any gene is Poisson-distributed (Poisson follows immediately from the assumption that the probability of degrading an mRNA molecule is the same as the probability of synthesising one).
  2. At the single-cell level, the above assumption breaks down and more detailed models that account for transcriptional bursting are needed (they may result in multimodal distributions). Some nice discussion here: http://www.ncbi.nlm.nih.gov/pubmed/22499939
  3. In bulk RNA-seq we average over many cells, which reduces the effect of single-cell fluctuations (and thus should make the distribution appear close to Poisson), but the cells in question a) may not be in steady state and b) may not be in the same state -> this creates overdispersion, and a Gamma-Poisson (negative binomial) fits the data better (the mixture is written out after this list).
  4. When using a linear model to predict expression, we are likely not to capture all the predictors that are actually important, so even if the cell population were ideal (and thus mRNA counts were Poisson-distributed for hypothetical replicates of each gene and sample individually) we would get some residual error -> overdispersion -> negative binomial.
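For concreteness, the Gamma-Poisson mixture mentioned in points 3 and 4, written in the neg_binomial_2 parameterisation (my notation):

$$
\lambda \sim \mathrm{Gamma}(\phi, \phi/\mu), \quad y \mid \lambda \sim \mathrm{Poisson}(\lambda)
\;\Rightarrow\;
P(y) = \frac{\Gamma(y+\phi)}{\Gamma(\phi)\,y!}\left(\frac{\mu}{\mu+\phi}\right)^{y}\left(\frac{\phi}{\mu+\phi}\right)^{\phi},
$$

which is $\mathrm{NB}_2(\mu,\phi)$ with $\mathrm{E}[y]=\mu$ and $\mathrm{Var}[y]=\mu+\mu^2/\phi$.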

So the negative binomial seems to be a good model for mRNA counts in cells (before we sequence). There are a bunch of processes that I ignored (e.g. alternative splicing), but I don’t think they change the picture greatly; things just get messier.

Sequencing:

  5. The cells are lysed and the mRNA is captured, fragmented and the fragments are reverse transcribed to cDNA - this is messy and lossy, but if we assume that a given mRNA breaks into a fixed number of fragments and a fixed proportion of the fragments is turned into cDNA (the numbers may differ across genes), the distribution remains neg. binom. Note that in general we cannot model the fates of individual fragments, as there are many ways the mRNA can be fragmented, and the length and GC content of the individual fragments will influence their behavior in the following steps. AFAIK everyone assumes that all the fragments are exchangeable, and I will too.
  6. The cDNA is amplified in PCR (roughly: in each PCR cycle, most cDNA molecules give rise to a single additional copy, and a few dozen cycles are run) - this is extremely hard to model for small counts of molecules - e.g. if there is only a single molecule for a given gene, it becomes vital whether it is duplicated in the first PCR cycle or not, resulting in a multimodal distribution after PCR. Luckily, as the number of PCR cycles and the initial molecule counts grow, the PCR step becomes more predictable and Gaussian-ish. AFAIK everyone ignores this when modelling RNA-seq and just hopes the distribution remains neg. binom. and all the PCR flukes are covered by the overdispersion.
  7. The cDNA is loaded into the sequencer where some of it is sequenced, depending on sequencing depth and other things. Of the sequenced fragments some are successfully mapped to the genes of origin, some are not. Once again it does not seem unreasonable to assume that for each gene a fixed proportion of fragments survives this process, keeping the distribution neg. binom.

Throughout this discussion the neg. binom. seems to be a reasonable choice, in particular because of the natural overdispersion it provides. Could you explain why you assume the proposed alternatives would be a better model, and which particular steps of the process you are trying to account for with them? Or are there hints in the data that the neg. binom. is not a good fit?

IMHO the multinomial model is appropriate for step 7 at best, but then it would need to be a multinomial model over neg. binom. initial counts. I know that neg. binom. -> binomial is still neg. binom. (see below), though I am not sure what a multinomial distribution will do here. I would, however, guess that the negative correlations induced by the multinomial (e.g. that sequencing a lot of a single gene means fewer copies of other genes are sequenced) are negligible, since the proportion of each gene among the total reads is small (unless you made some mistake and have 90% of your reads from the ribosome or something). But maybe there is an analytic solution to neg. binom. -> multinomial?
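For reference, the thinning fact mentioned above can be written out in the neg_binomial_2 parameterisation (my phrasing): if $X \sim \mathrm{NB}_2(\mu,\phi)$ and each of the $X$ molecules survives sequencing independently with probability $q$ (binomial thinning), then the surviving count is $\mathrm{NB}_2(q\mu,\phi)$, because thinning $\mathrm{Poisson}(\lambda)$ gives $\mathrm{Poisson}(q\lambda)$ and scaling the mixing variable $\lambda \sim \mathrm{Gamma}(\phi,\phi/\mu)$ by $q$ gives $\mathrm{Gamma}(\phi,\phi/(q\mu))$; only the mean changes and the overdispersion $\phi$ is preserved.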

Also, the multinomial behavior holds only when each sample is in its own lane, but you may have multiple samples in the same lane (distinguished by barcode sequences), resulting in more complex correlations across gene counts.


#4

Yes, I agree this is probably the best simple model of what is going on at the tissue/cell level.

Exactly, I think you got the point of the logic I am using: step 7 is analogous to fishing 1 million balls from an urn that contains 1 billion balls of 20K colors. Each sample, depending on whether it was sequenced deep or shallow (or depending on molecule degradation), will extract a different number. And across many urns (patients) the number of balls of each color has been produced by a negative binomial process. So here it comes: negative binomial -> multinomial.

Yes, I am trying to do gamma -> multinomial (Stan does not have integer parameters).

That’s a good point (@Bob_Carpenter); however, the multinomial will still model the uncertainty of tubes that have been sequenced deeply (more extractions from the urn) versus tubes that have a small number of reads (which carry less information, so those data get less weight). I see, however, that this comes at the cost of complicating the model (compared to a simple Dirichlet-multinomial).

I wish; however, I don’t think so. At the moment I am implementing

This should be negligible, depending on how the lanes are mixed, etc. However, it is better to start from the main processes and build up.


#5

But this can also be modelled in the neg. binom. framework by scaling the mean, e.g.:

Y[i,j] ~ neg_binomial_2(c[i,j] * beta[i,] * X[,j], phi)

where c[i,j] is a sample- and gene-specific normalization factor that is passed as data. There are tools to calculate those from RNA-seq data (some only provide sample-specific factors, others also gene-specific ones that take into account GC bias or similar). I have never really used those, but the DESeq2 paper has some discussion of this approach. The scaling also models the different information content of each sample, since lowering the mean of a neg. binom. results in a lower signal-to-noise ratio. Why would that be inappropriate in your case?
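Concretely, a possible Stan sketch of this scaled regression (the dimensions, the priors, and writing the offset on the log scale, i.e. mean = c * exp(linear predictor), are my assumptions rather than the literal line above):

data {
  int<lower=1> G;              // genes
  int<lower=1> N;              // samples
  int<lower=1> R;              // covariates
  matrix[N, R] X;
  matrix<lower=0>[G, N] c;     // normalization factors, passed as data
  int<lower=0> Y[G, N];
}
parameters {
  matrix[G, R] beta;
  real<lower=0> phi;
}
model {
  for (i in 1:G)
    for (j in 1:N)
      Y[i, j] ~ neg_binomial_2_log(log(c[i, j]) + beta[i] * X[j]', phi);
  to_vector(beta) ~ normal(0, 2);
  phi ~ gamma(2, 0.1);
}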


#6

It might indirectly model the same two-stage NB -> multinomial generative process. I would like to involve @Bob_Carpenter in this conversation, as he may have a better idea.