Trying to generalise the dirichlet-multinomial (non-analytical) framework (replacing Dirichlet with other distributions)

With attempting to get a more robust noise model for RNA sequencing data I am trying to build a hierarchical model based on multinomial.

Since the explicit formulation of dirichlet-multinomial is of the form

parameters {
vector[G] theta[N];
vector[G] lambda;
real sigma;
sum(lambda) ~ normal(0,0.001 * G); // Parameter of interest; // Soft constrain sum to 0
theta[n] ~dirichlet( softmax(lambda) * sigma); 
counts[n] ~ multinomial( theta[n])

I am trying to generalising first to normal-multinomial (with the final goal of using the robust student-t finally building an improved robust noise model, maybe)

parameters {
vector[G] theta[N];
vector[G] lambda;
real sigma;
sum(lambda) ~ normal(0,0.001 * G); // Parameter of interest  // Soft constrain sum to 0
theta[n] ~ normal( lambda, sigma );
//sum(theta[n]) ~ normal(0,0.001 * G); // is this needed? // Soft constrain sum to 0
counts[n] ~ multinomial( softmax( theta[n] ) )

However i tend to get poor mixing, with high tree depth needed mainly because, I have correlation between theta and lambda (in the picture I use non-centered form for theta)

The questions are:

  • why the dirichlet-multinomial is well defined while the normal-multinomial is not?
  • is this route feasible or there is a unsolvable indeterminability?

P.S. G (the number of components of the system, genes, is a big number on the thousands)

The Dirichlet - multinomial is a multivariate extension of the beta- binomial distribution.
This allows the noise get absorbed by the dirichlet-distribution. Especially as model in above as measurement noise. The robust student-t is likely to be a model noise. Thus we talking about two
different things. It’s a matter of measurement variances and their heterogeneity, if the multinomial distribution alone is suitable.


can I ask you to expand on this?

Also here I am trying to interpret but expanding this would help a lot.

The noise model idea is in principle quite simple (not necessarily practical). The data is produced multinomially (there are many genes copies in a liquid solution and they get sequenced probing the content of that DNA solution). The latent rates of abundance can have a quite sparse noise, therefore:

  • the hierarchical Dirichlet noise model (after the multinomial distribution) does not have enough long tails to be robust against outliers when present, and it also does not model well the variance in the data forgetting about outliers.
  • a more convenient negative-binomial-only model is also not robust against outliers, does not have long enough tails

That’s why I thought about modelling first the multinomial process and revealing the noise model for the rates that can have fat tails.

Suppose a count distribution. You may model it with a Poisson distribution, when then variance approx. the expected value or a negative binomial distribution when the variance is (significantly) greater than the expected value. Recall the negative binomial can be seen as a Poisson gamma mixture.
But in both cases we are able to model the mu (=> lambda) as normal or student-t distributed.
Consider the dirichlet-multinomial distribution now as negative binomial distribution and the multinomial alone as Poisson distribution to get the picture. (I know it’s a extension of the beta binomial).
In case of the kalman filter, the choice is the distribution is the measurement noise and the distribution
of mu is the model noise. (In negative binomial model, its integrated out, in Dirichlet multinomial not)

If you have outliers or the variances in the categories then yes this may result in problems.
We have to identify is it just that we

a) have outliers
b) different variances in the categories
c) a) + b)

I myself integrated out some Poisson mixtures with exponential distribution / Sichel / Lindley /… which support long tails. Since the multinomial distribution can be model as Poisson distribution those distributions may be modeled in a multinomial Poisson “extended mixture” transformation as well.
(See SG BAKER 1994, the Multinomial Poisson transformation)


Just to make sure to avoid misunderstanding. I this settings for example

int counts[N, G];
parameters {
vector[G] lambda;
real sigma; // This could ideally be real sigma[G];
lambda ~ gamma_log(.., ..); // log(Gamma distribution)
counts[n] ~ neg_binomial_2_log( lambda, 1/sqrt(sigma) );

The case is

a) counts[g] for lambda[g] can sometimes be consistent with neg_binomial_2_log (grey and red here are two different groups in the data)


and sometimes can include “outliers” (gold standard method evaluate the difference between the following instance more significant than the previous instance)


b) We ideally have lambda-wise variances (sigma of dimension G) with the relation lambda-sigma that follows a known trend trend (sigmoid-like)


However, it is mainly point (a) that so far I could not improve

This could be the answer. This reading can be also of interest “Analysis of Long-Tailed Count Data by Poisson

I will read this with interest (even though the high concentration of formulae tells me it might be out of reach for me).

What do you think would be the advantage of modelling count data as multinomial (based models) rather than poisson (based models) if the number of components is high (G > 10.000)?

A frst example could be the difference in exposure between replicates (N) is automatically modelled for multinomial-like, while for poisson-like-models it has to be explicitly modelled with a parameter (exposure[N])

The choice is yours:

I coded a few of them, and also built some of my own. Why not try something common,
the PIG (Poisson inverse gamma), also in R, the lindsey Poisson, the Shanker Poisson,
Poisson Transmuted Exponential Family, and some more.
Just fit them against a intercept with optim in GNU R and choose the “best” one.
I can share some of them, but I just coded them up, to see how they perform in a case.

Well count data usually shares infinite number of counts, so a multinomial is not appropriate. However we
cannot have infinite number of counts in reality. So we could cut them off, either case we make an error.
There is a link between both, and sometimes its a question of efficiency to choose a Poisson transformation to model a multinomial distribution.

I see already a huge number of possibilities to model your problem. I usually start with most minimalist change to solve the biggest issue.

1 Like

Wow, there is a lot

Given that I am new to all mixed-poissons (except NB) I am happy to start with your suggestions

  • Poisson-inverse gamma
  • Poisson Transmuted Exponential Family
  • Poisson–Lindley
  • Poisson–Shanker

As I am not too familiar with optim from R (I am not sure (i) what optimisation function to use with these quite complex distributions, and (ii) where to find a comparative metric of fit) I may opt for Stan directly so I can generated quantities, and use LOO eventually.

I will try to code such custom distributions up (if you want to share some code would be great so I can speed up this tests).

Can you rephrase this? So I am able to understand better your later statements. I am happy to learn from you on this.

At really high level, is it fair to say that (sorry for the stretch)

  • multinomial ~ serie of poissons (no tails)
  • dirichlet-multinomial ~ serie of negative binomials with a unique scalar over-dispersion term (short tails)
  • (hypothetical) studentT-multinomial ~ series of Poisson-inverse gammas (or other poisson long mixtures)

I am glad that at least one of us do ;)

You might try looking into the Grifith/Engen/McCloskey distribution which arises from the Pitmman-Yor process, which is a generalization of the dirichlet multinomial. There are a number of special cases (3 at least I think, maybe more) that have closed forms and may work for you. It will take me a bit to dig up the references though…

Here is some more discussion on the relation between the logistic-normal and the Dirichlet multinomial distribution

1 Like
PIG <- function(y, mu, tau) {
  a <- sqrt(1 / tau^2 + 2*mu / tau)
  (2*a/pi)^0.5 * mu^y*exp(1/tau)*besselK(a,y-0.5)/((a*tau)^y*factorial(y))
PIG.pgf <- function(y, mu, tau) {
  return (2*tau*mu/(1+2*tau*mu) *
          (1-3/(2*y)) * PIG.pgf(y-1,mu,tau) +
           mu^2 / (1+2*tau*mu) / (y*(y-1)) * PIG.pgf(y-2,mu,tau));

PIG.pgf2 <- function(y, mu, tau) {

gamlss.dist::dPIG(r, exp(mup[i]), exp(mup[i])*beta+tau) 

real PIG(int y, real mu, real tau);

real PIG(int y, real mu, real tau) {
    return exp(1/tau*(1-(1+2*tau*mu)^0.5));
    return mu*(1+2*tau*mu)^-0.5*PIG(0,mu,tau);
  return 2.*tau*mu/(1+2*tau*mu) *
          (1. - 3. /(2.*y)) * PIG(y-1,mu,tau) +
           mu^2 / (1+2.*tau*mu) / (y*(y-1)) * PIG(y-2,mu,tau);

beta  ~ student_t(4, 0, 1);
tau ~ student_t(4, 0, 1);
target += log(PIG(y, exp(mup[i]), exp(mup[i])*beta+tau));

LinseyPoisson<- function(x,theta,a) gamma(x+a)*theta^(a+1)*(a+(x+a)/(1+theta))/(factorial(x)*gamma(a+1)*(1+theta)^(1+x+a))

LinseyPoisson_Log <- function(x,theta,a) (1 + a)*log(theta) - (1 + a + x)*log(1 + theta) + log(a + (a + x)/(1 + theta)) - 
  lgamma(x+1) - lgamma(1 + a) + lgamma(a + x)
la = exp(mup);  
alpha_inv ~ cauchy(0, 1);
real<lower=0> alpha = 1.0 / alpha_inv;

real LinseyPoisson_log(int x, real theta, real a) {
  return (1 + a)*log(theta) - (1 + a + x)*log(1 + theta) + log(a + (a + x)/(1 + theta)) - 
          lgamma(x+1) - lgamma(1 + a) + lgamma(a + x);
target += LinseyPoisson_log(y, 0.5/la[i]*(alpha+la[i]*(-1+sqrt((4*la[i]+square(alpha+la[i]))/square(la[i])))), alpha);

// Poisson Transmuted Exponential
pow<-function(x,k) ((x)^(k))
square<-function(x) ((x)**2)
#PTED <- function(x,theta,alpha) theta*((1-alpha)/pow(1+theta,x+1)+2*alpha/pow(1+2*theta,x+1))
PTED <- function(x,theta,alpha) exp( log(theta) + log( (1-alpha) / pow(1 + theta, x+1) + 2 * alpha / pow(1 + 2*theta,x+1)))

vector<lower=-1,upper=1>[2] alpha;
real PTED_log(int x, real theta, real alpha) {
  return log(theta) + log( (1-alpha) / pow(1 + theta, x+1) + 2 * alpha / pow(1 + 2*theta,x+1));
target += PTED_log(y[1], (2.0 - alpha[1]) / (2.0 * exp(mup[i, 1])), alpha[1]);
# shanker
shanker <-function(x,theta)
real shanker_log(int x, real theta) {
  return 2*log(theta)-(2+x)*log(1+theta)-log(1+square(theta))+log(1+x+theta+square(theta));
real la1 = exp(mup[i, 1]);

real theta1 =(2 + (2*(1 - 3*pow(la1,2)))/pow(1 + (45*pow(la1,2))/2. + (3*sqrt(3)*sqrt(pow(la1,2)*(8 + 71*pow(la1,2) + 4*pow(la1,4))))/2.,0.3333333333333333) + 
 pow(2,0.6666666666666666)*pow(2 + 45*pow(la1,2) + 3*sqrt(3)*sqrt(pow(la1,2)*(8 + 71*pow(la1,2) + 4*pow(la1,4))),0.3333333333333333))/(6. *la1) ;
target += shanker_log(y, theta1); 

Here’s how I verified the Shanker-Poisson:

Sorry I not have enough time to write a lot of documentation to it.

1 Like


I am not really versed in these kind of analysis, but I will make some addvertisement for the work of a colleague of mine, using multivariate-lognormal mixture.

They cite Chib and Winkelmann (2001), which developped a poisson-multivariate t model.

I hope it might be useful for your problem :)


1 Like

This speaks to Andrew’s only religious lexical entry:

1 Like

Wow, that post makes me feel better!

Strange feeling when a problem is in principle quite simple but there is a high wall in front and the entrance of a maze next to it.

Thanks @andre.pfeuffer ,

I tried to code those monster functions in a tidy and more computationally efficient way (still working on them).

For function name I was consistent with the neg_binomial_2_log_lpmf

  • poisson_inverse_gaussian: 3 parameters (it is a recursive probabilistic calculation, interesting; still to be transformed in log)
   real poisson_inverse_gaussian_log_original_lpdf(int y, real mu, real beta, real tau) {

  	  real exp_mu = exp(mu);
  	  real exp_mu_beta_tau = exp_mu*beta+tau;
  	  real exp_mu_beta_tau_exp_mu_2 = exp_mu_beta_tau*exp_mu*2.0;
  	  real prob;

      if(y==0) prob =

      if(y==1) prob =
        exp_mu*(1+exp_mu_beta_tau_exp_mu_2)^-0.5 *
        exp(poisson_inverse_gaussian_log_lpdf(0,mu,beta, tau));

      prob =
        exp_mu_beta_tau_exp_mu_2 /
        (1+exp_mu_beta_tau_exp_mu_2) *
        (1. - 3. /(2.*y)) *
        exp(poisson_inverse_gaussian_log_lpdf(y-1,mu,beta, tau)) +
        exp_mu^2 /
        (1+exp_mu_beta_tau_exp_mu_2) /
        (y*(y-1)) *
        exp(poisson_inverse_gaussian_log_lpdf(y-2,mu,beta, tau));

      return log(prob);

  	real poisson_inverse_gaussian_log_lpdf(int y, real mu, real beta, real tau) {

  	  real mu = mu;
  	  real mu_beta_tau = mu + log(beta+tau);
  	  real mu_beta_tau_mu_2 = mu_beta_tau + mu + 2.0;
  	  real prob;

      if(y==0) prob =
        1 / mu_beta_tau * (1-(1+mu_beta_tau_mu_2)^0.5);

      if(y==1) prob =
        mu -
        ( 0.5 * log1p(mu_beta_tau_mu_2)) +
        poisson_inverse_gaussian_log_lpdf(0,mu,beta, tau);

      prob =
        log(mu_beta_tau_mu_2) -
        log1p(mu_beta_tau_mu_2) +
        log_sum_exp(0.0, - ( log(3.) - log(2.) + log(y) ) ) +
        poisson_inverse_gaussian_log_lpdf(y-1,mu,beta, tau),

        ( 2 * mu ) -
        log1p(mu_beta_tau_mu_2) -
        log(y) + log(y-1) +
        poisson_inverse_gaussian_log_lpdf(y-2,mu,beta, tau)

      return prob;

    int poisson_inverse_gaussian_log_rng(int y, real mu, real beta, real tau){
      return poisson(inv_gaussian_rng(.., ..)); // real mu, real beta, real tau
  • poisson_lindley: 2 parameters
    real poisson_lindley_log_lpmf(int y, real mu, real alpha) {

      real exp_mu = exp(mu);
      real theta = 0.5/exp_mu[i]*(alpha+exp_mu[i]*(-1+sqrt((4*exp_mu[i]+square(alpha+exp_mu[i]))/square(exp_mu[i]))));
      real alpha_y = alpha + y;

      return (1 + alpha)*log(theta) - (1 + alpha_y)*log(1 + theta) + log(alpha + (alpha_y)/(1 + theta)) - lgamma(y+1) - lgamma(1 + alpha) + lgamma(alpha_y);
  • poisson_transmutedExponential: 2 parameters
    real poisson_transmutedExponential_log_lpmf(int y, real mu, real alpha) {

      real exp_mu = exp(mu);
      real theta = (2.0 - alpha) / (2.0 * exp_mu);

      return log(theta) + log( (1-alpha) / pow(1 + theta, y+1) + 2 * alpha / pow(1 + 2*theta,y+1));

int poisson_transmutedExponential_log_rng(real mu, real alpha) {

    Step 1: Generate ui from U(0, 1), i = 1, 2, …
    Step 2: Corresponding to each ui, compute .
    Step 3: Generate xi from .
    Step 4: Repeat Step 1, 2 and 3 to get a sample from 

  • poisson_shanker: 1 parameter (am I missing something?)
    real poisson_shanker_log_lpmf(int y, real mu) {

      real exp_mu = exp(mu);
      real exp_mu_pow_2 = pow(exp_mu,2);
      real exp_mu_pow_4 = pow(exp_mu,4);
      real m = 3*sqrt(3)*sqrt(exp_mu_pow_2*(8 + 71*exp_mu_pow_2 + 4*exp_mu_pow_4))

      real theta =
        2 +
        (2*(1 - 3*exp_mu_pow_2)) /
        pow( 1 + (45*exp_mu_pow_2) / 2. + m / 2., 0.3333333333333333 ) +
        pow(2,0.6666666666666666) *
        pow( 2 + 45*exp_mu_pow_2 + m, 0.3333333333333333 )
      ) /
      (6. *exp_mu) ;

      return 2*log(theta)-(2+y)*log(1+theta)-log(1+square(theta))+log(1+y+theta+square(theta));

I am researching how to develop generated quantities from them. Is there any general advise?

They are essential especially for discrete data, for model checking.

1 Like

Indeed. I was totally awed by the amount of research on this topic, and it seems like its been approached by a couple of different fields (coalescent theory,
mathematical ecology, bayesian nonparametrics, etc).

I would need to look into this more deeply, but at first glance, it does not seem like the Poisson is the right distribution for your problem since the total number of read counts is not a random variable. The important distinction is that in compositional data the elements of the vectors have a negative correlation, since an increased abundance of some sequences will mean a decrease in others.

In general it’s a lot easier to generate from mixture distributions than to write down closed form densities! For the poisson inverse gamma distribution for example, you can generate a inverse gamma sample, and then use that to generate a poisson sample.

  • With read counts we intend observations (exposure, how much DNA molecules you pick from solution and read/sequence), of course that is data and not a random variable. (reads are produces “reading” the DNA molecules in a liquid solution in a tube)
  • If you mean the total number of DNA molecules in the tube (of which you sample a small part) well, that is a random variable that we will never know and we don’t necessarily care about. We don’t know how many cells were mushed for a tube, we don’t know how many molecules were destroyed in the process and so on. What we really care is modelling the rate of abundance of each molecule. In this sense we are sampling from a multinomial where the underlying probabilities have really fat tails (for some molecules)

Indeed, however many people argue that since you have many different molecule types (n = 20K) the negative correlation is negligible. Maybe also because of Bob’s efforts I got attached to the concept of multinomial and trying to extend it with poor results.

Ya that’s the only one I could write straight away :) for the others I am looking in the literature, maybe there is something on those lines.

I copied the distribution functions directly out of some models.The Shanker has two parameters. One is integrated out. The Poisson Shanker has one parameter.
See (1.1) here:

It is important to fit these distributions with their correct link function. It’s not exp(mu), eg. in case of the Poisson distribution in most cases. One has to calculate the Expected Values of the distribution and solve the equation with the parameters to get the correct link function.

I don’t have their generated quantities functions. I coded them up, checked their LOOIC and
move along, when not suitable to my problem.

Its the Poisson Inverse Gaussian (PIG), a very nice distribution. Sorry, I’m a bit overloaded at the moment.

True, but I would be concerned if you are needing to use a distribution with a heavier right tail than the gamma that these results may not hold up. (Dirichlet is the normalized gamma distribution).

Ah, I see you had that one. It looks like there are expressions for the Lindley in terms of the Lambert W:

It’s a bit of a stretch to call the Lambert W closed form, but there is a boost function for it, and it wouldn’t be too hard to add a Stan header that does it. Since it’s in the generated quantities, you don’t even need the derivative. Something like

template <typename T1>
inline typename boost::math::tools::promote_args<T1>::type
lambert_w0(const T1& z,
            std::ostream* pstream__
            ) {
  return boost::math::lambert_w0(z);

@andre.pfeuffer after studying a bit the PIG distribution I have three questions:

  1. ======================================================
    the PIG (poisson-inverse-gaussian) proposed by you has three parameters. The inverse gaussian (and therefore its mixture with poisson) has two parameters as far as I understand

Is your distribution possibly a (?):

  • Sichel (or poisson-generalised-inverse-gaussian) distribution,
  • zero adjusted (hurdle), or
  • zero inflated versions of the Poisson-inverse Gaussian distribution
  1. ======================================================

If I want to create a generated quantities of the kind

x = GIG_rng(.., .., ..);
y = poisson_rng(x);

I need to know the correspondence between your implementation and the standard GIG parametrisation (assuming that you are using the generalised inverse gaussian)

Assuming that your PIG refers to the poisson-generalised-inverse-gaussian (having three parameters)

can I ask from what source this implementation has been taken?

Because a method for random numbers generator from GIG uses natural parameters (see Generalized inverse Gaussian distribution - Wikipedia for reference).

  1. ======================================================

Plotting NB vs. PIG I was able to see heavier tails right of PIG versus NB, nut thinner left tails. What am I missing?



rPIG(1000, mu = 500, sigma = 2) %>% 
  as_tibble() %>% 
  rename(PIG = value) %>% 
    rZIPIG(1000, mu = 500, sigma = 2, nu = 0.3) %>% 
      as_tibble() %>% 
      rename(ZIPIG = value)
  ) %>% 
    rnegbin(1000, mu = 500, theta = 1) %>% 
    as_tibble() %>% 
      rename(NB = value)
  ) %>% 
  gather(Distribution, `Generated samples`) %>% 
  ggplot(aes(`Generated samples` + 1, color = Distribution)) + 
  geom_vline(xintercept = 500, linetype = "dashed") +
  geom_density() + 


@martinmodrak FYI


Its two three years ago, I wrote them up and never used them since:

1 Like