# Dirichlet-Multinomial, integrating out individual probabilities or not

I am working towards a larger model that contains a Dirichlet-Multinomial distribution, and I wanted to investigate if it is faster to use the analytical form of the distribution that integrates out individual probabilities, henceforth denoted “implicit” (imp) code, or explicitly (exp) include the individual probabilities, denoted {\bf{r}}. That is, either

P\left( {{\bf{X}}|{\rm{\alpha }}} \right) = {{\Gamma \left( {\sum {{\alpha _k}} } \right)\Gamma \left( {N + 1} \right)} \over {\Gamma \left( {N + \sum {{\alpha _k}} } \right)}}\prod\limits_{k = 1}^m {{{\Gamma ({X_k} + {\alpha _k})} \over {\Gamma ({\alpha _k})\Gamma \left( {{X_k} + 1} \right)}}}

or

\eqalign{ & {\bf{X}} \sim \mathrm{Multinomial}\left( {\bf{r}} \right) \cr & {\bf{r}} \sim \mathrm{Dirichlet}\left( {\rm{\alpha }} \right) \cr}

Here is what I found in a simulation study with a uniform Dirichlet process, i.e. identical concentration parameters {\alpha _1} = {\alpha _2} = ... = {\alpha _m}. The example is using and m=100 compartments and N=1000 trials. Compilation time is not included.

The y axis shows computation time per effective sample (timeperneff). The conclusions are:

• The explicit approach takes longer for small and large alphas but is faster for intermediate alphas.
• The difference is substantial (notice the log y axis.)
• The implicit approach is quite stable in terms of computation time per effective sample.
• The explicit approach also suffers from divergent transitions (isdev) at low alphas.

Questions:
All and all, it seems like a safer choice to integrate out the individual probabilities, which are not of interest. But it would depend on the data what is the faster approach. Does anyone have experiences from real data? Are there pitfalls ahead?
Is there an intuitive explanation for the non-linear effect of alpha on computational efficiency under the explicit approach?
Is there a way to speed up the calculations of the implicit approach that reduces the number of log-gamma functions that must be evaluated? See implicit code below.
The divergent transitions at low alphas are likely a result of steep cut-offs at zero in the conditional distribution of some r. Is there a straightforward reparameterization that could solve this, given that r is a simplex?

Explicit code:

data {
int<lower=0> m;
int<lower=0> X[m];
}

parameters {
real logalpha;
simplex[m] r;
}

transformed parameters{
real<lower=0> alpha=exp(logalpha);
vector<lower=0>[m] alphavec;
for (i in 1:m){
alphavec[i]=alpha;

}

}
model {

X ~ multinomial(r);
r ~ dirichlet(alphavec);
alpha ~ gamma(.01,.01);
target += logalpha;
}



Implicit code:

data {
int<lower=0> m;
int<lower=0> X[m];
int<lower=0> sumX;

}

parameters {
real logalpha;

}

transformed parameters{
real<lower=0> alpha=exp(logalpha);
real<lower=0> sumalpha;
vector<lower=0>[m] alphavec;
vector<lower=0>[m] xplusaplha;

for (i in 1:m){
alphavec[i]=alpha;
xplusaplha[i]=X[i]+alphavec[i];
}
sumalpha=sum(alphavec);

}
model {

target+= lgamma(sumalpha)  -lgamma(sumX+sumalpha) +sum(lgamma(xplusaplha))- sum(lgamma(alphavec)); //- sumlgammaXplus1+ lgamma(sumX+1);
target += logalpha;
alpha ~ gamma(.01,.01);
}



1 Like

The short answer is “geometry”. Depending on data sizes, priors, etc., the shape of the high probability mass within the posterior will change and that’s what controls exploration efficiency (effective sample size per iteration). With very high curvature, especially if it varies, Stan has to take very small steps. I’m afraid I don’t have any intuition for this particular case.

You can speed up your implicit transformed parameter blocks by taking advantage of the fact that you’re using a degenerate Dirichlet with a single value of alpha. We should really have both that case and the marginalized out Dirichlet-multinomial case hard coded with analytic gradients in C++. In your case, you don’t ever have to create alphavec. You can just code things directly this way:

transformed parameters{
real<lower=0> alpha = exp(logalpha);
real<lower=0> sumalpha = m * alpha;
vector<lower=0>[m] xplusaplha = X + alpha;
}


and then you can replace sum(lgamma(alpha)) with m * lgamma(alpha). This should be a lot faster, but might not make much difference overall given the number of evals going on here.

Rather than coding logalpha yourself with the Jacobian (target += logalpha), it’s much easier to just declare

parameters {
real<lower=0> alpha;


which does the same thing under the hood implicitly, but doesn’t require a logalpha value.

You may also want to consider a more suitable prior than gamma(0.01, 0.01) on alpha. That has a mean of 1 and a sd of 10, which doesn’t give you a lot of latitude on the larger side. We’ve been inclined to use things like a pareto(0.1, 1.5), which makes alpha uniform on the log scale and has infinite variance (see the discussion of hierarchical beta priors in the rats example in BDA3, chapter 5).

Thanks Bob!

The next step of the model development relaxes the assumption of a single alpha, so I started the code development with that in mind.

Has the under-the-hood log transformation always been implemented in stan? I was convinced I had in the past circumvented divergent transitions this way, but now when I compare e.g.

data {
real<lower=0> shape;
real<lower=0> rate;
}
parameters{
real logx;
}
transformed parameters {
real<lower=0> x;
x=exp(logx);
}
model {
x~gamma(shape,rate);
target +=logx;
}


to

data {
real<lower=0> shape;
real<lower=0> rate;
}
parameters {
real<lower=0> x;
}
model {
x~gamma(shape,rate);
}
, the approaches seems to be equivalent, giving roughly the same effective sample size and, when shape=rate="something small" roughly the same number of divergent transitions. Is there a similar e.g. logit-transform implemented for parameters with also an upper bound?


Thanks for the suggestion on the prior. The gamma (0.01,0.01) was a placeholder I included because, when analyzing simulated data generated with a high alpha, a prior uniform over alpha didn’t work. Essentially, such data isn’t strong enough to differentiate between alpha=“something large” and alpha=“infinity”, i.e. a uniform multinational, and the sampler wanders off. In your experience, are such issues avoided with a prior uniform over log-alpha?

BTW, how is a pareto(0.1,1.5) uniform on the log-scale? (I sampled some alphas from such distribution and plot a histogram below). Or do you mean that the tail is uniform?