Main question
Are there any numerically stable ways to implement logarithms of binomial coefficients/ ratios of gamma functions when some of the arguments are large (>1e6) and some small? (You can jump to Implementation section below to see the actual problem).
Background
I am working with @stemangiola on bringing some heavily overdispersed count distributions to Stan and I think we need some help as we are way out of our depth in both math and numerics that cropped up :-) If anybody’s working on a similar problem, or is interested in the results we’d be glad to hear from you! Our motivation comes from sequencing data, but we believe the applications are possibly broader. Below is a short summary of the current state of our project and questions we would like to seek your feedback on.
After our literature search and many failures, the most promising candidate is the generalized Waring distribution, which arises from further mixing of NB’s parameters:
The distribution has NB (negative binomial) as a limiting case.
The GWaring is neat in that unlike many other distributions proposed for heavily overdispersed data, it is cheap to evaluate. However, there are two problems:
- The paremeters a and k are symmetric - swapping them leads to the same distribution.
- Assuming a < k, the limiting case towards NB requires that \rho \rightarrow \infty, k \rightarrow \infty while \frac{k}{\rho} stays finite. This poses challenges for evaluating the gamma function ratios in the definition as we approach the NB limit.
I’ve tried multiple parametrizations and I am currently most happy with parametrization via \mu, \phi, r which mirrors NB, i.e.
and r describing tail behavior (the lpmf approaches c - (d + \frac{1}{r}) \log{y} for large y and some c,d which are determined by \mu and \phi. The distribution approaches NB(\mu,\phi) as r \rightarrow 0 (letting us to put a prior on r pushing towards NB). This leads to two solutions for a,k and I arbitrarily enforce a < k to break the symmetry.
Implementation:
Naive implementation of the GWaring is not very stable: for small r, both k and \rho blow up as expected and the lgamma
values also explode. I’ve realized I can code the distribution with a bunch of binomial coefficients, taking advantage of Stan’s implementation of lchoose
which avoids actually evaluating most of the lgamma
if the difference between the two arguments is large. This helps with stability, but not enough. The complete code for both the raw distribution and our reparametrization is below:
functions {
vector lchoose_vec_vec(vector n, vector k) {
vector[rows(n)] res;
for(i in 1:rows(n)) {
res[i] = lchoose(n[i],k[i]);
}
return res;
}
real gwaring_lpmf(int[] y, real a, real k, real rho) {
vector[size(y)] y_vec = to_vector(y);
// Naive implementation with lgamma
// return sum(lgamma(a + rho) + lgamma(k + rho)
// - lgamma(rho)
// + lgamma(a + y_vec) - lgamma(a)
// + lgamma(k + y_vec) - lgamma(k)
// - lgamma(a + k + rho + y_vec)
// - lgamma(y_vec + 1)
// );
return sum(
-lchoose_vec_vec(a + k + rho + y_vec - 1, a + y_vec - 1) - log(k + rho) +
lchoose(a + rho - 1, a - 1) + log(rho) +
lchoose_vec_vec(k + y_vec - 1, y_vec)
);
}
real gwaring4_lpmf(int[] y, real mu, real phi, real r) {
real rho_low = 2 + phi + (phi * (1 + 2 * phi + 2 * sqrt((1 + phi) * (mu + phi)))) / mu;
real rho = rho_low + (1 / r);
if(rho > 1e08) {
return neg_binomial_lpmf(y | mu, phi);
} else {
real kFirst = -phi + mu*(-2-phi+rho);
real kSecond = sqrt(phi^2 + mu^2 * (2 + phi - rho)^2 - 2 * mu * phi * (-2 -3 * phi + rho + 2 * phi * rho));
real k = (kFirst + kSecond) / (2 * phi);
real a = mu * (rho - 1) / k;
return gwaring_lpmf(y | a, k, rho);
}
}
}
The math is a bit nasty, but it was done by Mathematica, so it should be OK. There were other parametrizations that had nicer formulas, but those had properties we didn’t like.
The problem with GWaring
When I ramp up adapt_delta = 0.95
the above passes SBC (in an “intercept-only” model) with about 1 in 50 runs having 1-3 divergences or treedepth issues, and it has many divergences with the default adapt_delta
of 0.8 making me worried it won’t work great as a part of a larger model.
Below is an SBC plot with 1000 simulations, showing the same data at two resolutions:
.
The posteriors are not always nice, but usually even the runs with divergences don’t show obvious pathologies
However, changing the cutoff on \rho for when to delegate to negative binomial has huge consequences on the number of divergent transitions, my interpretation is that:
- Low cutoff creates a discontinuity as the distribution is not really NB at that point
- High cutoff lets the distribution wander into areas with large \rho and k where the computation is unstable.
Below is a plot of the relative error of the NB lpmf as an approximation to gwaring for \mu = 50, \phi = 5
The dither at the top IMHO indicates that the computation of gwaring ceases to be stable, but this happens before the relative error drops below 1e-8. While the \rho values for which this transition happens are slightly dependent on \mu and \phi the relative error is always still quite big before the transition. (the three ridges with low error are just near the places where the lpmf’s intersect).
Actual question
Do you have any ideas how to implement the formula for gwaring in a more stable manner? Or do you think I am overlooking some other possible source of problems and numerical stability is actually not an issue?
Further outlook
If you are interested in further developments or if you want to get involved with this, let us know.
Other distributions
We’ve also played with Sichel a.k.a Poisson-Generalized-Inverse-Gaussin, Poisson-Inverse-Gaussian and Poisson-Tweedie. The former two require evaluations of log BesselK with fractional orders and its derivatives, which I am slowly chopping at with the generous help of @bgoodri - we’re getting close, but we are not there yet.
The Poisson-Tweedie is even more general (has Sichel as a special case) but requires iterative calculation going from 0 to the observed count.
We have some rough implementations of all of those, but some more work on good parametrizations and stability is needed there and I can’t get the current versions we have to pass SBC. Also all of those are super expensive to evaluate.
Related work
It seems very little work has been done on implementing anything more heavy tailed than NB in a Bayesian context. The implementations of Sichel and P-IG in the gamlss
(frequentist) package have been helpful to get us started. There is a small amount of Bayesian prior work we found. If you know about anything we missed, let us know, please!
- http://doi.wiley.com/10.1111/rssa.12057 - Perrakis 2015, custom MCMC/INLA for P-IG
- http://link.springer.com/10.1007/s11004-019-09795-8 - Hystad 2019, custom MCMC for Sichel and P-IG
- http://www.nature.com/articles/ismej200869 - Quince 2008, custom MCMC for Sichel and P-IG
- http://www.tandfonline.com/doi/abs/10.1080/00949655.2011.600311 - Font 2013, P-IG in WinBugs
- http://arxiv.org/abs/1910.06008 - Huang, 2019 - a custom Metropolis Hastings for a very flexible Conway-Maxwell-Poisson distribution (I didn’t fully understand the distribution)
- https://doi.org/10.1080/19439962.2019.1638475 Ash 2019, Poisson-Weibull and Poisson-LogNormal in WinBugs
- http://www.tandfonline.com/doi/abs/10.1080/17461391.2011.589473 Castillo 2013, non-generalized Waring fitted with WinBugs
General properties of the distributions:
- GWaring
- https://rss.onlinelibrary.wiley.com/doi/abs/10.2307/2345247, Irwin 1975
- https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0167570, Vílchez-López 2016, R package for (frequentist) GWaring
- https://doi.org/10.1007/s11192-005-0250-y, Burell 2005
- Sichel
- Poisson-Tweedie
- https://onlinelibrary.wiley.com/doi/abs/10.1002/env.1036, El Shaarawi 2011
Thanks to everyone who read through this till the end!