Implementing heavy-tailed count distributions - numerical issues

Main question

Are there any numerically stable ways to implement logarithms of binomial coefficients / ratios of gamma functions when some of the arguments are large (> 1e6) and some are small? (You can jump to the Implementation section below to see the actual problem.)

Background

I am working with @stemangiola on bringing some heavily overdispersed count distributions to Stan, and I think we need some help, as we are way out of our depth in both the math and the numerics that have cropped up :-) If anybody’s working on a similar problem, or is interested in the results, we’d be glad to hear from you! Our motivation comes from sequencing data, but we believe the applications are possibly broader. Below is a short summary of the current state of our project and the questions we would like to seek your feedback on.

After our literature search and many failures, the most promising candidate is the generalized Waring (GWaring) distribution, which arises from further mixing of the NB’s parameters.
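Reconstructing the pmf from the naive lgamma implementation in the code further below, it reads

P(Y = y) = \frac{\Gamma(a+\rho)\,\Gamma(k+\rho)}{\Gamma(\rho)\,\Gamma(a+k+\rho)} \cdot \frac{(a)_y\,(k)_y}{(a+k+\rho)_y \, y!}, \quad y = 0, 1, 2, \ldots

with a, k, \rho > 0, where (x)_y = \Gamma(x+y)/\Gamma(x) denotes the rising factorial (Pochhammer symbol).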

The distribution has NB (negative binomial) as a limiting case.

The GWaring is neat in that, unlike many other distributions proposed for heavily overdispersed data, it is cheap to evaluate. However, there are two problems:

  1. The parameters a and k are symmetric: swapping them leads to the same distribution.
  2. Assuming a < k, the limiting case towards NB requires that \rho \rightarrow \infty and k \rightarrow \infty while \frac{k}{\rho} stays finite (a sketch of the limit follows this list). This poses challenges for evaluating the gamma function ratios in the definition as we approach the NB limit.
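Sketch of the limit in point 2, assuming a stays fixed: letting k, \rho \rightarrow \infty with \frac{k}{k+\rho} \rightarrow p \in (0, 1), the pmf above satisfies

\frac{(k)_y}{(a+k+\rho)_y} \rightarrow p^y \quad \text{and} \quad \frac{\Gamma(a+\rho)\,\Gamma(k+\rho)}{\Gamma(\rho)\,\Gamma(a+k+\rho)} \rightarrow (1-p)^a,

so the pmf tends to \binom{a+y-1}{y} p^y (1-p)^a, i.e. a negative binomial with size a and success probability p.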

I’ve tried multiple parametrizations and I am currently most happy with parametrization via \mu, \phi, r which mirrors NB, i.e.

E_{GWaring}(Y) = \mu
Var_{GWaring}(Y) = \mu + \frac{\mu^2}{\phi}

and r describing tail behavior (the lpmf approaches c - (d + \frac{1}{r}) \log{y} for large y, with c, d determined by \mu and \phi). The distribution approaches NB(\mu, \phi) as r \rightarrow 0, letting us put a prior on r that pushes towards NB. This leads to two solutions for a, k and I arbitrarily enforce a < k to break the symmetry.
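For reference, the standard moment formulas of the GWaring that get matched in this parametrization are

E(Y) = \frac{a k}{\rho - 1} \quad (\rho > 1), \qquad Var(Y) = \frac{a k\,(a + \rho - 1)(k + \rho - 1)}{(\rho - 1)^2 (\rho - 2)} \quad (\rho > 2);

equating these with \mu and \mu + \mu^2 / \phi is what produces the quadratic with two (a, k) solutions.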

Implementation

A naive implementation of the GWaring is not very stable: for small r, both k and \rho blow up as expected, and the lgamma values also explode. I realized I can code the distribution with a bunch of binomial coefficients, taking advantage of Stan’s implementation of lchoose, which avoids actually evaluating most of the lgamma calls if the difference between the two arguments is large. This helps with stability, but not enough. The complete code for both the raw distribution and our reparametrization is below:

functions {
  // elementwise lchoose over paired entries of two vectors
  vector lchoose_vec_vec(vector n, vector k) {
    vector[rows(n)] res;
    for(i in 1:rows(n)) {
      res[i] = lchoose(n[i],k[i]);
    }
    return res;
  }

  real gwaring_lpmf(int[] y, real a, real k, real rho) {
    vector[size(y)] y_vec = to_vector(y);

    // Naive implementation with lgamma
    // return sum(lgamma(a + rho) + lgamma(k + rho)
    //           - lgamma(rho) 
    //           + lgamma(a + y_vec) - lgamma(a)
    //           + lgamma(k + y_vec) - lgamma(k)
    //           - lgamma(a + k + rho + y_vec) 
    //           - lgamma(y_vec + 1)
    //           );

    // Equivalent form via binomial coefficients; lchoose avoids evaluating
    // most of the lgamma calls when its two arguments are far apart
    return sum(
      -lchoose_vec_vec(a + k + rho + y_vec - 1, a + y_vec - 1) - log(k + rho) +
      lchoose(a + rho - 1, a - 1) + log(rho) +
      lchoose_vec_vec(k + y_vec - 1, y_vec)
    );
  }

  real gwaring4_lpmf(int[] y, real mu, real phi, real r) {
    // smallest rho compatible with the given mu and phi (derived in Mathematica);
    // rho = rho_low + 1/r, so r -> 0 sends rho -> infinity (the NB limit)
    real rho_low = 2 + phi + (phi * (1 + 2 * phi + 2 * sqrt((1 + phi) * (mu + phi)))) / mu;
    real rho = rho_low + (1 / r);

    if (rho > 1e08) {
      // close enough to the limit: delegate to NB(mu, phi)
      // (Stan's mean/dispersion parametrization)
      return neg_binomial_2_lpmf(y | mu, phi);
    } else {
      // solve the moment-matching quadratic; taking the larger root for k
      // enforces a < k
      real kFirst = -phi + mu * (-2 - phi + rho);
      real kSecond = sqrt(phi^2 + mu^2 * (2 + phi - rho)^2 - 2 * mu * phi * (-2 - 3 * phi + rho + 2 * phi * rho));
      real k = (kFirst + kSecond) / (2 * phi);
      real a = mu * (rho - 1) / k; // from E(Y) = a * k / (rho - 1)

      return gwaring_lpmf(y | a, k, rho);
    }
  }
}
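For reference, the equivalence of the lchoose form to the commented-out naive form can be checked by expanding \log \binom{n}{m} = \log\Gamma(n+1) - \log\Gamma(m+1) - \log\Gamma(n-m+1) for each term:

-\log\binom{a+k+\rho+y-1}{a+y-1} - \log(k+\rho) = \log\Gamma(a+y) + \log\Gamma(k+\rho) - \log\Gamma(a+k+\rho+y)

\log\binom{a+\rho-1}{a-1} + \log\rho = \log\Gamma(a+\rho) - \log\Gamma(a) - \log\Gamma(\rho)

\log\binom{k+y-1}{y} = \log\Gamma(k+y) - \log\Gamma(k) - \log\Gamma(y+1)

Summing the right-hand sides gives exactly the naive lgamma expression.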

The math is a bit nasty, but it was done by Mathematica, so it should be OK. There were other parametrizations that had nicer formulas, but those had properties we didn’t like.
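For completeness, a minimal intercept-only model using the reparametrized lpmf might look like the sketch below (assuming the functions block above is pasted in; the priors are illustrative placeholders, not the ones from our SBC runs):

data {
  int<lower=1> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0> mu;
  real<lower=0> phi;
  real<lower=0> r;
}
model {
  // illustrative priors only
  mu ~ lognormal(3, 1);
  phi ~ gamma(2, 0.1);
  // prior mass near r = 0 pushes towards the NB limit
  r ~ exponential(10);
  y ~ gwaring4(mu, phi, r);
}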

The problem with GWaring

When I ramp up adapt_delta to 0.95, the above passes SBC (in an “intercept-only” model), with about 1 in 50 runs having 1-3 divergences or treedepth issues. With the default adapt_delta of 0.8 it has many divergences, making me worried it won’t work well as part of a larger model.

Below is an SBC plot with 1000 simulations, showing the same data at two resolutions:


The posteriors are not always nice, but usually even the runs with divergences don’t show obvious pathologies.

However, changing the cutoff on \rho for when to delegate to the negative binomial has huge consequences for the number of divergent transitions. My interpretation is that:

  • A low cutoff creates a discontinuity, as the distribution is not really NB at that point.
  • A high cutoff lets the distribution wander into areas with large \rho and k where the computation is unstable.

Below is a plot of the relative error of the NB lpmf as an approximation to gwaring for \mu = 50, \phi = 5:

The dither at the top IMHO indicates that the computation of gwaring ceases to be stable, but this happens before the relative error drops below 1e-8. While the \rho values for which this transition happens depend slightly on \mu and \phi, the relative error is always still quite big before the transition. (The three ridges with low error are just near the places where the lpmfs intersect.)

Actual question

Do you have any ideas how to implement the formula for gwaring in a more stable manner? Or do you think I am overlooking some other possible source of problems and numerical stability is actually not an issue?

Further outlook

If you are interested in further developments or if you want to get involved with this, let us know.

Other distributions
We’ve also played with the Sichel (a.k.a. Poisson-Generalized-Inverse-Gaussian), Poisson-Inverse-Gaussian, and Poisson-Tweedie distributions. The former two require evaluations of log BesselK with fractional orders and its derivatives, which I am slowly chipping away at with the generous help of @bgoodri - we’re getting close, but we are not there yet.
The Poisson-Tweedie is even more general (it has the Sichel as a special case) but requires an iterative calculation going from 0 up to the observed count.

We have some rough implementations of all of those, but more work on good parametrizations and stability is needed, and I can’t get the current versions to pass SBC. Also, all of those are super expensive to evaluate.

Related work
It seems very little work has been done on implementing anything more heavy-tailed than NB in a Bayesian context. The implementations of the Sichel and P-IG in the gamlss (frequentist) package have been helpful to get us started. There is a small amount of Bayesian prior work we found. If you know about anything we missed, please let us know!

[Table: general properties of the distributions]

Thanks to everyone who read through this till the end!


Did you try Stan’s real log_rising_factorial(real x, real n) function?

Later: Did you note that the first multiplier can be written with Pochhammer symbols as (\rho)_k / (a+\rho)_k?
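For reference, writing (x)_n = \Gamma(x+n)/\Gamma(x), the claimed identity for the first multiplier checks out:

\frac{(\rho)_k}{(a+\rho)_k} = \frac{\Gamma(\rho+k)\,/\,\Gamma(\rho)}{\Gamma(a+\rho+k)\,/\,\Gamma(a+\rho)} = \frac{\Gamma(a+\rho)\,\Gamma(k+\rho)}{\Gamma(\rho)\,\Gamma(a+k+\rho)}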

You can parameterize \rho and k as \mu * \phi and (1-\mu) * \phi with \mu \in [0,1] and \phi > 0.


In the case of heavy-tailed continuous distributions such as the horseshoe, there is evidence that a scale mixture representation can be easier for HMC than a direct implementation. Have you experimented with an explicit scale mixture of NBs?
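For concreteness, a minimal sketch of what such an explicit mixture could look like, assuming a log-normal mixing distribution on the NB mean (the mixing choice, priors, and names here are illustrative, not prescribed):

data {
  int<lower=1> N;
  int<lower=0> y[N];
}
parameters {
  real log_mu;       // population-level log mean
  real<lower=0> phi; // NB overdispersion
  real<lower=0> tau; // scale of the mixing distribution
  vector[N] z;       // per-observation latent scales (non-centered)
}
model {
  z ~ normal(0, 1);
  // each observation gets its own mean exp(log_mu + tau * z[n])
  y ~ neg_binomial_2_log(log_mu + tau * z, phi);
  // priors on log_mu, phi, tau to taste
}

Swapping neg_binomial_2_log for poisson_log (and dropping phi) would give the Poisson scale mixture mentioned further down.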


@PhilClemson worked on the numerics for the normal lcdfs recently. Maybe he has some ideas?

Thanks for the responses! I do have some follow-up questions:

Both log_rising_factorial (a.k.a. log Pochhammer) and log_falling_factorial are internally implemented in Stan as differences of lgamma calls, and their derivatives are also (as far as I can tell from the code) identical to what would be computed via an lgamma difference, so it should reduce to exactly the same computation as I have now. Do you have a reason to believe otherwise?

I am not sure I can follow your logic - what would be the motivation for such a reparametrization?

That’s an interesting perspective and something we should try, at least as a baseline. However, I just spent two hours trying to get an NB scale mixture to work in a simple setting and I am not there yet, so it looks like mixtures have their own challenges… :-(

You might also want to test a Poisson scale mixture. The likelihood of the NB is not log-concave, which may lead to multimodality, while the likelihood of the Poisson is log-concave. Switching to Poisson increases the number of parameters needed and moves the multimodality elsewhere, but might work better. If the NB or Poisson scale mixture fails, I’m interested to learn why, so if you get some experiments done you can send them to me, and I can try to understand why it would be different from the normal scale mixture.


Apologies for the late response to this.

As @bbbales2 pointed out, I ran into a similar issue, particularly when computing the derivatives of the normal lcdf, which results in a problematic ratio where you can get into 0/0 regions and the like. Abramowitz and Stegun (1964) has lots of stable numerical approximations to deal with these sorts of cases, which you might be able to use. Otherwise, in the remaining troublesome regions you can always derive Taylor expansions using Maple and join these up piecewise (a bit messy, but it works!).
