Extended Hypergeometric Distribution in Stan

Hi all,

Putting this out there in case it’s useful to someone else. I’ve implemented the Extended Hypergeometric Distribution (also known as Fisher’s Noncentral Hypergeometric Distribution) in Stan:

Pr(X = y) = \frac{\binom{m_1}{y} \binom{m_2}{n - y}e^{\alpha y}}{\sum_{x = l}^u \binom{m_1}{x} \binom{m_2}{n - x}e^{\alpha x}}
\textrm{for } y \in [l, u], \quad l = \max(0, n - m_2), \quad u = \min(n, m_1)

Here, for an urn containing N = m_1 + m_2 balls, m_1 is the number of red balls, m_2 is the number of white balls, n is the total number of balls drawn, y is the number of red balls drawn, and \alpha is the log odds ratio of drawing a red ball versus a white ball.
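
As a quick illustration of the support bounds (numbers chosen arbitrarily): with m_1 = 5, m_2 = 3 and n = 6, we get l = \max(0, 6 - 3) = 3 and u = \min(6, 5) = 5, so the only possible values of y are 3, 4 and 5 - you can't draw fewer than 3 red balls because there are only 3 white ones, and you can't draw more than 5 because there are only 5 red ones.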

I adapted the code from the R code here, and I'm certain it could be made more efficient - it's not vectorised, and I'm sure there are better, more idiomatic ways of doing some of the operations in Stan. However, it works well enough for my small dataset of 196 data points, fitting in around 8-15 seconds depending on the exact model, and I'm not seeing any divergences. Hope it's useful to someone else, in any case!

  real extended_hypergeometric_lpmf(int y, int mA, int mB, int n, real alpha) {
    // Calculate support
    int L = max(0, n - mB);
    int U = min(n, mA);

    // define range of possible x values
    int length_x = U - L + 1;
    array[length_x] int x = linspaced_int_array(length_x, L, U);

    // compute pmf
    int k = min(max(mA, mB), max(n, mA + mB - n));
    vector[k + 1] k_seq;
    for(i in 1:(k + 1)) k_seq[i] = i;
    vector[k + 1] lsum = lgamma(k_seq);

    array[length_x] int x_xplus1;
    array[length_x] int x_mAx1;
    array[length_x] int x_nx1;
    array[length_x] int x_mBnx1;

    for(i in 1:length_x) {
      x_xplus1[i] = x[i] + 1;
      x_mAx1[i] = mA - x[i] + 1;
      x_nx1[i] = n - x[i] + 1;
      x_mBnx1[i] = mB - n + x[i] + 1;
    }

    vector[length_x] g = lsum[x_xplus1] + lsum[x_mAx1] + lsum[x_nx1] + lsum[x_mBnx1];
    vector[length_x] t = (to_vector(x) * alpha) - g;
    vector[length_x] h1 = t - max(t);
    vector[length_x] h2 = h1 - log_sum_exp(h1);
    
    return h2[y - L + 1];
  }
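
In case it's not obvious from the code: lsum[i] = lgamma(i) = \log((i - 1)!), so lsum is just a lookup table of log factorials. Writing

\log \binom{m_1}{x} + \log \binom{m_2}{n - x} = \log m_1! + \log m_2! - \left[ \log x! + \log (m_1 - x)! + \log (n - x)! + \log (m_2 - n + x)! \right],

the \log m_1! + \log m_2! term is constant in x and cancels when the log probabilities are normalised, so only the bracketed part (the g vector in the code) ever needs to be computed.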

I’ve worked a bit with this. I’m not sure if you have a slightly different parameterization because I wasn’t able to view the linked article.

If I compare your implementation to BiasedUrn I get

library(BiasedUrn)
log(dFNCHypergeo(12, 25, 32, 20, 2.5))
# result -1.505209
prototaxites_extended_hypergeometric_lpmf(12, 25, 32, 20, 2.5)
# result -4.723675

Following wikipedia I tested this implementation

  real extended_hypergeometric_lpmf(int y, int m1, int m2, int n, real omega) {
    int y_min = max(0, n - m2);
    int y_max = min(n, m1);
    int length = y_max - y_min + 1;
    vector[length] P;
    real lomega = log(omega);
    int counter = 1;
    
    real ldist = lchoose(m1, y) + lchoose(m2, n - y) + y * lomega; 
    
    for (x in y_min:y_max) {
      P[counter] = lchoose(m1, x) + lchoose(m2, n - x) + x * lomega;
      counter += 1;
    }
  
    return ldist - log_sum_exp(P);
  }

Which returns the same value as the BiasedUrn package:

extended_hypergeometric_lpmf(12, 25, 32, 20, 2.5)
[1] -1.505209

Ah, I’ve parameterised this in terms of a log odds ratio, as my aim is to make inference on that scale (see A better index for analysis of co-occurrence and similarity):

x = extended_hypergeometric_lpmf(12 | 25, 32, 20, log(2.5));
# result -1.50521

I also tried programming the distribution directly following Wikipedia, but for some reason couldn't get it to match BiasedUrn, which is why I ended up with the form above. I'm not sure whether there's a reason to prefer one over the other for numerical stability.

Staying on the log scale will be best. In my version it's just a matter of removing the log(omega) call and passing lomega in directly.
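
For reference, that version would look something like this (the argument name lomega is just illustrative):

  real extended_hypergeometric_lpmf(int y, int m1, int m2, int n, real lomega) {
    int y_min = max(0, n - m2);
    int y_max = min(n, m1);
    int length = y_max - y_min + 1;
    vector[length] P;
    int counter = 1;

    // unnormalised log probability of the observed y, log odds ratio passed in directly
    real ldist = lchoose(m1, y) + lchoose(m2, n - y) + y * lomega;

    // unnormalised log probabilities over the whole support, for the normalising constant
    for (x in y_min:y_max) {
      P[counter] = lchoose(m1, x) + lchoose(m2, n - x) + x * lomega;
      counter += 1;
    }

    return ldist - log_sum_exp(P);
  }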

It's possible that your version is faster because it avoids all the calls to lchoose. I also see a few things we can do to reduce memory use: we can get rid of a bunch of arrays and just use local scalars in the loop.

  real extended_hypergeometric_opt_lpmf(int y, int mA, int mB, int n, real alpha) {
    // Calculate support
    int L = max(0, n - mB);
    int U = min(n, mA);

    // define range of possible x values
    int length_x = U - L + 1;
    array[length_x] int x = linspaced_int_array(length_x, L, U);

    // compute pmf
    int k = min(max(mA, mB), max(n, mA + mB - n));
    vector[k + 1] lsum = lgamma(linspaced_vector(k + 1, 1, k + 1));
    vector[length_x] t;

    for(i in 1:length_x) {
      int x_xplus1 = x[i] + 1;
      int x_mAx1 = mA - x[i] + 1;
      int x_nx1 = n - x[i] + 1;
      int x_mBnx1 = mB - n + x[i] + 1;

      real g = lsum[x_xplus1] + lsum[x_mAx1] + lsum[x_nx1] + lsum[x_mBnx1];
      t[i] = x[i] * alpha - g;
    }
    
    vector[length_x] h1 = t - max(t);
    return h1[y - L + 1] - log_sum_exp(h1);
  }

Comparing your version (on the log scale) with the optimised one on my dataset:

# your version
All 4 chains finished successfully.
Mean chain execution time: 4.5 seconds.
Total execution time: 4.6 seconds.
# optimised version
All 4 chains finished successfully.
Mean chain execution time: 7.8 seconds.
Total execution time: 7.9 seconds.

I'm not sure where the bottleneck might be, if there is one - I spent a while trying to eliminate the for loop, but only succeeded in slowing things down…!

Umm what happens if you change

// vector[k + 1] lsum = lgamma(linspaced_vector(k + 1, 1, k + 1));
vector[k + 1] lsum;
for(i in 1:(k + 1)) lsum[i] = lgamma(i);

Only very slightly faster:

All 4 chains finished successfully.
Mean chain execution time: 7.5 seconds.
Total execution time: 7.6 seconds.

Can you share your Stan program?

functions{
  real extended_hypergeometric_opt_lpmf(int y, int mA, int mB, int n, real alpha) {
    // Calculate support
    int L = max(0, n - mB);
    int U = min(n, mA);
    
    // define range of possible x values
    int length_x = U - L + 1;
    array[length_x] int x = linspaced_int_array(length_x, L, U);
    
    // compute pmf
    int k = min(max(mA, mB), max(n, mA + mB - n));
    vector[k + 1] lsum;
    for(i in 1:(k + 1)) lsum[i] = lgamma(i);
    
    vector[length_x] t;
    
    for(i in 1:length_x) {
      int x_xplus1 = x[i] + 1;
      int x_mAx1 = mA - x[i] + 1;
      int x_nx1 = n - x[i] + 1;
      int x_mBnx1 = mB - n + x[i] + 1;
      
      real g = lsum[x_xplus1] + lsum[x_mAx1] + lsum[x_nx1] + lsum[x_mBnx1];
      t[i] = x[i] * alpha - g;
    }
    
    vector[length_x] h1 = t - max(t);
    return h1[y - L + 1] - log_sum_exp(h1);
  }
}
data {
  // Counters
  int N;
  
  // Data
  array[N] int y;
  array[N] int mA;
  array[N] int mB;
  array[N] int n;
}
parameters {
  array[N] real alpha;
}
model {
  for(i in 1:N){
      target += normal_lpdf(alpha[i] | 0, 5);
      target += extended_hypergeometric_opt_lpmf(y[i] | mA[i], mB[i], n[i], alpha[i]);
  }
}

Would love for this to go into mainline Stan! We actually use this distribution as well, and had to implement an approximation to it in one of our models.


You can create a math issue. I think my implementation is slower in Stan because the autodiff is more optimized for @prototaxites' code. The memory usage in mine is lower, but that only matters for larger models. In stan-math we can write the derivatives by hand and probably get the best of both worlds.
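
For what it's worth, the gradient with respect to \alpha has a simple closed form that should make hand-written derivatives straightforward: because the pmf is an exponentially tilted hypergeometric,

\frac{\partial}{\partial \alpha} \log \Pr(X = y) = y - \textrm{E}[X],

where \textrm{E}[X] = \sum_{x = l}^u x \Pr(X = x) is the mean of the distribution, which can be computed from the same table of unnormalised log probabilities used for the normalising constant.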


Is it possible, in the future, to generalize it to the multivariate case (i.e. more than two colors)?