Extended Hypergeometric Distribution in Stan

Hi all,

Putting this out there in case it’s useful to someone else. I’ve implemented the Extended Hypergeometric Distribution (also known as Fisher’s Noncentral Hypergeometric Distribution) in Stan:

Pr(X = y) = \frac{\binom{m_1}{y} \binom{m_2}{n - y}e^{\alpha y}}{\sum_{x = l}^u \binom{m_1}{x} \binom{m_2}{n - x}e^{\alpha x}}
\textrm{for } y \in [l, u], \quad l = \max(0, n - m_2), \quad u = \min(n, m_1)

Here, for an urn containing N = m_1 + m_2 balls, m_1 is the number of red balls, m_2 is the number of white balls, n is the total number of balls drawn, y is the number of red balls drawn, and \alpha is the log odds ratio of drawing a red ball versus a white ball.
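
As a quick illustration of the support bounds (numbers chosen arbitrarily): with m_1 = 5, m_2 = 3 and n = 6, we get l = \max(0, 6 - 3) = 3 and u = \min(6, 5) = 5, so the only possible values of y are 3, 4 and 5 - you can't draw fewer than 3 red balls because there are only 3 white ones, and you can't draw more than 5 because there are only 5 red ones.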

I adapted the code from the R code here, and I'm certain it could be made more efficient - it's not vectorised, and I'm sure there are better, more idiomatic ways of doing some of the operations in Stan. However, it works well enough for my small dataset of 196 data points, fitting in around 8-15 seconds depending on the exact model, and I'm not seeing any divergences. Hope it's useful to someone else, in any case!

  real extended_hypergeometric_lpmf(int y, int mA, int mB, int n, real alpha) {
    // Calculate support
    int L = max(0, n - mB);
    int U = min(n, mA);

    // define range of possible x values
    int length_x = U - L + 1;
    array[length_x] int x = linspaced_int_array(length_x, L, U);

    // compute pmf
    int k = min(max(mA, mB), max(n, mA + mB - n));
    vector[k + 1] k_seq;
    for(i in 1:(k + 1)) k_seq[i] = i;
    vector[k + 1] lsum = lgamma(k_seq);

    array[length_x] int x_xplus1;
    array[length_x] int x_mAx1;
    array[length_x] int x_nx1;
    array[length_x] int x_mBnx1;

    for(i in 1:length_x) {
      x_xplus1[i] = x[i] + 1;
      x_mAx1[i] = mA - x[i] + 1;
      x_nx1[i] = n - x[i] + 1;
      x_mBnx1[i] = mB - n + x[i] + 1;
    }

    vector[length_x] g = lsum[x_xplus1] + lsum[x_mAx1] + lsum[x_nx1] + lsum[x_mBnx1];
    vector[length_x] t = (to_vector(x) * alpha) - g;
    vector[length_x] h1 = t - max(t);
    vector[length_x] h2 = h1 - log_sum_exp(h1);
    
    return h2[y - L + 1];
  }
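
In case it's not obvious from the code: lsum[i] = lgamma(i) = \log((i - 1)!), so lsum is just a lookup table of log factorials. Writing

\log \binom{m_1}{x} + \log \binom{m_2}{n - x} = \log m_1! + \log m_2! - \left[ \log x! + \log (m_1 - x)! + \log (n - x)! + \log (m_2 - n + x)! \right],

the \log m_1! + \log m_2! term is constant in x and cancels when the log probabilities are normalised, so only the bracketed part (the g vector in the code) ever needs to be computed.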

I’ve worked a bit with this. I’m not sure if you have a slightly different parameterization because I wasn’t able to view the linked article.

If I compare your implementation to BiasedUrn I get

library(BiasedUrn)
log(dFNCHypergeo(12, 25, 32, 20, 2.5))
# result -1.505209
prototaxites_extended_hypergeometric_lpmf(12, 25, 32, 20, 2.5)
# result -4.723675

Following wikipedia I tested this implementation

  real extended_hypergeometric_lpmf(int y, int m1, int m2, int n, real omega) {
    int y_min = max(0, n - m2);
    int y_max = min(n, m1);
    int length = y_max - y_min + 1;
    vector[length] P;
    real lomega = log(omega);
    int counter = 1;
    
    real ldist = lchoose(m1, y) + lchoose(m2, n - y) + y * lomega; 
    
    for (x in y_min:y_max) {
      P[counter] = lchoose(m1, x) + lchoose(m2, n - x) + x * lomega;
      counter += 1;
    }
  
    return ldist - log_sum_exp(P);
  }

Which returns the same value as the BiasedUrn package:

extended_hypergeometric_lpmf(12, 25, 32, 20, 2.5)
[1] -1.505209

Ah, I’ve parameterised this in terms of a log odds ratio, as my aim is to make inference on that scale (see A better index for analysis of co-occurrence and similarity):

x = extended_hypergeometric_lpmf(12 | 25, 32, 20, log(2.5));
# result -1.50521

I also tried programming the distribution directly following Wikipedia, but for some reason couldn't get it to match BiasedUrn, which is why I ended up with the form above. I'm not sure whether there's a reason to prefer one over the other for numerical stability.

Staying on the log scale will be best. In my version it's just a matter of removing the log(omega) call and passing lomega in directly.
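
For reference, that version would look something like this (the argument name lomega is just illustrative):

  real extended_hypergeometric_lpmf(int y, int m1, int m2, int n, real lomega) {
    int y_min = max(0, n - m2);
    int y_max = min(n, m1);
    int length = y_max - y_min + 1;
    vector[length] P;
    int counter = 1;

    // unnormalised log probability of the observed y, log odds ratio passed in directly
    real ldist = lchoose(m1, y) + lchoose(m2, n - y) + y * lomega;

    // unnormalised log probabilities over the whole support, for the normalising constant
    for (x in y_min:y_max) {
      P[counter] = lchoose(m1, x) + lchoose(m2, n - x) + x * lomega;
      counter += 1;
    }

    return ldist - log_sum_exp(P);
  }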

It's possible that your version is faster because it avoids all the calls to lchoose. I also see a few things we can do to reduce memory use: we can get rid of a bunch of arrays and just use local scalars in the loop.

  real extended_hypergeometric_opt_lpmf(int y, int mA, int mB, int n, real alpha) {
    // Calculate support
    int L = max(0, n - mB);
    int U = min(n, mA);

    // define range of possible x values
    int length_x = U - L + 1;
    array[length_x] int x = linspaced_int_array(length_x, L, U);

    // compute pmf
    int k = min(max(mA, mB), max(n, mA + mB - n));
    vector[k + 1] lsum = lgamma(linspaced_vector(k + 1, 1, k + 1));
    vector[length_x] t;

    for(i in 1:length_x) {
      int x_xplus1 = x[i] + 1;
      int x_mAx1 = mA - x[i] + 1;
      int x_nx1 = n - x[i] + 1;
      int x_mBnx1 = mB - n + x[i] + 1;

      real g = lsum[x_xplus1] + lsum[x_mAx1] + lsum[x_nx1] + lsum[x_mBnx1];
      t[i] = x[i] * alpha - g;
    }
    
    vector[length_x] h1 = t - max(t);
    return h1[y - L + 1] - log_sum_exp(h1);
  }

Comparing your version (on the log scale) with the optimised one on my dataset:

# your version
All 4 chains finished successfully.
Mean chain execution time: 4.5 seconds.
Total execution time: 4.6 seconds.
# optimised version
All 4 chains finished successfully.
Mean chain execution time: 7.8 seconds.
Total execution time: 7.9 seconds.

I'm not sure where the bottleneck might be, if there is one - I spent a while trying to eliminate the for loop, but only succeeded in slowing things down…!

Umm what happens if you change

// vector[k + 1] lsum = lgamma(linspaced_vector(k + 1, 1, k + 1));
vector[k + 1] lsum;
for(i in 1:(k + 1)) lsum[i] = lgamma(i);

Only very slightly faster:

All 4 chains finished successfully.
Mean chain execution time: 7.5 seconds.
Total execution time: 7.6 seconds.

Can you share your Stan program?

functions{
  real extended_hypergeometric_opt_lpmf(int y, int mA, int mB, int n, real alpha) {
    // Calculate support
    int L = max(0, n - mB);
    int U = min(n, mA);
    
    // define range of possible x values
    int length_x = U - L + 1;
    array[length_x] int x = linspaced_int_array(length_x, L, U);
    
    // compute pmf
    int k = min(max(mA, mB), max(n, mA + mB - n));
    vector[k + 1] lsum;
    for(i in 1:(k + 1)) lsum[i] = lgamma(i);
    
    vector[length_x] t;
    
    for(i in 1:length_x) {
      int x_xplus1 = x[i] + 1;
      int x_mAx1 = mA - x[i] + 1;
      int x_nx1 = n - x[i] + 1;
      int x_mBnx1 = mB - n + x[i] + 1;
      
      real g = lsum[x_xplus1] + lsum[x_mAx1] + lsum[x_nx1] + lsum[x_mBnx1];
      t[i] = x[i] * alpha - g;
    }
    
    vector[length_x] h1 = t - max(t);
    return h1[y - L + 1] - log_sum_exp(h1);
  }
}
data {
  // Counters
  int N;
  
  // Data
  array[N] int y;
  array[N] int mA;
  array[N] int mB;
  array[N] int n;
}
parameters {
  array[N] real alpha;
}
model {
  for(i in 1:N){
      target += normal_lpdf(alpha[i] | 0, 5);
      target += extended_hypergeometric_opt_lpmf(y[i] | mA[i], mB[i], n[i], alpha[i]);
  }
}

Would love for this to go into mainline Stan! We actually use this distribution as well, and had to implement an approximation to it in one of our models.


You can create a math issue. I think my implementation is slower in Stan because the autodiff is more optimized for @prototaxites' code. The memory usage in mine is lower, but that only matters for larger models. In stan-math we can write the derivatives by hand and probably get the best of both worlds.
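
For what it's worth, the gradient with respect to \alpha has a simple closed form that should make hand-written derivatives straightforward: because the pmf is an exponentially tilted hypergeometric,

\frac{\partial}{\partial \alpha} \log \Pr(X = y) = y - \textrm{E}[X],

where \textrm{E}[X] = \sum_{x = l}^u x \Pr(X = x) is the mean of the distribution, which can be computed from the same table of unnormalised log probabilities used for the normalising constant.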


Is it possible, in the future, to generalize it to the multivariate case (i.e. more than two colors)?