Biased urn problem & Wallenius' noncentral hypergeometric distribution

Sooo…
the aftermath of my previous topic led me to discover the existence (and usefulness) of Wallenius' noncentral hypergeometric distribution for tackling the biased urn problem.
I'm pinging @mathDR because he has done some work on it.
Is there any code available for this distribution?

2 Likes

Hey @fabio, there isn't a built-in distribution for this in Stan. I made a few attempts to code one up and landed on the following derivation. Wallenius__Noncentral_Hypergeometric_Distribution.pdf (131.8 KB)

The unnormalized probability mass function, up to the binomial coefficients $\prod_{i=1}^{c} \binom{m_i}{x_i}$ (which are constant in $\omega$ and so drop out for MCMC), is
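$$
P(X = x) \propto \int_0^1 \prod_{i=1}^{c} \left(1 - t^{\omega_i / D}\right)^{x_i} dt,
\qquad D = \sum_{i=1}^{c} \omega_i (m_i - x_i).
$$

Substituting $t = \tau^{rD}$ gives the equivalent integral $rD \int_0^1 \tau^{rD - 1} \prod_{i=1}^{c} \left(1 - \tau^{r \omega_i}\right)^{x_i} d\tau$, which the code approximates with a left-hand rule on the log scale: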

  real wallenius_lpmf(int[] x, int[] m, vector omega) {
    int c = size(m);
    int numpoints = 100;
    vector[c] vec_x = to_vector(x);  // number of balls of each category that were drawn
    vector[c] vec_m = to_vector(m);  // number of balls of each category in the urn

    real D = sum(omega .* (vec_m - vec_x));
    real r = 2.5 * D;  // this should be optimized using Agner Fog's derivation
    real dtau = 1.0 / numpoints;
    vector[c] r_times_omega = r * omega;

    vector[numpoints - 1] return_val;

    // special case: every ball has been drawn, so the probability is 1
    if (D == 0) {
      return 0.0;
    }

    for (i in 1:(numpoints - 1)) {
      real log_eye_dtau = log(i * dtau);
      return_val[i] = (r * D - 1.0) * log_eye_dtau
                      + sum(vec_x .* log1m_exp(r_times_omega * log_eye_dtau));
    }
    // left-hand rule on the log scale; the constant log(dtau) is dropped
    // because the pmf is unnormalized
    return log(r) + log(D) + log_sum_exp(return_val);
  }

This could probably be optimized some more (and I would like to add a wallenius_rng function, but haven’t needed it yet).

Please let me know if you find it useful and/or optimize it further! (Note: one could probably use algebra_solver_newton to find the optimal r in the function, whereas I just use r = 2.5*D.)
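For what it's worth, if I remember Fog's paper correctly, he picks r so that the transformed integrand $t^{rD-1} \prod_i (1 - t^{r\omega_i})^{x_i}$ peaks at $t = 0.5$. A rough, untested sketch of that condition as an algebra_solver_newton system (the name r_peak_system and the packing of D and the omegas into theta are mine; the counts would need to be passed in as real data):

  vector r_peak_system(vector r_vec, vector theta, real[] x_r, int[] x_i) {
    real r = r_vec[1];
    real D = theta[1];                 // theta = [D, omega_1, ..., omega_c]
    int c = num_elements(x_r);
    real resid = 2.0 * (r * D - 1.0);  // (rD - 1) / t evaluated at t = 0.5
    for (i in 1:c) {
      real ro = r * theta[i + 1];      // r * omega_i
      // x_i * r*omega_i * t^(r*omega_i - 1) / (1 - t^(r*omega_i)) at t = 0.5
      resid -= x_r[i] * ro * pow(0.5, ro - 1.0) / (1.0 - pow(0.5, ro));
    }
    return [resid]';
  }

with something like r = algebra_solver_newton(r_peak_system, [2.5 * D]', append_row([D]', omega), x_real, x_int)[1] as the call, using r = 2.5*D as the initial guess.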

Dan

2 Likes

Can you use the Fisher noncentral hypergeometric distribution?

1 Like

@spinkey No, the Fisher noncentral hypergeometric distribution describes a different process: Fisher's variant models simultaneous, independent draws conditioned on the total, while Wallenius' models sequential draws whose odds depend on what remains in the urn. For the difference see Noncentral_hypergeometric_distributions. That said, both distributions could be of interest to me.
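For a quick contrast (my summary of the linked page): Fisher's variant has the closed form

$$
P_{\text{Fisher}}(X = x \mid \textstyle\sum_i x_i = n) \propto \prod_{i=1}^{c} \binom{m_i}{x_i} \omega_i^{x_i},
$$

because the draws are simultaneous and independent given the total, whereas Wallenius' pmf needs the integral representation above, since each draw changes the odds for the next one.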

I noticed some typos in my writeup above. I will fix them and then re-upload.

1 Like

I played around and got integrate_1d to work for this distribution. No doubt this is much slower than @mathDR's approximation.

This is my first time using the integrate functionality, so that was cool! I'm sure there's a more numerically stable way to code it (@andrjohns or @bbbales2 may know a good way to use the xc parameter here; one possibility is sketched after the model below).

functions {
  real wallenius_integral(real t,          // function argument
                          real xc,         // complement of function argument
                                           //   on the domain (defined later)
                          real[] theta,    // parameters
                          real[] x_r,      // data (real)
                          int[] x_i) {     // data (integer)
    real omega = theta[1];
    real Dinv = 1 / theta[2];
    int k = x_i[1];
    int n = x_i[2];

    return pow(1 - t^(omega * Dinv), k) * pow(1 - t^Dinv, n - k);
  }

  // log of the binomial coefficients choose(m[1], k) and choose(m[2], n - k),
  // via choose(a, b) = 1 / ((a + 1) * beta(a - b + 1, b + 1))
  real wallenius_lpmf(int k, int n, int[] m) {
    return -log1p(m[1]) - lbeta(m[1] - k + 1, k + 1) -
            log1p(m[2]) - lbeta(m[2] - n + k + 1, n - k + 1);
  }
}
data {
  int<lower=0> N;
  int y[N];
  int n;
  int m[2];
  real tol;
}
transformed data {
  real x_r[0];
  int x_i[N, 2];
  
  for (i in 1:N) 
    x_i[i] = {y[i], n};
}
parameters {
  real<lower=0> odds;
}
model {
  odds ~ exponential(1);
  
  for (i in 1:N) {
   // D = odds * (red balls remaining) + 1 * (white balls remaining); the white weight is fixed at 1
   real D = odds * (m[1] - y[i]) + (m[2] - n + y[i]);
   y[i] ~ wallenius(n, m);
   target += log(integrate_1d(wallenius_integral, 0, 1,
                                   {odds, D},
                                   x_r,
                                   x_i[i],
                                   tol));
  }
}
generated quantities {
  real out = odds / (1 + odds);
}
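As an aside on the xc parameter mentioned above: one possibility (just a sketch, untested; it assumes integrate_1d supplies xc = 1 - t in the right half of [0, 1]) is to switch to an expm1/log1p form near the upper endpoint, where computing 1 - t^a directly loses precision:

real wallenius_integral_stable(real t, real xc, real[] theta, real[] x_r, int[] x_i) {
  real omega = theta[1];
  real Dinv = 1 / theta[2];
  int k = x_i[1];
  int n = x_i[2];
  if (t > 0.5) {
    // near t = 1, 1 - t^a = -expm1(a * log(t)) = -expm1(a * log1p(-xc)),
    // and log1p(-xc) keeps full precision where log(t) does not
    return pow(-expm1(omega * Dinv * log1p(-xc)), k)
           * pow(-expm1(Dinv * log1p(-xc)), n - k);
  }
  return pow(1 - t^(omega * Dinv), k) * pow(1 - t^Dinv, n - k);
}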

Here’s R code to generate data.

library(cmdstanr)
library(BiasedUrn)
fp <- file.path("../wallenius.stan")
mod <- cmdstan_model(fp, force_recompile = T)

# dWNCHypergeo(nrand, m1, m2, n, odds, precision=1E-7)
# m1: initial number of red balls in urn
# m2: initial number of white balls in urn
# n : total number of balls sampled
# odds: probability ratio of red over white 

# number of realizations
N <- 10

m1 <- 1000
m2 <- 1000
n <- 200

prob <- 0.8
odds <- prob / (1 - prob)
y <- rWNCHypergeo(N, m1, m2, n, odds)

mod_out <- mod$sample(
  data = list(N = N,
              y = y,
              n = n,
              m = c(m1, m2),
              tol = 0.1),
  chains = 2,
  init = 1,
  adapt_delta = 0.8,
  parallel_chains = 2,
  iter_warmup = 200,
  iter_sampling = 200
)

@andrjohns is there a way to call integrate_1d in the lpmf function? I was getting complaints about the inputs being only data.

2 Likes

OMG this is AWESOME! I am going to use this to check my approximation and report back! Maybe with ~250 points the left-hand rule will get pretty close?

That exponential prior is too strong for larger odds. I've changed it to

odds ~ gamma(2, 0.1);

Here’s the multivariate version

functions {
  real multi_wallenius_integral(real t,          // function argument
                                real xc,         // complement of function argument
                                                 //   on the domain (defined later)
                                real[] theta,    // parameters
                                real[] x_r,      // data (real)
                                int[] x_i) {     // data (integer)
    real Dinv = 1 / theta[1];
    int Cp1 = num_elements(x_i);
    real v = 1;

    // x_i[1] holds n; the category counts start at x_i[2]
    for (i in 2:Cp1)
      v *= pow(1 - t^(theta[i] * Dinv), x_i[i]);

    return v;
  }
  
  // log of the product of binomial coefficients choose(m[i], k[i])
  real multi_wallenius_lpmf(int[] k, int n, vector m) {
    int C = num_elements(m);
    real lp = 0;
    
    for (i in 1:C)
      lp += -log1p(m[i]) - lbeta(m[i] - k[i] + 1, k[i] + 1);
    
    return lp;
  }
}
data {
  int<lower=0> N;
  int<lower=0> C;
  int y[N, C + 1];
  int n;
  vector[C] m;
  real tol;
}
transformed data {
  real x_r[0];
  // int x_i[N, C + 1];
  // 
  // for (i in 2:N) 
  //   x_i[i] = append_array({n}, y[i]);

}
parameters {
  simplex[C] odds;
}
model {
  
  for (i in 1:N) {
   real D = to_row_vector(odds) * (m - to_vector(y[i, 2:C + 1]));
   y[i, 2:C + 1] ~ multi_wallenius(n, m);
   target += log(integrate_1d(
                    multi_wallenius_integral, 0, 1,
                     append_array({D}, to_array_1d(odds)),
                     x_r,
                     y[i],
                     tol));
  }

}

Corresponding R code:

### multi-wallenius
fp <- file.path("../multi_wallenius.stan")
mod <- cmdstan_model(fp, force_recompile = T)

N <- 10
m <- c(500, 1000, 800)
n <- 55

odds <- c(0.2, 0.7, 0.1)
y <- rMWNCHypergeo(N, m, n, odds)


meanMWNCHypergeo(m, n, odds, precision=1E-7)

mod_out <- mod$sample(
  data = list(N = N,
              C = length(m),
              y = cbind(n, t(y)),
              n = n,
              m = m,
              tol = 0.1),
  chains = 2,
  init = 1,
  adapt_delta = 0.8,
  parallel_chains = 2,
  iter_warmup = 200,
  iter_sampling = 200
)

1 Like

I fixed the typos and added some exposition. Note, there is working Stan code outlined in the paper.
wallenius_noncentral_stan_writeup.pdf (132.7 KB)

1 Like

How do the approximation's results compare to the full model? Are there biases, sampler issues, or anything else that you've noticed? Do you know when "r" may be an issue and the Fog derivation is necessary (what is that, btw)?

I’ve put my code at helpful_stan_functions/multi_wallenius_hypergeometric.stan at main · spinkney/helpful_stan_functions · GitHub

Example program at

R program to run it all at

I’d like to add the approximation (of course all credit attributed to you). Just want to put notes for when users should be cautious about using the approximation.

2 Likes

@spinkney did you run any examples with n closer to the total number of balls in the urn? I am getting some intialization errors when I increase n to like 2100.

Also, thanks for the question about what value of r should be used. I actually reformulated the approximation in terms of the untransformed integral. I will post that in a bit.

And the error is something like log(0) cannot evaluate likelihood, right?

I've been able to get rid of these errors and speed the code up by allowing reals instead of ints. Since these hypergeometric distributions work on ratios, I think it's OK to allow reals. When I work with large "urn" values, I divide everything by some amount (10, 100, etc.) and work on the smaller scale. The parameter values I'm interested in - the "weighted odds" - stay the same, but the code runs much, much faster.

For example, opening up the pmf to a pdf and letting the lbeta functions “interpolate” across the integers gives:

functions {
  real multi_wallenius_integral(real t,          // function argument
                                real xc,         // complement of function argument
                                                 //   on the domain (defined later)
                                real[] theta,    // parameters
                                real[] x_r,      // data (real)
                                int[] x_i) {     // data (integer)
    real Dinv = 1 / theta[1];
    int Cp1 = num_elements(x_r);
    real v = 1;

    // x_r[1] holds n; the category counts start at x_r[2]
    for (i in 2:Cp1)
      v *= pow(1 - t^(theta[i] * Dinv), x_r[i]);

    return v;
  }
  
  real multi_walleniusc_lpdf(data real[] k, vector m, vector p,
                            data int[] x_i, data real tol) {
    int C = num_elements(m);
    real D = dot_product(to_row_vector(p), (m - to_vector(k[2:C + 1])));
    real lp = log(integrate_1d(
                    multi_wallenius_integral, 0, 1,
                     append_array({D}, to_array_1d(p)),
                     k,
                     x_i,
                     tol));
    
    for (i in 1:C)
      lp += -log1p(m[i]) - lbeta(m[i] - k[i + 1] + 1, k[i + 1] + 1);
      
    return lp;
  }
}
data {
  int<lower=0> N;
  int<lower=0> C;
  real y[N, C + 1];
  vector[C] m;
  real tol;
}
transformed data {
  real x_r[0];
  int x_i[0];
}
parameters {
  simplex[C] probs;
}
model {
  // for (i in 1:N)
  //   y[i] ~ multi_wallenius(m, probs, x_r, tol);
  for (i in 1:N)
    y[i] ~ multi_walleniusc(m, probs, x_i, tol);
}

To test

fp <- file.path("../multi_walleniusc.stan")
mod <- cmdstan_model(fp, force_recompile = T)

N <- 20
m <- c(2525, 1333, 888)
n <- 4234

odds <- c(0.2, 0.7, 0.1)
y <- rMWNCHypergeo(N, m, n, odds)
meanMWNCHypergeo(m, n, odds, precision=1E-7)

mod_out <- mod$sample(
  data = list(N = N,
              C = length(m),
              y = cbind(n / 100, t(y / 100)),
              m = m / 100,
              tol = 0.01),
  chains = 2,
  init = 1,
  adapt_delta = 0.8,
  parallel_chains = 2,
  iter_warmup = 200,
  iter_sampling = 200
)

mod_out

Oh, I really like that idea! I asked about using high draw counts because if you know a priori that n_draws < min(m_i), I would think to model the likelihood with a standard multinomial (and just assume the individual draws are with replacement). I guess there is some ambiguity, because the noncentral distributions' bias depends on the number of balls in the urn, so the draw probabilities should decrement as you remove balls (while the weights don't change), but "in real life" I don't know how important that fact is.
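For concreteness, the with-replacement approximation I have in mind is just a multinomial whose probabilities are the weighted urn proportions. A minimal sketch (my variable names; it assumes y[i] holds the C counts and omega is the weight vector):

data {
  int<lower=0> N;
  int<lower=1> C;
  int y[N, C];            // category counts for each realization
  vector<lower=0>[C] m;   // initial number of balls of each category
}
parameters {
  simplex[C] omega;       // relative weights
}
model {
  // with replacement, each draw lands in category i with probability
  // proportional to m[i] * omega[i]
  vector[C] p = (m .* omega) / sum(m .* omega);
  for (i in 1:N)
    y[i] ~ multinomial(p);
}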

In particular, for "most" problems where I would apply this, I would basically just want to use a categorical likelihood for each draw, and then have the likelihood's dimension vary as the colors run out. I tried going down that path, but when I found the noncentral hypergeometric distributions, I stopped that branch.
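(For the record, the sequential version I was imagining looks something like this sketch - my naming, with c[j] the color of the j-th draw in order, so it requires observing the draw order:

model {
  vector[C] remaining = m;  // balls of each color still in the urn
  for (j in 1:n_draws) {
    // the j-th draw picks color i with probability proportional to
    // (balls of color i remaining) * omega[i]
    vector[C] p = (remaining .* omega) / sum(remaining .* omega);
    c[j] ~ categorical(p);
    remaining[c[j]] -= 1;
  }
}

The per-draw decrement is exactly the weighted sampling process that generates Wallenius' distribution, so this only helps when the draw order is actually observed.)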