Biased urn problem & Wallenius' noncentral hypergeometric distribution

Sooo…
the aftermath of my previous topic led me to discover the existence (and usefulness) of Wallenius' noncentral hypergeometric distribution for tackling the biased urn problem.
I'm pinging @mathDR because he has done some work on it.
Is there any code available for this distribution?

2 Likes

Hey @fabio, there isn't a built-in distribution for this in Stan. I made a few attempts to code one up and landed on the following derivation. Wallenius__Noncentral_Hypergeometric_Distribution.pdf (131.8 KB)

The unnormalized probability mass function, up to the binomial coefficients $\prod_{i=1}^{c} \binom{m_i}{x_i}$ (which are constant in $\omega$ and so drop out for MCMC), is
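$$
P(X = x) \propto \int_0^1 \prod_{i=1}^{c} \left(1 - t^{\omega_i / D}\right)^{x_i} dt,
\qquad D = \sum_{i=1}^{c} \omega_i (m_i - x_i).
$$

Substituting $t = \tau^{rD}$ gives the equivalent integral $rD \int_0^1 \tau^{rD - 1} \prod_{i=1}^{c} \left(1 - \tau^{r \omega_i}\right)^{x_i} d\tau$, which the code approximates with a left-hand rule on the log scale: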

  real wallenius_lpmf(int[] x, int[] m, vector omega) {
    int c = size(m);
    int numpoints = 100;
    vector[c] vec_x = to_vector(x);  // number of balls of each category that were drawn
    vector[c] vec_m = to_vector(m);  // number of balls of each category in the urn

    real D = sum(omega .* (vec_m - vec_x));
    real r = 2.5 * D;  // this should be optimized using Agner Fog's derivation
    real dtau = 1.0 / numpoints;
    vector[c] r_times_omega = r * omega;

    vector[numpoints - 1] return_val;

    // special case: every ball has been drawn, so the probability is 1
    if (D == 0) {
      return 0.0;
    }

    for (i in 1:(numpoints - 1)) {
      real log_eye_dtau = log(i * dtau);
      return_val[i] = (r * D - 1.0) * log_eye_dtau
                      + sum(vec_x .* log1m_exp(r_times_omega * log_eye_dtau));
    }
    // left-hand rule on the log scale; the constant log(dtau) is dropped
    // because the pmf is unnormalized
    return log(r) + log(D) + log_sum_exp(return_val);
  }

This could probably be optimized some more (and I would like to add a wallenius_rng function, but haven’t needed it yet).

Please let me know if you find it useful and/or optimize it further! (Note: one could probably use algebra_solver_newton to find the optimal r in the function, whereas I just use r = 2.5*D.)
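For what it's worth, if I remember Fog's paper correctly, he picks r so that the transformed integrand $t^{rD-1} \prod_i (1 - t^{r\omega_i})^{x_i}$ peaks at $t = 0.5$. A rough, untested sketch of that condition as an algebra_solver_newton system (the name r_peak_system and the packing of D and the omegas into theta are mine; the counts would need to be passed in as real data):

  vector r_peak_system(vector r_vec, vector theta, real[] x_r, int[] x_i) {
    real r = r_vec[1];
    real D = theta[1];                 // theta = [D, omega_1, ..., omega_c]
    int c = num_elements(x_r);
    real resid = 2.0 * (r * D - 1.0);  // (rD - 1) / t evaluated at t = 0.5
    for (i in 1:c) {
      real ro = r * theta[i + 1];      // r * omega_i
      // x_i * r*omega_i * t^(r*omega_i - 1) / (1 - t^(r*omega_i)) at t = 0.5
      resid -= x_r[i] * ro * pow(0.5, ro - 1.0) / (1.0 - pow(0.5, ro));
    }
    return [resid]';
  }

with something like r = algebra_solver_newton(r_peak_system, [2.5 * D]', append_row([D]', omega), x_real, x_int)[1] as the call, using r = 2.5*D as the initial guess.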

Dan

2 Likes

Can you use the Fisher noncentral hypergeometric distribution?

1 Like

@spinkey No, the Fisher noncentral hypergeometric distribution describes a different process: Fisher's variant models simultaneous, independent draws conditioned on the total, while Wallenius' models sequential draws whose odds depend on what remains in the urn. For the difference see Noncentral_hypergeometric_distributions. That said, both distributions could be of interest to me.
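For a quick contrast (my summary of the linked page): Fisher's variant has the closed form

$$
P_{\text{Fisher}}(X = x \mid \textstyle\sum_i x_i = n) \propto \prod_{i=1}^{c} \binom{m_i}{x_i} \omega_i^{x_i},
$$

because the draws are simultaneous and independent given the total, whereas Wallenius' pmf needs the integral representation above, since each draw changes the odds for the next one.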

I noticed some typos in my writeup above. I will fix them and then re-upload.

1 Like

I played around and got integrate_1d to work for this distribution. No doubt this is much slower than @mathDR's approximation.

This is my first time using the integrate functionality, so that was cool! I'm sure there's a more numerically stable way to code it (@andrjohns or @bbbales2 may know a good way to use the xc parameter here; one possibility is sketched after the model below).

functions {
  real wallenius_integral(real t,          // function argument
                          real xc,         // complement of function argument
                                           //   on the domain (defined later)
                          real[] theta,    // parameters
                          real[] x_r,      // data (real)
                          int[] x_i) {     // data (integer)
    real omega = theta[1];
    real Dinv = 1 / theta[2];
    int k = x_i[1];
    int n = x_i[2];

    return pow(1 - t^(omega * Dinv), k) * pow(1 - t^Dinv, n - k);
  }

  // log of the binomial coefficients choose(m[1], k) and choose(m[2], n - k),
  // via choose(a, b) = 1 / ((a + 1) * beta(a - b + 1, b + 1))
  real wallenius_lpmf(int k, int n, int[] m) {
    return -log1p(m[1]) - lbeta(m[1] - k + 1, k + 1) -
            log1p(m[2]) - lbeta(m[2] - n + k + 1, n - k + 1);
  }
}
data {
  int<lower=0> N;
  int y[N];
  int n;
  int m[2];
  real tol;
}
transformed data {
  real x_r[0];
  int x_i[N, 2];
  
  for (i in 1:N) 
    x_i[i] = {y[i], n};
}
parameters {
  real<lower=0> odds;
}
model {
  odds ~ exponential(1);
  
  for (i in 1:N) {
   // D = odds * (red balls remaining) + 1 * (white balls remaining); the white weight is fixed at 1
   real D = odds * (m[1] - y[i]) + (m[2] - n + y[i]);
   y[i] ~ wallenius(n, m);
   target += log(integrate_1d(wallenius_integral, 0, 1,
                                   {odds, D},
                                   x_r,
                                   x_i[i],
                                   tol));
  }
}
generated quantities {
  real out = odds / (1 + odds);
}
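As an aside on the xc parameter mentioned above: one possibility (just a sketch, untested; it assumes integrate_1d supplies xc = 1 - t in the right half of [0, 1]) is to switch to an expm1/log1p form near the upper endpoint, where computing 1 - t^a directly loses precision:

real wallenius_integral_stable(real t, real xc, real[] theta, real[] x_r, int[] x_i) {
  real omega = theta[1];
  real Dinv = 1 / theta[2];
  int k = x_i[1];
  int n = x_i[2];
  if (t > 0.5) {
    // near t = 1, 1 - t^a = -expm1(a * log(t)) = -expm1(a * log1p(-xc)),
    // and log1p(-xc) keeps full precision where log(t) does not
    return pow(-expm1(omega * Dinv * log1p(-xc)), k)
           * pow(-expm1(Dinv * log1p(-xc)), n - k);
  }
  return pow(1 - t^(omega * Dinv), k) * pow(1 - t^Dinv, n - k);
}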

Here’s R code to generate data.

library(cmdstanr)
library(BiasedUrn)
fp <- file.path("../wallenius.stan")
mod <- cmdstan_model(fp, force_recompile = T)

# dWNCHypergeo(nrand, m1, m2, n, odds, precision=1E-7)
# m1: initial number of red balls in urn
# m2: initial number of white balls in urn
# n : total number of balls sampled
# odds: probability ratio of red over white 

# number of realizations
N <- 10

m1 <- 1000
m2 <- 1000
n <- 200

prob <- 0.8
odds <- prob / (1 - prob)
y <- rWNCHypergeo(N, m1, m2, n, odds)

mod_out <- mod$sample(
  data = list(N = N,
              y = y,
              n = n,
              m = c(m1, m2),
              tol = 0.1),
  chains = 2,
  init = 1,
  adapt_delta = 0.8,
  parallel_chains = 2,
  iter_warmup = 200,
  iter_sampling = 200
)

@andrjohns is there a way to call integrate_1d in the lpmf function? I was getting complaints about the inputs being only data.

2 Likes

OMG this is AWESOME! I am going to use this to check my approximation and report back! Maybe with ~250 points the left-hand rule will get pretty close?

That exponential prior is too strong for larger odds. I've changed it to

odds ~ gamma(2, 0.1);

Here’s the multivariate version

functions {
  real multi_wallenius_integral(real t,          // function argument
                                real xc,         // complement of function argument
                                                 //   on the domain (defined later)
                                real[] theta,    // parameters
                                real[] x_r,      // data (real)
                                int[] x_i) {     // data (integer)
    real Dinv = 1 / theta[1];
    int Cp1 = num_elements(x_i);
    real v = 1;

    // x_i[1] holds n; the category counts start at x_i[2]
    for (i in 2:Cp1)
      v *= pow(1 - t^(theta[i] * Dinv), x_i[i]);

    return v;
  }
  
  // log of the product of binomial coefficients choose(m[i], k[i])
  real multi_wallenius_lpmf(int[] k, int n, vector m) {
    int C = num_elements(m);
    real lp = 0;
    
    for (i in 1:C)
      lp += -log1p(m[i]) - lbeta(m[i] - k[i] + 1, k[i] + 1);
    
    return lp;
  }
}
data {
  int<lower=0> N;
  int<lower=0> C;
  int y[N, C + 1];
  int n;
  vector[C] m;
  real tol;
}
transformed data {
  real x_r[0];
  // int x_i[N, C + 1];
  // 
  // for (i in 2:N) 
  //   x_i[i] = append_array({n}, y[i]);

}
parameters {
  simplex[C] odds;
}
model {
  
  for (i in 1:N) {
   real D = to_row_vector(odds) * (m - to_vector(y[i, 2:C + 1]));
   y[i, 2:C + 1] ~ multi_wallenius(n, m);
   target += log(integrate_1d(
                    multi_wallenius_integral, 0, 1,
                     append_array({D}, to_array_1d(odds)),
                     x_r,
                     y[i],
                     tol));
  }

}

Corresponding R code:

### multi-wallenius
fp <- file.path("../multi_wallenius.stan")
mod <- cmdstan_model(fp, force_recompile = T)

N <- 10
m <- c(500, 1000, 800)
n <- 55

odds <- c(0.2, 0.7, 0.1)
y <- rMWNCHypergeo(N, m, n, odds)


meanMWNCHypergeo(m, n, odds, precision=1E-7)

mod_out <- mod$sample(
  data = list(N = N,
              C = length(m),
              y = cbind(n, t(y)),
              n = n,
              m = m,
              tol = 0.1),
  chains = 2,
  init = 1,
  adapt_delta = 0.8,
  parallel_chains = 2,
  iter_warmup = 200,
  iter_sampling = 200
)

1 Like

I fixed the typos and added some exposition. Note, there is working Stan code outlined in the paper.
wallenius_noncentral_stan_writeup.pdf (132.7 KB)

1 Like

How do the approximation's results compare to the full model? Are there biases, sampler issues, or anything else that you've noticed? Do you know when "r" may be an issue and the Fog derivation is necessary (what is that, btw)?

I’ve put my code at helpful_stan_functions/multi_wallenius_hypergeometric.stan at main · spinkney/helpful_stan_functions · GitHub

Example program at

R program to run it all at

I’d like to add the approximation (of course all credit attributed to you). Just want to put notes for when users should be cautious about using the approximation.

2 Likes

@spinkney did you run any examples with n closer to the total number of balls in the urn? I am getting some intialization errors when I increase n to like 2100.

Also, thanks for the question about what value of r should be used. I actually reformulated the approximation in terms of the untransformed integral. I will post that in a bit.

And the error is something like log(0) cannot evaluate likelihood, right?

I've been able to get rid of these errors and speed the code up by allowing reals instead of ints. Since these hypergeometric distributions work on ratios, I think it's OK to allow reals. When I work with large "urn" values, I divide everything by some amount (10, 100, etc.) and work on the smaller scale. The parameter values I'm interested in - the "weighted odds" - stay the same, but the code runs much, much faster.

For example, opening up the pmf to a pdf and letting the lbeta functions “interpolate” across the integers gives:

functions {
  real multi_wallenius_integral(real t,          // function argument
                                real xc,         // complement of function argument
                                                 //   on the domain (defined later)
                                real[] theta,    // parameters
                                real[] x_r,      // data (real)
                                int[] x_i) {     // data (integer)
    real Dinv = 1 / theta[1];
    int Cp1 = num_elements(x_r);
    real v = 1;

    // x_r[1] holds n; the category counts start at x_r[2]
    for (i in 2:Cp1)
      v *= pow(1 - t^(theta[i] * Dinv), x_r[i]);

    return v;
  }
  
  real multi_walleniusc_lpdf(data real[] k, vector m, vector p,
                            data int[] x_i, data real tol) {
    int C = num_elements(m);
    real D = dot_product(to_row_vector(p), (m - to_vector(k[2:C + 1])));
    real lp = log(integrate_1d(
                    multi_wallenius_integral, 0, 1,
                     append_array({D}, to_array_1d(p)),
                     k,
                     x_i,
                     tol));
    
    for (i in 1:C)
      lp += -log1p(m[i]) - lbeta(m[i] - k[i + 1] + 1, k[i + 1] + 1);
      
    return lp;
  }
}
data {
  int<lower=0> N;
  int<lower=0> C;
  real y[N, C + 1];
  vector[C] m;
  real tol;
}
transformed data {
  real x_r[0];
  int x_i[0];
}
parameters {
  simplex[C] probs;
}
model {
  // for (i in 1:N)
  //   y[i] ~ multi_wallenius(m, probs, x_r, tol);
  for (i in 1:N)
    y[i] ~ multi_walleniusc(m, probs, x_i, tol);
}

To test

fp <- file.path("../multi_walleniusc.stan")
mod <- cmdstan_model(fp, force_recompile = T)

N <- 20
m <- c(2525, 1333, 888)
n <- 4234

odds <- c(0.2, 0.7, 0.1)
y <- rMWNCHypergeo(N, m, n, odds)
meanMWNCHypergeo(m, n, odds, precision=1E-7)

mod_out <- mod$sample(
  data = list(N = N,
              C = length(m),
              y = cbind(n / 100, t(y / 100)),
              m = m / 100,
              tol = 0.01),
  chains = 2,
  init = 1,
  adapt_delta = 0.8,
  parallel_chains = 2,
  iter_warmup = 200,
  iter_sampling = 200
)

mod_out

Oh, I really like that idea! I asked about using high draw counts because if you know a priori that n_draws < min(m_i), I would think to model the likelihood with a standard multinomial (and just assume the individual draws are with replacement). I guess there is some ambiguity, because the noncentral distributions' bias depends on the number of balls in the urn, so the draw probabilities should decrement as you remove balls (while the weights don't change), but "in real life" I don't know how important that fact is.
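For concreteness, the with-replacement approximation I have in mind is just a multinomial whose probabilities are the weighted urn proportions. A minimal sketch (my variable names; it assumes y[i] holds the C counts and omega is the weight vector):

data {
  int<lower=0> N;
  int<lower=1> C;
  int y[N, C];            // category counts for each realization
  vector<lower=0>[C] m;   // initial number of balls of each category
}
parameters {
  simplex[C] omega;       // relative weights
}
model {
  // with replacement, each draw lands in category i with probability
  // proportional to m[i] * omega[i]
  vector[C] p = (m .* omega) / sum(m .* omega);
  for (i in 1:N)
    y[i] ~ multinomial(p);
}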

In particular, for "most" problems where I would apply this, I would basically just want to use a categorical likelihood for each draw, and then have the likelihood's dimension vary as the colors run out. I tried going down that path, but when I found the noncentral hypergeometric distributions, I stopped that branch.
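(For the record, the sequential version I was imagining looks something like this sketch - my naming, with c[j] the color of the j-th draw in order, so it requires observing the draw order:

model {
  vector[C] remaining = m;  // balls of each color still in the urn
  for (j in 1:n_draws) {
    // the j-th draw picks color i with probability proportional to
    // (balls of color i remaining) * omega[i]
    vector[C] p = (remaining .* omega) / sum(remaining .* omega);
    c[j] ~ categorical(p);
    remaining[c[j]] -= 1;
  }
}

The per-draw decrement is exactly the weighted sampling process that generates Wallenius' distribution, so this only helps when the draw order is actually observed.)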