Hypergeometric posterior

Hi all!

I am trying to estimate the posterior distribution of the total success count of a hypergeometric distribution.

I have 39 farms (the population), from which I draw a random sample of 30 farms without replacement. I observe 15 successes and want to quantify my uncertainty about the total number of successes in the population (a).

  • Since Stan doesn’t support discrete parameters, I left the parameters block empty and did the log-density manipulation in the transformed parameters block. Does this approach sound good to you? I am new to Stan and marginalization, so I am not sure I am doing this correctly. Or do I need a continuous parameter anyway?
  • I am using a uniform prior; are there any other priors recommended for the hypergeometric distribution?
data {
    int<lower=1> pop;                   // Population
    int<lower=1, upper=pop> ss;         // Sample size
    int<lower=0, upper=ss> k1;          // Number of successes observed
}
transformed data {
    // The maximum number of successes is the population minus observed failures
    int<lower=0, upper=pop> max_a = pop-(ss-k1);  
    // The number of successes can't be less than what observed in the sample
    int<lower=0, upper=ss> min_a = k1;       
    // Number of possible values a can take
    int<lower=1, upper=pop> len_a = max_a-min_a+1;
}
parameters {
}
transformed parameters {
    vector[pop+1] lp;  // We add 1 to include zero successes
    for(a in 0:pop){
        if (a < min_a)
            lp[a+1] = negative_infinity();
        else if (a > max_a)
            lp[a+1] = negative_infinity();
        else
            lp[a+1] = hypergeometric_lpmf(k1 | ss, a, pop-a) + log(1.0/len_a);
    }
}
model {
    target += log_sum_exp(lp);
}
generated quantities{
  int i;
  simplex[pop+1] tp;
  tp = softmax(lp);
  i = categorical_rng(tp);
}
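For reference, the marginalisation the program above performs can be reproduced outside Stan. Here is a small Python sketch (standard library only; not part of the Stan workflow, just a cross-check) that computes the same posterior over a under the uniform prior:

```python
from math import comb

def posterior(pop, ss, k1):
    """Exact posterior over the total success count a, uniform prior on the support."""
    # a is at least the successes seen, and at most pop minus the failures seen
    min_a, max_a = k1, pop - (ss - k1)
    # Hypergeometric likelihood P(k1 | ss, a, pop - a) for each feasible a
    lik = {a: comb(a, k1) * comb(pop - a, ss - k1) / comb(pop, ss)
           for a in range(min_a, max_a + 1)}
    z = sum(lik.values())  # uniform prior cancels, so just normalise the likelihood
    return {a: p / z for a, p in lik.items()}

post = posterior(39, 30, 15)
print(round(post[20], 3))  # → 0.22, matching tp[21] in the fit below
```

The values agree with the `tp` entries in the summary output (e.g. P(a = 15) ≈ 4.8e-3, P(a = 19) = P(a = 20) ≈ 0.22).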

Example data:

  • pop = 39
  • ss = 30
  • k1 = 15

In case it helps, here is the output of the model for pop=39, ss=30, k1=15:

  • tp[1] is (I think) the posterior probability of a = 0
  • tp[15] is the posterior probability of a = 14
  • and so on…
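A quick sanity check on this index mapping: with tp[j] the probability of a = j - 1, the exact posterior mean of a should equal mean(i) - 1 from the summary below. By the symmetry of the posterior over [15, 24] that mean is 19.5, consistent with mean(i) = 20.49 up to Monte Carlo error. A small Python check (standard library only):

```python
from math import comb

pop, ss, k1 = 39, 30, 15
min_a, max_a = k1, pop - (ss - k1)
# Unnormalised posterior weights under the uniform prior
w = {a: comb(a, k1) * comb(pop - a, ss - k1) for a in range(min_a, max_a + 1)}
z = sum(w.values())
mean_a = sum(a * v for a, v in w.items()) / z
print(mean_a)  # → 19.5, i.e. mean(i) - 1
```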

Inference for Stan model: anon_model_269b3e5d623b621b4f88db6edb173cfd.
4 chains, each with iter=10000; warmup=0; thin=1;
post-warmup draws per chain=10000, total post-warmup draws=40000.

         mean  se_mean      sd    2.5%     25%     50%     75%   97.5%  n_eff  Rhat
lp[1]    -inf      nan     nan    -inf    -inf    -inf    -inf    -inf    nan   nan
  ...    (lp[2] through lp[15]: identical to lp[1], all -inf)
lp[16]  -7.39      0.0     0.0   -7.39   -7.39   -7.39   -7.39   -7.39      2   1.0
lp[17]   -5.6  1.3e-15 1.8e-15    -5.6    -5.6    -5.6    -5.6    -5.6      2   1.0
lp[18]  -4.51  6.3e-16 8.9e-16   -4.51   -4.51   -4.51   -4.51   -4.51      2   1.0
lp[19]  -3.87  3.1e-16 4.4e-16   -3.87   -3.87   -3.87   -3.87   -3.87      2   1.0
lp[20]  -3.56  9.4e-16 1.3e-15   -3.56   -3.56   -3.56   -3.56   -3.56      2   1.0
lp[21]  -3.56  9.4e-16 1.3e-15   -3.56   -3.56   -3.56   -3.56   -3.56      2   1.0
lp[22]  -3.87  3.1e-16 4.4e-16   -3.87   -3.87   -3.87   -3.87   -3.87      2   1.0
lp[23]  -4.51  6.3e-16 8.9e-16   -4.51   -4.51   -4.51   -4.51   -4.51      2   1.0
lp[24]   -5.6  1.3e-15 1.8e-15    -5.6    -5.6    -5.6    -5.6    -5.6      2   1.0
lp[25]  -7.39      0.0     0.0   -7.39   -7.39   -7.39   -7.39   -7.39      2   1.0
  ...    (lp[26] through lp[40]: identical to lp[1], all -inf)
i       20.49   8.3e-3    1.67    17.0    19.0    20.0    22.0    24.0  40190   1.0
tp[1]     0.0      nan     0.0     0.0     0.0     0.0     0.0     0.0    nan   nan
  ...    (tp[2] through tp[15]: identical to tp[1], all zero)
tp[16] 4.8e-3  1.2e-18 1.7e-18  4.8e-3  4.8e-3  4.8e-3  4.8e-3  4.8e-3      2   1.0
tp[17]   0.03  1.2e-17 1.7e-17    0.03    0.03    0.03    0.03    0.03      2   1.0
tp[18]   0.08  9.8e-18 1.4e-17    0.08    0.08    0.08    0.08    0.08      2   1.0
tp[19]   0.16  3.9e-17 5.6e-17    0.16    0.16    0.16    0.16    0.16      2   1.0
tp[20]   0.22  3.9e-17 5.6e-17    0.22    0.22    0.22    0.22    0.22      2   1.0
tp[21]   0.22  3.9e-17 5.6e-17    0.22    0.22    0.22    0.22    0.22      2   1.0
tp[22]   0.16  3.9e-17 5.6e-17    0.16    0.16    0.16    0.16    0.16      2   1.0
tp[23]   0.08  9.8e-18 1.4e-17    0.08    0.08    0.08    0.08    0.08      2   1.0
tp[24]   0.03  1.2e-17 1.7e-17    0.03    0.03    0.03    0.03    0.03      2   1.0
tp[25] 4.8e-3  1.2e-18 1.7e-18  4.8e-3  4.8e-3  4.8e-3  4.8e-3  4.8e-3      2   1.0
  ...    (tp[26] through tp[40]: identical to tp[1], all zero)
lp__      0.0      nan     0.0     0.0     0.0     0.0     0.0     0.0    nan   nan

Samples were drawn using Fixed_param at Sun Aug 16 20:40:50 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Hey,

Yes, your approach is correct, and quite clever at that! Essentially you are marginalising over K under a uniform prior on the “interval” I = [k, N-(n-k)]. I made a couple of changes to your program so you don’t have to loop over the whole range 0:N, since the probability is positive only over I.

Stan program:

data {
  int<lower=1> pop;                   // Population
  int<lower=1, upper=pop> ss;         // Sample size
  int<lower=0, upper=ss> k1;          // Number of successes observed
}
transformed data {
  // The maximum number of successes is the population minus observed failures
  int<lower=0, upper=pop> max_a = pop - (ss - k1);
  // The number of successes can't be less than what was observed in the sample
  int<lower=0, upper=ss> min_a = k1;
  // Number of possible values a can take
  int<lower=1, upper=pop> len_a = max_a - min_a + 1;
  int<lower=min_a, upper=max_a> marginalisation_array[len_a];
  for (a in 1:len_a) marginalisation_array[a] = min_a + a - 1;
}
parameters {
}
transformed parameters {
  vector[len_a] lp;
  for (a in 1:len_a) {
    lp[a] = hypergeometric_lpmf(k1 | ss, marginalisation_array[a], pop - marginalisation_array[a]) - log(len_a);
  }
}
model {
  target += log_sum_exp(lp);
}
generated quantities {
  int i;
  simplex[len_a] tp;
  tp = softmax(lp);
  i = categorical_rng(tp);
}
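A quick way to convince yourself the restricted loop is equivalent to the original 0:N loop: the terms dropped outside I have exactly zero probability, so both versions normalise to the same posterior. A Python sketch of that check (standard library only):

```python
from math import comb

pop, ss, k1 = 39, 30, 15
min_a, max_a = k1, pop - (ss - k1)

# Full-support version (the original program): zero weight outside [min_a, max_a]
full = [comb(a, k1) * comb(pop - a, ss - k1) if min_a <= a <= max_a else 0
        for a in range(pop + 1)]
# Restricted version (this program): loop only over the interval I
short = [comb(a, k1) * comb(pop - a, ss - k1) for a in range(min_a, max_a + 1)]

# Both normalise to the same posterior
post_full = [p / sum(full) for p in full if p > 0]
post_short = [p / sum(short) for p in short]
assert post_full == post_short
```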

Running

library(rstan)
rstan_options(auto_write = TRUE)

hypergeo_uniform <- stan_model("hypergeo_inference_for_K_marginalisation.stan")

farm.data <- list(
  pop=39,
  ss=30,
  k1=15
)

mcmc <- sampling(hypergeo_uniform, data = farm.data, algorithm = "Fixed_param", chains = 4)
mcmc
K.posterior <- extract(mcmc, 'i')$i

# i = 1 corresponds to a = min_a = k1, so the draw maps to K = i + k1 - 1
hist(K.posterior + farm.data$k1 - 1, xlab = expression(K), main = "Total number of successes")

Gives a histogram of the posterior for the total number of successes K (figure not reproduced here).

Things to do include:

  • Checking to see if I didn’t screw up the indexing in my attempt at optimising the code;
  • Maybe testing what happens when you place a Poisson prior on K, truncated over I.
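The truncated-Poisson idea in the last bullet is easy to prototype outside Stan before writing the program. A hedged Python sketch (standard library only; the rate lam = 20 is an arbitrary illustration, not a recommendation) — the truncation constant of the prior cancels on normalisation:

```python
from math import comb, exp, factorial

def poisson_posterior(pop, ss, k1, lam):
    """Posterior over a with a Poisson(lam) prior truncated to I = [k1, pop-(ss-k1)]."""
    min_a, max_a = k1, pop - (ss - k1)
    w = {}
    for a in range(min_a, max_a + 1):
        prior = exp(-lam) * lam ** a / factorial(a)  # truncation constant cancels below
        lik = comb(a, k1) * comb(pop - a, ss - k1)   # hypergeometric likelihood, up to a constant
        w[a] = prior * lik
    z = sum(w.values())
    return {a: v / z for a, v in w.items()}

post = poisson_posterior(39, 30, 15, lam=20.0)
```

In Stan this would amount to replacing `-log(len_a)` with `poisson_lpmf(marginalisation_array[a] | lam)` inside the loop (the truncation constant is shared across all terms, so it drops out of the normalised posterior).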

Hope this helps.


Thank you so much! This is exactly what I needed.