How to estimate the Poisson parameter after seeing max(n samples)

Hi!

I have a process that draws a fixed number (n) of samples from a Poisson distribution with unknown lambda.

What I observe is a maximum (or in general any other order statistic) of those n draws.

How do I infer the lambda parameter?

In more general terms, can Stan express the distribution of the k-th order statistic and use it to infer the parameters?

You can apply the order-statistic formula to get the density/likelihood of the maximum.
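For a continuous distribution with density f and CDF F, the density of the maximum of n i.i.d. draws is

f_{(n)}(y) = n \, F(y)^{n - 1} \, f(y).

Taking logs gives the target increment below, which you could choose to model in Stan as something like: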

data {
  int<lower=1> n_obs;
  int<lower=0> observed_max;
}

parameters {
  real<lower=0> lambda;
}

model {
  target += log(n_obs) + (n_obs - 1) * poisson_lcdf(observed_max | lambda) +
    poisson_lpmf(observed_max | lambda);
}

That’s a very good lead, thank you.

Here’s my code, which I have numerically checked to be consistent with yours, but which can be applied to any order statistic.
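It uses the general formula for the density of the k-th order statistic of n i.i.d. draws with density f and CDF F,

f_{(k)}(y) = \frac{n!}{(k-1)!\,(n-k)!} \, F(y)^{k-1} \, (1 - F(y))^{n-k} \, f(y),

whose log is the lgamma/lcdf/lccdf expression in the model block: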

data {
  int<lower=0> N;
  array[N] int<lower=0> cell_values;
  int n; // number of trials in the cell
  int k; // which order statistic we observe (k = n for the max)
}

parameters {
  real<lower=0> lambda;
}

model {
  for (i in 1:N) {
    target += lgamma(n + 1) - lgamma(k) - lgamma(n - k + 1)
              + (k - 1) * poisson_lcdf(cell_values[i] | lambda)
              + (n - k) * poisson_lccdf(cell_values[i] | lambda)
              + poisson_lpmf(cell_values[i] | lambda);
  }
}

I have a problem though: the Bayesian solution is consistently biased.

Here’s a driver I use:

generative_model_4 <- function(n, k, lambda, N) {
  # Draw N cells of n Poisson(lambda) counts each, sort every cell,
  # and return the k-th order statistic of each cell
  data <- rpois(n * N, lambda)
  dim(data) <- c(n, N)

  for (i in 1:N) {
    data[, i] <- sort(data[, i])
  }

  return(list(N = N, cell_values = data[k, ], n = n, k = k))
}

data <- generative_model_4(15, 15, 0.15, 100)
data
# Run the model
fit <- sampling(model, data=data, iter=10000, chains=1, seed=12346)

fit

With output:

> fit
Inference for Stan model: anon_model.
1 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=500.

        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
lambda  0.25    0.00 0.02  0.21  0.23  0.25  0.26  0.29   150    1
lp__   27.77    0.05 0.73 25.50 27.64 28.03 28.22 28.28   198    1

Samples were drawn using NUTS(diag_e) at Wed Oct  4 16:27:59 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Clearly, the reconstructed lambda is higher than the actual one. It is not just bad luck: the inferred lambda is consistently above the actual value for every input combination I have tried.

The situation is much worse for order statistics other than the max.

What am I doing wrong?

I don’t think this has anything to do with the Bayesian approach to estimation. If I simulate data with lambda = 50 and n = k = N = 5000, I get back a maximum likelihood estimate of approximately 50.3 over multiple replicates (using e.g. optimizing inside rstan). This is much less relative bias, but more absolute bias, than with lambda = 0.15. This is not entirely surprising: we have the order-statistic density/mass function as our likelihood, and the maximum likelihood estimator is not unbiased.
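In case it’s useful, here’s a minimal sketch of that check; it assumes model is your compiled stanmodel from above and reuses your generative_model_4() driver (the exact call is illustrative):

# Minimal sketch of the MLE check described above; assumes `model` is the
# compiled stanmodel from the earlier post and reuses generative_model_4().
sim <- generative_model_4(n = 5000, k = 5000, lambda = 50, N = 5000)
mle <- rstan::optimizing(model, data = sim)
mle$par["lambda"]  # approximately 50.3 in my replicates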

I wrote some code to test a wide range of the values that control data generation, but I get some idiosyncratic computational errors for certain combinations. Maybe it will be helpful for you anyway.
ordered-statistics-likelihood.R (2.2 KB)


If lambda is the same between trials, then you can just aggregate across all N cells and take the k^{th} order statistic of the pooled sample. However, order statistics for discrete distributions work a bit differently from what you have; see Order statistic - Wikipedia.

The max is the easier one, so I’ll just focus on that.
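For a discrete distribution the PMF of the maximum of n i.i.d. draws is

P(X_{(n)} = m) = F(m)^n - F(m-1)^n,

and the log_diff_exp expression in the model below evaluates that difference stably on the log scale. Here’s code to get the max of a large pooled sample: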

data {
  int<lower=0> max_value;
  int n; // total number of draws pooled together
  int k; // which order statistic we observe (unused here, since we take the max)
}
parameters {
  real<lower=0> lambda;
}
model {
  // log P(max = max_value) = log( F(max_value)^n - F(max_value - 1)^n ),
  // using F(m) - P(X = m) = F(m - 1) to get the lower CDF term
  real log_cache = poisson_lcdf(max_value | lambda);
  target += log_diff_exp(n * log_cache,
                         n * log_diff_exp(log_cache, poisson_lpmf(max_value | lambda)));
}

Let’s say we do a similar simulation but we aggregate all the samples and just observe the max.

library(cmdstanr)

n <- 15
N <- 15 * 100
data <- rpois(n * N, lambda = 0.15)

order_stat_mod <- cmdstan_model("order_stat_forum_solution.stan")

mod_out <- order_stat_mod$sample(
  data = list(
    max_value = max(data),
    n = n * N,
    k = n * N
  ),
  parallel_chains = 4,
  adapt_delta = 0.9
)
mod_out$summary("lambda")

Note that I had to increase adapt_delta a bit.

The output is

  variable  mean median     sd    mad     q5   q95  rhat ess_bulk ess_tail
  <chr>    <num>  <num>  <num>  <num>  <num> <num> <num>    <num>    <num>
1 lambda   0.122  0.115 0.0479 0.0488 0.0518 0.208  1.00     916.     895.

If you had many samples, and you thought there was variability in lambda or you wanted to do something else like a hierarchical model, you can put N back in like so:

data {
  int<lower=0> N;
  array[N] int<lower=0> cell_values;
  int n; // number of trials in the cell
  int k; // which order statistic we observe (k = n for the max)
}

parameters {
  real<lower=0> lambda;
}
model {
  for (i in 1:N) {
    real log_cache = poisson_lcdf(cell_values[i] | lambda);
    target += log_diff_exp(n * log_cache,
                           n * log_diff_exp(log_cache, poisson_lpmf(cell_values[i] | lambda)));
  }
}
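In case it helps, a quick usage sketch for this per-cell version; order_stat_mod_N is a hypothetical name for this model compiled with cmdstanr, and it reuses the generative_model_4() driver from earlier in the thread:

# Usage sketch (hypothetical model object name); reuses generative_model_4(),
# whose return value matches this model's data block exactly.
sim <- generative_model_4(n = 15, k = 15, lambda = 0.15, N = 100)
mod_out_N <- order_stat_mod_N$sample(data = sim, parallel_chains = 4)
mod_out_N$summary("lambda")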