# How to estimate the Poisson parameter after seeing max(n samples)

Hi!

I have a process that draws a fixed number (n) of samples from a Poisson distribution with unknown lambda.

What I observe is a maximum (or in general any other order statistic) of those n draws.

How do I infer the lambda parameter?

In more general terms, can Stan calculate the k-th order distribution and use it to infer the parameters?

You can apply the order statistic formula to figure out the PDF/likelihood for the maximum, which you could choose to model in Stan as something like:

``````
data {
  int n_obs;
  int observed_max;
}

parameters {
  real<lower=0> lambda;
}

model {
  target += log(n_obs) + (n_obs - 1) * poisson_lcdf(observed_max | lambda) +
            poisson_lpmf(observed_max | lambda);
}
``````
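
Written out, the density of the maximum of $n$ i.i.d. draws with CDF $F$ and density/mass $f$ is

$$f_{(n)}(y) = n\,F(y)^{n-1} f(y),$$

which is exactly what the `target` increment above computes on the log scale with the Poisson lcdf and lpmf (note this is the continuous-distribution form of the order-statistic result).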

That’s a very good lead, thank you.

Here’s my code, which I have checked numerically to be consistent with yours, but which can be applied to any order statistic:

``````
data {
  int<lower=0> N;
  array[N] int<lower=0> cell_values;
  int n; // number of trials in the cell
  int k; // which order statistic we see (e.g. n if the max)
}

parameters {
  real<lower=0> lambda;
}

model {
  for (i in 1:N) {
    target += lgamma(n + 1) - lgamma(k) - lgamma(n - k + 1) +
              (k - 1) * poisson_lcdf(cell_values[i] | lambda) +
              (n - k) * poisson_lccdf(cell_values[i] | lambda) +
              poisson_lpmf(cell_values[i] | lambda);
  }
}
``````
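
For reference, the general k-th order statistic density this implements is

$$f_{(k)}(y) = \frac{n!}{(k-1)!\,(n-k)!}\, F(y)^{k-1} \bigl(1 - F(y)\bigr)^{n-k} f(y),$$

with the leading combinatorial factor written via `lgamma`, and the CDF, complementary CDF, and PMF supplied by `poisson_lcdf`, `poisson_lccdf`, and `poisson_lpmf`.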

I have a problem though: the Bayesian solution is consistently biased.

Here’s a driver I use:

``````
library(rstan)

generative_model_4 <- function(n, k, lambda, N) {
  # Simulate N cells of n Poisson(lambda) particle counts each and
  # return the k-th order statistic of every cell
  data <- rpois(n * N, lambda)
  dim(data) <- c(n, N)

  for (i in 1:N) {
    data[, i] <- sort(data[, i])
  }

  return(list(N = N, cell_values = data[k, ], n = n, k = k))
}

data <- generative_model_4(15, 15, 0.15, 100)
data

# Run the model (model is the compiled Stan model from above)
fit <- sampling(model, data = data, iter = 10000, chains = 1, seed = 12346)

fit
``````

With output:

``````
> fit
Inference for Stan model: anon_model.
1 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=500.

        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
lambda  0.25    0.00 0.02  0.21  0.23  0.25  0.26  0.29   150    1
lp__   27.77    0.05 0.73 25.50 27.64 28.03 28.22 28.28   198    1

Samples were drawn using NUTS(diag_e) at Wed Oct  4 16:27:59 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
``````

Clearly, the reconstructed lambda is higher than the actual one. It is not bad luck: the inferred lambda is consistently above the actual value for all the input combinations I have tried.

The situation is much worse for other order statistics than the max.

What am I doing wrong?

I don’t think this has anything to do with the Bayesian approach to estimation. If I simulate data with `lambda = 50` and `n = k = N = 5000`, I get back a maximum likelihood estimate of approximately `50.3` over multiple replicates (using e.g. `optimizing` inside `rstan`). This is much less relative bias, but more absolute bias, than with `lambda = 0.15`. This is not entirely surprising: we have the order-statistics density/mass function as our likelihood, and the maximum likelihood estimator is not unbiased.
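
For concreteness, here is a minimal sketch of that check, assuming the k-th order statistic model from earlier in the thread is saved as `order_statistic.stan` and reusing `generative_model_4()` (the file name and this exact driver are assumptions, not the attached script):

``````
library(rstan)

# Assumed file name for the order-statistic model posted above
model <- stan_model("order_statistic.stan")

# One replicate of the check: n = k = N = 5000, lambda = 50
sim <- generative_model_4(n = 5000, k = 5000, lambda = 50, N = 5000)

# Maximum likelihood estimate via rstan's optimizer
mle <- optimizing(model, data = sim)
mle$par["lambda"]
``````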

I wrote some code to test a wide range of values that control the data generation, but I get some idiosyncratic computational errors for certain combinations. Maybe it will be helpful for you anyway.
ordered-statistics-likelihood.R (2.2 KB)


If lambda is the same between trials, then you can just aggregate across all `N` cells and take the k-th order statistic of the pooled sample. Getting order statistics for discrete distributions is a bit different from what you have; see Order statistic - Wikipedia.
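
Concretely, for a discrete distribution the PMF of the maximum of $n$ i.i.d. draws is

$$P(\max = m) = F(m)^n - F(m-1)^n,$$

and since $F(m-1) = F(m) - P(Y = m)$ this can be computed stably on the log scale with `log_diff_exp`, which is what the code below does.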

The max is the easier one, so I’ll just focus on that. Here’s code for the max of a large aggregated sample:

``````
data {
  int<lower=0> max_value;
  int n; // number of trials in the cell
  int k; // which order statistic we see (e.g. n if the max)
}
parameters {
  real<lower=0> lambda;
}
model {
  // P(max = m) = F(m)^n - F(m - 1)^n, with F(m - 1) = F(m) - P(Y = m),
  // all evaluated on the log scale
  real log_cache = poisson_lcdf(max_value | lambda);
  target += log_diff_exp(n * log_cache,
                         n * log_diff_exp(log_cache, poisson_lpmf(max_value | lambda)));
}
``````
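
As a quick sanity check of that mass function, you can compare it against simulation (the numbers here are illustrative, not from the thread):

``````
# Compare the analytic P(max = m) = F(m)^n - F(m - 1)^n against a Monte Carlo estimate
lambda <- 0.15
n <- 15
m <- 2

analytic <- ppois(m, lambda)^n - ppois(m - 1, lambda)^n
sims <- replicate(1e5, max(rpois(n, lambda)))

c(analytic = analytic, simulated = mean(sims == m))
``````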

Let’s say we do a similar simulation but we aggregate all the samples and just observe the max.

``````
library(cmdstanr)

n <- 15
N <- 15 * 100
data <- rpois(n * N, lambda = 0.15)

order_stat_mod <- cmdstan_model("order_stat_forum_solution.stan")

mod_out <- order_stat_mod$sample(
  data = list(
    max_value = max(data),
    n = n * N,
    k = n * N
  ),
  parallel_chains = 4
)
mod_out$summary("lambda")
``````

The output is

``````
  variable  mean median     sd    mad     q5   q95  rhat ess_bulk ess_tail
  <chr>    <num>  <num>  <num>  <num>  <num> <num> <num>    <num>    <num>
1 lambda   0.122  0.115 0.0479 0.0488 0.0518 0.208  1.00     916.     895.
``````

If you had many samples and you thought there was variability in lambda, or you wanted to do something else like a hierarchical model, you can put N back in like so:

``````
data {
  int<lower=0> N;
  array[N] int<lower=0> cell_values;
  int n; // number of trials in the cell
  int k; // which order statistic we see (e.g. n if the max)
}

parameters {
  real<lower=0> lambda;
}

model {
  for (i in 1:N) {
    real log_cache = poisson_lcdf(cell_values[i] | lambda);
    target += log_diff_exp(n * log_cache,
                           n * log_diff_exp(log_cache, poisson_lpmf(cell_values[i] | lambda)));
  }
}
``````
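
If it helps, here is a hypothetical driver for this per-cell version, reusing `generative_model_4()` from earlier in the thread (the file name is an assumption):

``````
library(cmdstanr)

per_cell_mod <- cmdstan_model("order_stat_per_cell.stan")

# Same setup as the original question: 100 cells of 15 draws, observe the max
data <- generative_model_4(n = 15, k = 15, lambda = 0.15, N = 100)

fit <- per_cell_mod$sample(data = data, parallel_chains = 4)
fit$summary("lambda")
``````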