I am trying to estimate an “overlap” probability from two distributions. Basically, I count the number of observations less than or equal to some cutoff. I treat the counts as binomial, but other discrete distributions are possible.
I’ve computed MLEs (Maximum Likelihood Estimates) and now want to see how Stan fares. However, the MLE is much much larger than any of the posterior draws. My problem appears quite similar to:
For instance, in my case (K
= 1) the MLE for the count (see code below) is y_upr = 654
, but the posterior only ranges from 0-13. Further (below), N
is 25 and M
is 269692. intra
ranges from 0-0.01667 and inter
ranges from 0-0.17667. A posterior predictive check check yields a mean of only 0.975 counts.
Here is a plot:
I can’t really inject prior information into the model, because I have none. I tried to model counts as both a Poisson and a negative binomial, since these have unbounded support. However, no luck.
Below is my Stan program:
data {
int<lower = 1> K; // number of species in genus
int<lower = 1> N[K]; // number of intraspecific (within-species) genetic distances for each species
int<lower = 1> M; // number of interspecific (among-species) genetic distances for all species
array[K] row_vector<lower = 0, upper = 1>[max(N)] intra; // intraspecific genetic distances for each species
vector<lower = 0, upper = 1>[M] inter; // interspecific genetic distances for all species
}
transformed data {
int<lower = 0, upper = M> y_upr[K]; // count of interspecific distances for all species less than or equal to max_intra for each species
real<lower=0, upper=1> max_intra[K]; // maximum intraspecific distance for each species
for (k in 1:K) {
max_intra[k] = max(intra[k, 1:N[k]]);
y_upr[k] = 0;
y_upr[k] += (inter[k] <= max_intra[k]); // count interspecific distances for all species less than or equal to max_intra for each species
}
}
parameters {
real<lower = 0, upper = 1> p_upr[K]; // parameter representing the proportional overlap/separation between interspecific distances for all species and intraspecific distances for each species
}
model {
for (k in 1:K) {
// prior
p_upr[k] ~ uniform(0, 1);
// likelihood
y_upr[k] ~ binomial(M, p_upr[k]); // likelihood for interspecific distances equalling or falling below max_intra
}
}
generated quantities {
// Posterior Predictive Checks
int post_y_upr[K];
for (k in 1:K) {
post_y_upr[k] = binomial_rng(M, p_upr[k]);
}
}
Any other ideas here? Happy to clarify if needed.