I’d like to estimate the degree of separation/overlap between two distributions and I’m wondering if Stan can do it (yes, probably!). Specifically, I’d like to estimate the quantity min(second distribution) - max(first distribution), where min and max are tail extrema. The distributions here reflect pairwise genetic distances within and among species, respectively, so they’re always positive.
I can easily specify the data and generated quantities blocks:
data {
  int<lower = 0> N; // number of genetic distances
  vector<lower = 0, upper = 1>[N] intra; // intraspecific (within-species) genetic distances
  vector<lower = 0, upper = 1>[N] inter; // interspecific (among-species) genetic distances
}
generated quantities {
  real gap_est;
  gap_est = min_inter - max_intra;
}
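As a point of comparison, the observed (plug-in) gap can be computed directly from the data with no parameters at all. This is only a minimal sketch, assuming the same data block as above; gap_obs is a name introduced here for illustration, and since there is no parameters block the program would be run with the fixed-parameter sampler.

```stan
data {
  int<lower = 0> N; // number of genetic distances
  vector<lower = 0, upper = 1>[N] intra; // intraspecific (within-species) genetic distances
  vector<lower = 0, upper = 1>[N] inter; // interspecific (among-species) genetic distances
}
generated quantities {
  // gap_obs is a hypothetical name: smallest among-species distance
  // minus largest within-species distance, taken from the sample itself
  real gap_obs = min(inter) - max(intra);
}
```

A negative value would indicate that the two samples overlap, a positive value that they are separated.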
I’m thinking a mixture model (essentially a density estimation problem) might make sense. The parameters block could then be:
parameters {
  simplex[2] theta; // mixing proportions
  real<lower = 0, upper = 1> min_inter; // location of minimum interspecific distance
  real<lower = 0, upper = 1> max_intra; // location of maximum intraspecific distance
  real<lower = 0> sigma_inter; // scale of minimum interspecific distance
  real<lower = 0> sigma_intra; // scale of maximum intraspecific distance
}
However, I’m stuck on the model block. After going through the Stan User Guide I’m wondering about:
model {
  min_inter ~ normal(0, 1) T[min(inter), max(inter)];
  max_intra ~ normal(0, 1) T[min(intra), max(intra)];
  sigma_inter ~ exponential(1);
  sigma_intra ~ exponential(1);
  theta ~ beta(2, 2);
  for (i in 1:N) {
    intra[i] ~ normal(max_intra, sigma_intra);
    inter[i] ~ normal(min_inter, sigma_inter);
  }
}
and how to incorporate theta. Actually, at the moment, my program is not a mixture model at all since it doesn’t use log_mix().
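For reference, here is a minimal sketch of how log_mix() is typically used for a two-component normal mixture. It assumes the distances are pooled into a single vector with the labels ignored, and it uses a scalar mixing proportion rather than simplex[2]; the names d, mu, and sigma are illustrative and not part of the program above.

```stan
data {
  int<lower = 0> N; // number of pooled genetic distances
  vector<lower = 0, upper = 1>[N] d; // hypothetical pooled distances (labels ignored)
}
parameters {
  real<lower = 0, upper = 1> theta; // mixing proportion of the first component
  ordered[2] mu; // component locations, ordered to avoid label switching
  vector<lower = 0>[2] sigma; // component scales
}
model {
  mu ~ normal(0, 1);
  sigma ~ exponential(1);
  theta ~ beta(2, 2);
  for (n in 1:N) {
    // log_mix marginalizes over which component generated d[n]
    target += log_mix(theta,
                      normal_lpdf(d[n] | mu[1], sigma[1]),
                      normal_lpdf(d[n] | mu[2], sigma[2]));
  }
}
```

Because intra and inter are already labeled in my data, this pooled formulation may not be what is needed; it only shows where theta would enter the likelihood.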
Any ideas on how to go about this easily?