Estimating separation/overlap between distributions

I’d like to estimate the degree of separation/overlap between two distributions and I’m wondering if Stan can do it (yes, probably!). Specifically, I’d like to estimate the quantity min(second distribution) - max(first distribution), where min and max are tail extrema. The distributions here reflect pairwise genetic distances within and among species, respectively, so they’re always positive.

I can easily specify the data and generated quantities block:

data {
  int<lower = 0> N; // number of genetic distances
  vector<lower = 0, upper = 1>[N] intra; // intraspecific (within-species) genetic distances
  vector<lower = 0, upper = 1>[N] inter; // interspecific (among-species) genetic distances
}

generated quantities {
  real gap_est;
  gap_est = min_inter - max_intra;
}

I’m thinking a mixture model (essentially a density estimation problem) might make sense. The parameters block could then be:

parameters {
  simplex[2] theta; // mixing proportions
  real<lower = 0, upper = 1> min_inter; // location of minimum interspecfic distance
  real<lower = 0, upper = 1> max_intra; // location of maximum intraspecfic distance
  real<lower = 0> sigma_inter;  // scale of minimum interspecfic distance
  real<lower = 0> sigma_intra;  // scale of maximum intraspecfic distance
}

However, I’m stuck on the model block. After going through the Stan User Guide I’m wondering about:

model {
  min_inter ~ normal(0, 1) T[min(inter), max(inter)];
  max_intra ~ normal(0, 1) T[min(intra), max(intra)];
  sigma_inter ~ exponential(1);
  sigma_intra ~ exponential(1);
  theta ~ beta(2, 2) 

  for (i in 1:N) {
    intra[i] ~ normal(max_intra, sigma_intra);
    inter[i] ~ normal(min_inter, sigma_inter);
  }
}

and how to incorporate theta. Actually, at the moment, my program is not a mixture model at all since it doesn’t specify log_mix().

Any ideas on how to got about this easily?

What does min(distribution) mean? If I have a normal distribution, there’s technically no minimum value. With an exponential distribution, 0 is the infimum, but it’s not technically in support.

Your data seems to be making this empirical. So why isn’t the answer max(inter) - min(intra) or something like that? Of course, that doesn’t require any sampling.

What do you think min_inter ~ normal(0, 1) T[...] is doing here? Also, presumably those bounds are just empirical bounds, so probably shouldn’t be part of the sampling.

Is the idea that you have two sets of points, X1 and X2, and want to fit a distribution to X1 and then a distribution to X2 and then compare tail statistics? That you can do with Stan. If you want to do it with a normal distribution, that’d just be:

parameters {
  real mu1, mu2;
  real<lower=0> sigma1, sigma2;
}
model {
  x1 ~ normal(mu1, sigma1);
  x2 ~ normal(mu2, sigma2);
}
generated quantities {
  real q1_05 = mu1 + invPhi(0.05) * sigma1;
  real q2_95 = mu2 + invPhi(0.95) * sigma2;
}

This just uses the quantile function invPhi() to pull out tail statistics of the fitted normal distributions. Unfortunately, Stan doesn’t have inverse cdf functions for other distributions. And even here, we only have it for the standard normal, so have to scale and offset.

1 Like

Thanks @Bob_Carpenter.

I had been playing with the model block and realize now that my priors are nonsensical since the normal and exponential are continuous PDFs.

Your proposal seems legitimate, but the two distributions I have are skewed for most cases. Regardless, I’ll try it out.

Also invPhi should be inv_Phi according to the Stan documentation.