Outlier detection for Dirichlet-distributed data


Suppose I have a set of points X that I model with a Gaussian distribution. If I want to assess how a new point Y fits this distribution, I can compute the probability of observing at least as extreme a point under that distribution (e.g. via a z-score).

In the multivariate normal setting, one can compute the probability mass of the region outside the density contour passing through Y or, from what I have read, use the Mahalanobis distance in place of the z-score.
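For concreteness, here is a minimal sketch of the Mahalanobis approach in the bivariate case. It relies on the fact that, for a d-dimensional normal, the squared Mahalanobis distance follows a chi-square distribution with d degrees of freedom, and that for d = 2 the chi-square survival function is simply exp(-x/2). The function names `mahalanobis_sq_2d` and `tail_prob_2d` are made up for illustration, and the mean and covariance are assumed to be known rather than estimated from X.

```python
import math

def mahalanobis_sq_2d(y, mean, cov):
    """Squared Mahalanobis distance of y from a bivariate normal (mean, cov)."""
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of the 2x2 covariance matrix, written out explicitly.
    inv = ((d / det, -b / det), (-c / det, a / det))
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

def tail_prob_2d(y, mean, cov):
    """Probability mass outside the density contour passing through y."""
    # In 2 dimensions D^2 ~ chi-square(2), whose survival function is exp(-x/2).
    return math.exp(-mahalanobis_sq_2d(y, mean, cov) / 2.0)

if __name__ == "__main__":
    print(tail_prob_2d((1.0, 0.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))))
```

In higher dimensions the same idea would need a general chi-square survival function (for example from a statistics library) instead of the closed form used here.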

Now, for data that live on the simplex, I suppose the contour idea still applies. Are there analytic solutions for computing that probability, or similar distance measures?

Many thanks!


Hello! Kindly let me know if you’ve already considered this, but if you’re looking to find an analogy for confidence intervals for Dirichlet distributions, would the entropy measure be of use? Or perhaps a point estimate if you’re looking to find the probability of an outlier occurring?



Thank you very much for your reply! I am not sure how the entropy way you suggest would be implemented. Could you please elaborate on this?

Reflecting upon it, I realised that I actually came into this thinking along the lines of getting 1 - CDF for a Dirichlet. Its CDF is not trivial, but a Monte Carlo approximation may do the trick. Indeed, MC can also yield 1 - CDF for the multivariate normal case.

Any thoughts?
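One possible way to make the Monte Carlo idea concrete is sketched below. It is only one reading of "1 - CDF": rather than a CDF in the usual coordinate-wise sense, it estimates the probability that a draw from the Dirichlet has density no higher than the density at Y, which is the mass outside the density contour through Y. It assumes the concentration parameters `alpha` are already known (fitting them to X is not shown), and the helper names are invented for this example.

```python
import math
import random

def dirichlet_logpdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x in the open simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalising independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [gi / total for gi in g]

def density_tail_prob(y, alpha, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(p(X) <= p(y)) for X ~ Dirichlet(alpha).

    Assumes alpha values are not so small that a sampled component
    underflows to exactly 0, which would break the log in the density.
    """
    rng = random.Random(seed)
    log_p_y = dirichlet_logpdf(y, alpha)
    hits = sum(
        dirichlet_logpdf(sample_dirichlet(alpha, rng), alpha) <= log_p_y
        for _ in range(n_samples)
    )
    return hits / n_samples

if __name__ == "__main__":
    alpha = (5.0, 3.0, 2.0)
    print(density_tail_prob((0.5, 0.3, 0.2), alpha))     # typical point
    print(density_tail_prob((0.02, 0.02, 0.96), alpha))  # far from the bulk
```

A small value suggests Y sits in a low-density region. Because the estimate ranks points by density rather than by position, it does not require the contour through Y to be a single connected curve.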

Note that the CDF you’re suggesting for the Dirichlet might interact non-intuitively with some Dirichlet distributions. Dirichlet PDFs can be multimodal, with troughs rather than peaks in the middle. Put differently, the contour lines that you refer to in the original post need not be connected to one another, and you don’t know a priori which side of a contour line will have higher versus lower probability density. (Some Dirichlets have contours of constant probability density that form closed loops with low probability density inside and high probability density outside.)
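A quick numerical check illustrates the trough behaviour. When every concentration parameter is below 1, the density is lowest in the middle of the simplex and rises toward the boundary, whereas with parameters above 1 the centre is the peak. The parameter values and test points below are arbitrary choices for illustration.

```python
import math

def dirichlet_logpdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x in the open simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))

centre = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
near_vertex = (0.98, 0.01, 0.01)

# All parameters below 1: the centre is a trough, the density rises near the edges.
sparse = (0.5, 0.5, 0.5)
trough = dirichlet_logpdf(centre, sparse) < dirichlet_logpdf(near_vertex, sparse)

# All parameters above 1: the centre is the single peak.
peaked = (5.0, 5.0, 5.0)
peak = dirichlet_logpdf(centre, peaked) > dirichlet_logpdf(near_vertex, peaked)

if __name__ == "__main__":
    print("centre is a trough for alpha = 0.5:", trough)
    print("centre is a peak for alpha = 5:", peak)
```

In the first case a point near the centre of the simplex is the "extreme" one in density terms, even though it is far from the boundary.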

That’s a good point, thank you! It seems to support the idea of MC estimates to me.