Two parameter distribution over the simplex

I am working with some compositional data and looking for the appropriate prior distribution over the simplex. It appears that the dirichlet distribution does not allow heavy enough tails for the data and am looking at other choices.

I have come across several alternative 1 parameter distributions that I can implement in Stan, namely the normalized inverse Gaussian, and the normalized 1/2-stable distribution. However, I am looking for a two parameter distribution that nests the Dirichlet as a special case.

The normalized generalized Gamma, normalized stable distribution, Pitman-Yor type process distribution fit the bill, but do not appear to have analytic solutions to the density function.

Is there a way to implement these in Stan? Is there an alternative distribution that might be suitable?

2 Likes

I did come across this post by @ariddell on the old forum suggesting it was possible to implement the truncated Pitman-Yor process, which would work. However, as far as I can tell eg from lemma 1 of [1], the p.d.f. would involve a summation over the permutation of the classes. Though perhaps there is a simpler way to compute this?

[1] RodrĆ­guez, A., & Quintana, F. A. (2015). On species sampling sequences induced by residual allocation models. Journal of Statistical Planning and Inference, 157, 108ā€“120. http://doi.org/10.1016/j.jspi.2014.08.008

If you want to add additional dispersion then make the Dirichlet scale itself a hyperparameter,

theta ~ dirichlet(alpha);
alpha ~ inv_gamma(0.5 * nu, 0.5 * nu * s * s);

Depending on what prior you give to alpha you get recover an variety of interesting marginal behaviors for `theta`.
1 Like

My reference for Pitman-Yor / Poisson-Dirichlet stuff is Buntine, Wray, and Marcus Hutter. 2010. ā€œA Bayesian View of the Poisson-Dirichlet Process.ā€ ArXiv:1007.0296 [Cs, Math, Stat], July. http://arxiv.org/abs/1007.0296.

Thereā€™s a truncated formulation that uses an improper Dirichlet. I think it could be adapted to work in Stan. It starts on p. 26.

This Hutter piece might also have something useful: Hutter, Marcus. 2013. Sparse Adaptive Dirichlet- Multinomial-like Processes.

Thanks @ariddell. Iā€™ll take a look at that case. Is it possible to use improper priors like this in a hierarchical model?

@betanalpha: Interesting trick. Do you have any references on that or where to read more? Looks like when nu=1, it would recover the normalized 1/2 stable distribution?

re: improper priors: I donā€™t think itā€™s impossible. Stan math might not let you do it right now.

Itā€™s a standard technique for generalizing distributions thatā€™s where many ā€œgeneralizedā€ distributions arise. With Stan we can just build the joint model and not worry about having to marginalize out the hyperparameters analytically in closed form.

1 Like

I think the result of lemma 1 in your reference [1] is more of theoretical concern? Itā€™s the exchangeable partition probability function, which is basically the induced pdf over the partitions of N elements, while you are interested in distributions over the simplex which is instead given by the stickbreaking process described in the beginning of section 2. There is one for poisson-Dirichlet which I believe is the same as Pitman-Yor? The issue is that truncation of the infinite stickbreaking process means that the last weight (w_N=1-\sum_{n=1}^Nw_n) has an induced distribution that doesnā€™t quite match the other weightsā€™. This will mostly be a problem for low-dimensional simplices.

The Pitman-Yor is the same as the two parameter poisson-Dirichlet and has a stickbreaking distribution of
Beta(1āˆ’a, b+i*a)
where i is the stick index.

My sense looking at that is that it appears to depend on the indexing of the sticks. Particularly that they are in the ā€œsize-biasedā€ order. But admittedly, am not totally clear on it and it is possible that it all works out so that the order doesnā€™t matter?

The stickbreaking processes (at least for DP and Pitman-Yor) were designed to have a rich-get-richer behaviour, so you are right that it will not be symmetric in general. But you argued earlier that you thought the Pitman-Yor might work - it should be implementable in Stan using stickbreaking. If you want symmetry, you could do a random permutation or a mixture, maybe?
Are you trying to encourage your simplex vectors to go to the edges of the simplex? My best guess at your original problem is that you feel that the Dirichlet for \alpha<1 is too quickly overwhelmed by data. You could instead construct a mixture of Dirichlets, with one centered on each corner. Your two parameters could be the concentration around each component mode and the distance from each mode to the corner.

Yes, Iā€™m working with species abundance data which was one of the original uses of the model, before itā€™s rediscovery by Pitman & Yor [1]. It seems that the stick breaking aspect can be implemented in Stan, but it would also require size biased resampling to recover the Pitman-Yor process, which would be impossible to do in Stan. No?

Exactly, even with \alpha < 1, it seems that the data is not getting close enough to edge of the simplex. Abundant species do not appear abundant enough in the posterior, less abundant species appear too abundant.

@betanalpha That makes sense. One question about parameterization though. It seems like this would be the same model as the normalized generalized inverse Gaussian, but with a different posterior geometry. However, I know that you have written before these normalized models can cause problems. Iā€™m curious why this parameterization works better or are there scenarios when the normalization parameterization would work better or are these not actually the same model?

[1] Engen, S. (1978). Stochastic Abundance Models. (S. Engen, Ed.). London: Chapman and Hall Ltd.

No generic answer. You have to study the interaction of your prior, your observation model, and the range of feasible data to identify which parameterization or model is suitable.

I am unsure exactly what you are asking for with your Pitman-Yor request. If you want the infinite untruncated variant, I think you are out of luck. But the stickbreaking process directly implements the marginalized Pitman-Yor. The truncated version is furthermore a distribution on the simplex. The reason why it is asymmetric is that for the full process you would associate each weight with a dirac delta at a location that is sampled i.i.d., which makes the ordering of the weights uninformative. But of course, if what you want is to use the simplex distribution to encode e.g. a mixture where each component has an informative prior, then the fact that the weights are size-biased is an issue.

Thanks. I do want the truncated version and I was confused about how the stickbreaking process directly implements the marginalized Pitman-Yor. However, I think Iā€™m still confused wow the stickbreaking process directly implements the marginalized Pitman-Yor. In the description by Engen (1978), the sticks are first ordered by a random permutation by picking a point uniformly at random on the unit interval, selecting that segment as q_1, removing the segment, rescaling the remaining segments, repeating to get q_2, and so on until the points are ordered up to q_n. Iā€™ve attached the relevant section from Engen 1978 at the end.

Then these sticks are distributed with \mbox{beta}(\kappa+1,\alpha-\kappa i).

So Iā€™m not sure how to do the permutation step, or how you can justify leaving it out. It seems like this permutation would need to be done at every leapfrog step, but that seems like it does not work with the HMC.

Maybe the difference is this would work with clustering where the cluster labels are interchangeable? I am working with species abundance data, where the species identities are not interchangeble.

I might be misunderstanding your question, but Iā€™ll attempt to ground us in some formulas hereā€¦

Going back to your reference [1], section 2 gives the general stickbreaking process by defining the random measure

G(\cdot)=\sum_{k=1}^\infty\omega_k\delta_{X_k^*}(\cdot)

where X_k^*\sim G_0 would be the mixture component parameters if this was a DP or PY mixture, with G_0 being the prior. The stickbreaking comes from defining

\omega_k=z_k\prod_{\ell=1}^{k-1}(1-z_k),\quad z_k\sim H_k

where z_k is the stickbreaking proportion, and H_k is a measure on [0,1]. It is \beta(1,\alpha) for the Dirichlet process for instance.

Now I think your confusion comes from the fact that G(\cdot) does not depend on the ordering of the \omega_k ā€” we can do any permutation of the indices without changing the resulting measure. In particular, we can define the stickbreaking process to sample the \omega_k in a size-biased order, without that affecting G.

Now, if you truncate the stickbreaking process by setting z_K=1, the vector w=(\omega_1,...,\omega_K) defines a point on the simplex with weights drawn according to the truncated Pitman-Yor process, but it will be size-biased.

Hmm, coming to the end of this I think I can agree with you that correcting fully for the size-biased ordering is not possible without summing over all permutations. If stan allowed discrete sampling we could do random sampling of permutations, but without that option we have to sum/integrate.

1 Like

Okay, as Iā€™ve been reading more about it, Iā€™m not entirely sure I was correct. From the residual allocation model in [1], it appears that any stick-breaking law of the form \mbox{Beta}(a,b+i(1-a)) will be symmetric. This is because if you have two subsequent breaks, and the first is distributed \mbox{Beta}(a,b) and the second is distributed \mbox{Beta}(a,b+1-a)), then the joint law is symmetric.

With the Dirichlet recovered when a=1. And the Pitman-Yor recovered when a=1-\alpha and b=\theta.

So as you said originally, maybe it does not depend on the ordering, even though the joint law is presented as summing over the orderings[2]? Perhaps I should have stopped reading when I understood what has going on.

[1] Pitman, J. (1996). Random discrete distributions invariant under size-biased permutation. Advances in Applied Probability, 28(2), 525ā€“539. Random discrete distributions invariant under size-biased permutation | Advances in Applied Probability | Cambridge Core
[2] RodrĆ­guez, A., & Quintana, F. A. (2015). On species sampling sequences induced by residual allocation models. Journal of Statistical Planning and Inference, 157, 108ā€“120. Redirecting

Wow, thatā€™s a much stronger result than I thought was availableā€¦ And very much counterintuitive; we are biasing by size, but the joint distribution is not biased by size?

Truncation will mess that up to a degree though.

edit: also, now truncation seems increasingly problematic - ordinarily you rely on the weights becoming smaller as the stickbreaking procedure continues, but with this result isnā€™t the last stick equally likely to be large?

Is there something one could do with an ordered simplex variable? (Stan has an ordered variable type, which could be used for such a purpose.)

I realize an ordered simplex is not exactly what is wanted here ā€“ but I wonder if we might be able to get rather close to the desired truncated representation.

I may have missed it, but have you considered a nested Dirichlet? I think this has all sorts of good properties.

Something like:

theta ~ dirichlet(nu * alpha);
alpha ~ dirichlet(beta);
nu ~ gamma(a, b);

Yes, it seems unintuitive to me as well! We are biasing by size since there is less stick to break off at each step, and the second parameter of the beta distribution increases at each step.

This is the relevant bit from the Pitman paper.

I think you are correct about how truncation messes this up. The p + q < 1 constraint means that the penultimate stick cannot be swapped with the last stick.

However, it would mean that you only need to sum over the N possible last segments, rather than the N! size biased permutations. It doesnā€™t quite match up with my intuition, but as far as I can tell, it does seem to be what the math saysā€¦

I thought about something like this, but thought it would be difficult to identify. Now that I have the term for it, it does look like there is some literature on these types of models though. Any pointers on where to get started? My sense is that these would end up with models would marginalize to something quite similar to what Mike was suggesting?