Thanks @ariddell. I’ll take a look at that case. Is it possible to use improper priors like this in a hierarchical model?

@betanalpha: Interesting trick. Do you have any references on that or where to read more? Looks like when nu=1, it would recover the normalized 1/2 stable distribution?

It’s a standard technique for generalizing distributions; in fact, it’s where many “generalized” distributions come from. With Stan we can just build the joint model and don’t have to marginalize out the hyperparameters analytically in closed form.

I think the result of Lemma 1 in your reference [1] is more of a theoretical concern. It’s the exchangeable partition probability function, which is basically the induced pmf over the partitions of N elements, whereas you are interested in distributions over the simplex, which are instead given by the stickbreaking process described at the beginning of Section 2. There is one for the Poisson-Dirichlet, which I believe is the same as the Pitman-Yor. The issue is that truncating the infinite stickbreaking process means that the last weight, w_N = 1 - \sum_{n=1}^{N-1} w_n, has an induced distribution that doesn’t quite match the other weights’. This will mostly be a problem for low-dimensional simplices.

The Pitman-Yor is the same as the two-parameter Poisson-Dirichlet and has stickbreaking distributions \mbox{Beta}(1 - a, b + i a), where i is the stick index.
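For concreteness, here is a minimal Stan sketch of a truncated prior of this form on a K-simplex; the names (a for the discount, b for the strength) and the choice to pass them as data are my own, and the last weight is simply the leftover stick:

```stan
// Sketch: truncated Pitman-Yor-style stick-breaking prior on a
// K-simplex. Parameter names (a = discount, b = strength) are
// illustrative, with 0 <= a < 1.
data {
  int<lower=2> K;
  real<lower=0, upper=1> a;   // discount parameter
  real<lower=0> b;            // strength parameter
}
parameters {
  vector<lower=0, upper=1>[K - 1] v;   // stick-breaking proportions
}
transformed parameters {
  simplex[K] w;
  {
    real rest = 1;
    for (i in 1:(K - 1)) {
      w[i] = v[i] * rest;
      rest -= w[i];
    }
    w[K] = rest;   // leftover stick; its induced law differs from the rest
  }
}
model {
  for (i in 1:(K - 1)) {
    v[i] ~ beta(1 - a, b + i * a);
  }
}
```

As noted above, the truncation dumps whatever mass is left into w[K], whose distribution therefore doesn’t match the other weights’.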

My sense from looking at that is that it depends on the indexing of the sticks, particularly that they are in “size-biased” order. But admittedly I’m not totally clear on it, and it’s possible it all works out so that the order doesn’t matter?

The stickbreaking processes (at least for the DP and Pitman-Yor) were designed to have rich-get-richer behaviour, so you are right that they will not be symmetric in general. But you said earlier that you thought the Pitman-Yor might work, and it should be implementable in Stan using stickbreaking. If you want symmetry, you could use a random permutation or a mixture, maybe?
Are you trying to encourage your simplex vectors to go to the edges of the simplex? My best guess at your original problem is that you feel the Dirichlet with \alpha < 1 is too quickly overwhelmed by data. You could instead construct a mixture of Dirichlets, with one centered near each corner. Your two parameters could be the concentration around each component mode and the distance from each mode to its corner.
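As a sketch of that mixture-of-Dirichlets idea (the names kappa for the per-component concentration and d for the mode-to-corner distance are my assumptions, not anything from the thread):

```stan
// Sketch: equal-weight mixture of K Dirichlets, with component k's
// mode placed at distance d from corner k. kappa and d are
// illustrative hyperparameter names.
data {
  int<lower=2> K;
  real<lower=0> kappa;        // concentration around each mode
  real<lower=0, upper=1> d;   // distance from each mode to its corner
}
parameters {
  simplex[K] theta;
}
model {
  vector[K] lp;
  for (k in 1:K) {
    vector[K] m = rep_vector(d / (K - 1), K);
    m[k] = 1 - d;   // intended mode of component k
    // Dirichlet(kappa * m + 1) has its mode exactly at m
    lp[k] = dirichlet_lpdf(theta | kappa * m + 1);
  }
  target += log_sum_exp(lp) - log(K);   // equal mixture weights
}
```

This keeps the mixture symmetric across corners while letting kappa control how sharply mass concentrates near the edges.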

Yes, I’m working with species abundance data, which was one of the original uses of the model [1] before its rediscovery by Pitman & Yor. It seems that the stickbreaking aspect can be implemented in Stan, but it would also require size-biased resampling to recover the Pitman-Yor process, which would be impossible to do in Stan. No?

Exactly. Even with \alpha < 1, the data does not seem to get close enough to the edge of the simplex: abundant species do not appear abundant enough in the posterior, and less abundant species appear too abundant.

@betanalpha That makes sense. One question about parameterization, though. It seems like this would be the same model as the normalized generalized inverse Gaussian, but with a different posterior geometry. However, I know you have written before that these normalized models can cause problems. I’m curious why this parameterization works better. Are there scenarios where the normalized parameterization would work better, or are these not actually the same model?

[1] Engen, S. (1978). Stochastic Abundance Models. (S. Engen, Ed.). London: Chapman and Hall Ltd.

No generic answer. You have to study the interaction of your prior, your observation model, and the range of feasible data to identify which parameterization or model is suitable.

I am unsure exactly what you are asking for with your Pitman-Yor request. If you want the infinite, untruncated variant, I think you are out of luck. But the stickbreaking process directly implements the marginalized Pitman-Yor, and the truncated version is furthermore a distribution on the simplex. The reason it is asymmetric is that for the full process you would associate each weight with a Dirac delta at a location sampled i.i.d., which makes the ordering of the weights uninformative. But of course, if what you want is to use the simplex distribution to encode, e.g., a mixture where each component has an informative prior, then the fact that the weights are size-biased is an issue.

Thanks. I do want the truncated version, but I’m still confused about how the stickbreaking process directly implements the marginalized Pitman-Yor. In the description by Engen (1978), the sticks are first put in a size-biased order by a random permutation: pick a point uniformly at random on the unit interval, select the segment containing it as q_1, remove that segment, rescale the remaining segments, and repeat to get q_2, and so on until the segments are ordered up to q_n. I’ve attached the relevant section from Engen (1978) at the end.

Then these sticks are distributed as \mbox{Beta}(\kappa + 1, \alpha - \kappa i).

So I’m not sure how to do the permutation step, or how to justify leaving it out. It seems like the permutation would need to be done at every leapfrog step, but that does not seem compatible with HMC.

Maybe the difference is that this would work for clustering, where the cluster labels are interchangeable? I am working with species abundance data, where the species identities are not interchangeable.

G(\cdot) = \sum_{k=1}^{\infty} \omega_k \delta_{X_k^*}(\cdot),

where X_k^* \sim G_0 would be the mixture component parameters if this were a DP or PY mixture, with G_0 being the prior. The stickbreaking comes from defining

\omega_k = z_k \prod_{j=1}^{k-1} (1 - z_j), \qquad z_k \sim H_k,

where z_k is the stickbreaking proportion and H_k is a distribution on [0,1]. It is \mbox{Beta}(1, \alpha) for the Dirichlet process, for instance.

Now, I think your confusion comes from the fact that G(\cdot) does not depend on the ordering of the \omega_k: we can apply any permutation to the indices without changing the resulting measure. In particular, we can define the stickbreaking process to sample the \omega_k in size-biased order without affecting G.

Now, if you truncate the stickbreaking process by setting z_K = 1, the vector w = (\omega_1, \ldots, \omega_K) defines a point on the simplex with weights drawn according to the truncated Pitman-Yor process, but they will be in size-biased order.

Hmm, coming to the end of this, I think I agree with you that fully correcting for the size-biased ordering is not possible without summing over all permutations. If Stan allowed discrete sampling we could randomly sample permutations, but without that option we have to sum/integrate.

Okay, as I’ve been reading more about it, I’m not entirely sure I was correct. From the residual allocation model in [1], it appears that any stick-breaking law of the form \mbox{Beta}(a, b + i(1-a)) will be symmetric: if you have two subsequent breaks, with the first distributed \mbox{Beta}(a, b) and the second distributed \mbox{Beta}(a, b + 1 - a), then the joint law is symmetric.

The Dirichlet is recovered when a = 1, and the Pitman-Yor when a = 1 - \alpha and b = \theta.

So as you said originally, maybe it does not depend on the ordering, even though the joint law is presented as a sum over orderings [2]? Perhaps I should have stopped reading when I thought I understood what was going on.

Wow, that’s a much stronger result than I thought was available… and very counterintuitive: we are biasing by size, yet the joint distribution is not biased by size?

Truncation will mess that up to a degree though.

edit: also, truncation now seems increasingly problematic. Ordinarily you rely on the weights becoming smaller as the stickbreaking procedure continues, but with this result, isn’t the last stick equally likely to be large?

Is there something one could do with an ordered simplex variable? (Stan has an ordered variable type, which could be used for such a purpose.)

I realize an ordered simplex is not exactly what is wanted here – but I wonder if we might be able to get rather close to the desired truncated representation.

Yes, it seems unintuitive to me as well! We are biasing by size, since there is less stick to break off at each step and the second parameter of the beta distribution increases at each step.

I think you are correct that truncation messes this up. The p + q < 1 constraint means that the penultimate stick cannot be swapped with the last stick.

However, it would mean that you only need to sum over the N possible last segments rather than the N! size-biased permutations. It doesn’t quite match my intuition, but as far as I can tell it does seem to be what the math says…

I thought about something like this, but thought it would be difficult to identify. Now that I have the term for it, it does look like there is some literature on these types of models. Any pointers on where to get started? My sense is that these models would marginalize to something quite similar to what Mike was suggesting?

Theorem 2 in the Pitman paper seems to have some kind of result on finite-dimensional random measures, but I don’t know whether it corresponds to truncation directly.

“Nested Dirichlet” isn’t a technical term. I’ve seen “Dirichlet compound multinomial” used in one place. One paper where it appears is:

Doyle, Gabriel, and Charles Elkan. “Accounting for burstiness in topic models.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 281-288. ACM, 2009.

In general, my sense is that Gibbs sampling really shines in this particular area (PYP). But writing the code takes a long time.

Stan only requires proper posteriors. There’s nothing to check that the posterior is proper, but Stan tends to run off the cliff when the posterior’s improper, so we tend to get very early diagnostics in the form of parameters with posterior means of +/- 1e+300.

When I wrote the transform for the simplex, I set it up so that (0, …, 0) (K - 1 terms) on the unconstrained scale translates to (1/K, …, 1/K) (K terms) on the constrained scale.
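For reference, here is a sketch of that transform as a Stan function, following the logit stick-breaking form described in the Stan reference manual (this is my reconstruction, not the actual source):

```stan
functions {
  // Stick-breaking simplex constraint with a log-odds offset chosen
  // so that y = 0 maps to the uniform simplex (1/K, ..., 1/K).
  vector simplex_from_unconstrained(vector y) {
    int K = num_elements(y) + 1;
    vector[K] x;
    real rest = 1;
    for (k in 1:(K - 1)) {
      // Offset log(1 / (K - k)): at y[k] = 0 this breaks off
      // exactly 1 / (K - k + 1) of the remaining stick.
      real z = inv_logit(y[k] - log(K - k));
      x[k] = z * rest;
      rest -= x[k];
    }
    x[K] = rest;
    return x;
  }
}
```

At y = 0, the first break takes 1/K of the stick, the second takes 1/(K - 1) of the remaining (K - 1)/K, and so on, so every component ends up at 1/K.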

Truncation messes up the terms that go to 1/epsilon as epsilon → 0 in the limit, but otherwise you’ll never get more clusters than data points, so you can set the truncation beyond your number of data points to get a conservative estimate. That just might lead to challenging computation.

You can look at the logistic multivariate normal, or you can do the same thing with a Student-t. It also lets you control covariance; if you don’t want that, you can simplify computation by making the covariance diagonal:

z ~ multi_student_t(nu, mu, Sigma);
theta = softmax(z);

Of course, if the covariance is Sigma = diag_matrix(sigma), then this can be implemented much more efficiently as:

z ~ student_t(nu, rep_vector(0, K), sigma);

where sigma is the overall scale of variation of the log odds and nu is the degrees of freedom controlling the heaviness of the Student-t’s tails. It’s even more efficient if sigma = rep_vector(tau, K), because then the scalar tau can be used directly in the prior for z.
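Putting the pieces together, a full model along these lines might look like the following; the multinomial observation model, the data names, and the hyperpriors on tau and nu are my additions for illustration:

```stan
// Sketch: logistic Student-t prior on a simplex with a single scale
// tau. The count likelihood and hyperpriors are illustrative.
data {
  int<lower=2> K;
  array[K] int<lower=0> y;   // e.g. species counts
}
parameters {
  vector[K] z;
  real<lower=0> tau;         // common scale of the log odds
  real<lower=1> nu;          // Student-t degrees of freedom
}
transformed parameters {
  simplex[K] theta = softmax(z);
}
model {
  z ~ student_t(nu, 0, tau);   // vectorized; scalar location and scale
  tau ~ normal(0, 2);          // half-normal via the lower bound
  nu ~ gamma(2, 0.1);
  y ~ multinomial(theta);
}
```

Note that softmax leaves z identified only up to an additive constant, so in practice one might pin one component of z to zero or add a soft sum-to-zero constraint.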