Theorem 2 in the Pitman paper seems to have a result on finite-dimensional random measures, but I don’t know whether it corresponds to truncation directly.

# Two parameter distribution over the simplex

**ariddell**#22

“nested Dirichlet” isn’t a technical term. I’ve seen “Dirichlet Compound Multinomial” used in one place. One paper where it appears is:

Doyle, Gabriel, and Charles Elkan. “Accounting for burstiness in topic models.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 281–288. ACM, 2009.

In general, my sense is that Gibbs sampling really shines in this particular area (PYP). But writing the code takes a long time.

I hope you find something that works.

**Bob_Carpenter**#23

Stan only requires proper posteriors. There’s nothing that checks whether the posterior is proper, but Stan tends to run off the cliff when the posterior’s improper, so we tend to get very early diagnostics in the form of parameters with posterior means of +/- 1e+300.

When I wrote the transform for the simplex, I set it up so that (0, …0) (K - 1 terms) on the unconstrained scale translates to (1/K, …, 1/K) (K terms) on the constrained scale.
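That centering can be seen in a minimal Python sketch of the stick-breaking simplex transform, assuming the `log(K - k)` offset that Stan applies to each unconstrained value (the function name here is illustrative, not Stan's):

```python
import math

def simplex_transform(y):
    """Map K-1 unconstrained values to a K-simplex via stick-breaking.
    The log(K - k) offset centers each break so y = 0 maps to (1/K, ..., 1/K)."""
    K = len(y) + 1
    x = []
    stick = 1.0  # remaining stick length
    for k, y_k in enumerate(y):
        # inverse logit with an offset that puts the break at 1/(K - k) when y_k = 0
        z_k = 1.0 / (1.0 + math.exp(-(y_k - math.log(K - k - 1))))
        x.append(stick * z_k)
        stick -= x[-1]
    x.append(stick)  # the last coordinate takes whatever stick is left
    return x
```

With four zeros in, the five coordinates out are all 1/5, matching the description above.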

Truncation messes up the terms going to 1 / epsilon as epsilon -> 0 in the limit, but otherwise you’ll never get more clusters than data points, so you can set the truncation level beyond your number of data points to get a conservative bound. That just might lead to challenging computation.
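The truncation idea can be sketched as follows. This is a hypothetical Python illustration, not code from the thread; `alpha` is the Dirichlet-process concentration, and `K` is the truncation level, which per the advice above you'd set larger than the number of data points:

```python
import random

def truncated_stick_breaking(alpha, K, seed=0):
    """Draw mixture weights from a Dirichlet process truncated at K sticks."""
    rng = random.Random(seed)
    weights = []
    stick = 1.0  # remaining stick length
    for _ in range(K - 1):
        # Beta(1, alpha) draw for the fraction of the remaining stick to break off
        b = rng.betavariate(1.0, alpha)
        weights.append(stick * b)
        stick *= 1.0 - b
    weights.append(stick)  # lump all remaining mass into the last bin
    return weights
```

With `alpha` small, most of the mass lands in the first few bins, so a truncation level above the number of data points wastes little.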

**Bob_Carpenter**#24

You can look at the logistic multivariate normal, or you can do the same thing with a Student-t. It also lets you control covariance, but if you don’t want that, you can simplify computation by making the covariance diagonal:

```
z ~ multi_student_t(nu, mu, Sigma);
theta = softmax(z);
```

Of course, if the covariance is `Sigma = diag_matrix(sigma)`, then this can be implemented much more efficiently as:

```
z ~ student_t(nu, rep_vector(0, K), sigma);
```

where `sigma` is the overall scale of variation of the log odds and `nu` is the degrees of freedom controlling the dispersion of the Student-t. It’s even more efficient if `sigma = rep_vector(tau, K)`, because then a scalar can be used for `sigma` in the prior for `z`.
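As a sanity check of the construction, here is a Python sketch (rather than Stan) of the independent-t version with a shared scale `tau`, matching the last simplification; the function names and the normal/chi-square construction of the t draws are illustrative assumptions:

```python
import math
import random

def softmax(z):
    """Map log odds to a point on the simplex."""
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logistic_student_t_simplex(nu, tau, K, seed=0):
    """Draw K independent Student-t log odds with scale tau, then softmax."""
    rng = random.Random(seed)
    z = []
    for _ in range(K):
        # Student-t via N(0,1) / sqrt(chi2_nu / nu), scaled by tau (integer nu)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
        z.append(tau * rng.gauss(0.0, 1.0) / math.sqrt(chi2 / nu))
    return softmax(z)
```

Every draw lands on the simplex, and smaller `nu` produces heavier-tailed log odds, hence more extreme (burstier) simplex points.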