Log-Simplex Constraints

Thanks, @spinkney.

> The issue is that by setting the last value to 0 all the other values are forced to vary more strongly to accommodate the simplex constraint. The sampler struggles when the curvature of the region varies strongly and different parameterizations induce stronger or weaker curvature for the sampler to explore. The expanded softmax and ILR approaches are meant to distribute the simplex constraint evenly across every value to make sampling easier.

That part makes sense.

> When converting the simplex distribution to a log-simplex distribution there will be an extra -log_theta[N] term. Fortunately, this cancels out with the log_theta[N] term from the mapping of theta from the unconstrained Euclidean vector space to the simplex space.

I obviously haven’t done the derivations for the ExpandedSoftmax or ILR approaches, so I’ll take your word for the Jacobian adjustment there :) But I still don’t understand where the log_theta[N] is coming from for the log-simplex form of the Dirichlet distribution. Here’s my derivation of the Jacobian correction for it:

The transform and inverse transform relative to a Dirichlet-distributed simplex \mathbf{x} are, respectively:

\begin{align} f(\mathbf{x}) &= \ln{\mathbf{x}} = \mathbf{y} \\ f^{-1}(\mathbf{y}) &= \text{e}^{\mathbf{y}} = \mathbf{x}. \end{align}
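To make the pair concrete, here's a small numeric sketch (the simplex point is an arbitrary example, not from the post): \ln maps a simplex point into \mathbb{R}^K, and \exp maps it back.

```python
import numpy as np

# Example simplex point (sums to 1); values are illustrative only.
x = np.array([0.2, 0.3, 0.5])
y = np.log(x)        # forward transform f(x)
x_back = np.exp(y)   # inverse transform f^{-1}(y)

# Roundtrip recovers x; note y itself is unconstrained, and exp(y)
# sums to 1 only because x did.
print(np.allclose(x_back, x))  # True
```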

The Jacobian correction of the inverse function can be calculated as follows:

\begin{align} J_{f^{-1}}(\mathbf{y}) = \begin{bmatrix} \frac{\partial x_1}{\partial y_1} & \dotsm & \frac{\partial x_1}{\partial y_K} \\ \vdots & \ddots & \vdots \\ \frac{\partial x_K}{\partial y_1} & \dotsm & \frac{\partial x_K}{\partial y_K} \end{bmatrix} = \begin{bmatrix} \text{e}^{y_1} & 0 & \dotsm & 0 \\ 0 & \text{e}^{y_2} & \dotsm & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dotsm & \text{e}^{y_K} \end{bmatrix} \end{align}
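One can sanity-check that the Jacobian really is diagonal with entries e^{y_k} by comparing against central finite differences (the point and step size below are arbitrary choices for illustration):

```python
import numpy as np

def inverse_transform(y):
    # x = exp(y), the inverse transform from above
    return np.exp(y)

y = np.log(np.array([0.2, 0.3, 0.5]))  # example point, y = ln(x)
analytic_J = np.diag(np.exp(y))        # claimed Jacobian: diag(e^{y_k})

# Build the numeric Jacobian column by column via central differences.
eps = 1e-6
K = y.size
numeric_J = np.empty((K, K))
for j in range(K):
    e = np.zeros(K)
    e[j] = eps
    numeric_J[:, j] = (inverse_transform(y + e) - inverse_transform(y - e)) / (2 * eps)

print(np.max(np.abs(numeric_J - analytic_J)) < 1e-6)  # True
```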

The determinant of a diagonal matrix is just the product of the diagonal elements, so the Jacobian adjustment is

\begin{align} \left|\text{det }J_{f^{-1}}(\mathbf{y})\right| = \prod_{k=1}^K\text{e}^{y_k} \end{align}
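Numerically, for the example point used above, the determinant identity checks out (and, equivalently, the log-Jacobian adjustment is just \sum_k y_k):

```python
import numpy as np

y = np.log(np.array([0.2, 0.3, 0.5]))  # example point
J = np.diag(np.exp(y))                 # diagonal Jacobian from above
det_J = np.linalg.det(J)

# det of a diagonal matrix = product of the diagonal entries
print(np.isclose(det_J, np.prod(np.exp(y))))  # True
# on the log scale, the adjustment is simply sum(y)
print(np.isclose(np.log(det_J), y.sum()))     # True
```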

Putting everything together, the probability density for the exponential-Dirichlet distribution is

\begin{align} P_{\mathbf{y}}(\mathbf{y} | \boldsymbol{\alpha}) = P_{\mathbf{x}}(\text{e}^\mathbf{y} | \boldsymbol{\alpha}) \prod_{k=1}^K\text{e}^{y_k}, \end{align}

which, on the log scale, gives us

\begin{align} \ln{\left(P_{\mathbf{y}}(\mathbf{y} | \boldsymbol{\alpha})\right)} &= \ln{\left(P_{\mathbf{x}}(\text{e}^\mathbf{y} | \boldsymbol{\alpha}) \prod_{k=1}^K\text{e}^{y_k}\right)} \\ &= \ln{\left(\frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^{K} (\text{e}^{y_k})^{\alpha_k - 1}\right)} + \ln{\left(\prod_{k=1}^K\text{e}^{y_k}\right)} \\ &= \sum_{k=1}^K\ln{\left(\text{e}^{y_k(\alpha_k - 1)}\right)} - \ln{(B(\boldsymbol{\alpha}))} + \sum_{k=1}^K \ln{\text{e}^{y_k}} \\ &= \sum_{k=1}^K \left(y_k\alpha_k - y_k + y_k\right) - \ln{(B(\boldsymbol{\alpha}))} \\ &= \sum_{k=1}^K y_k\alpha_k - \ln{(B(\boldsymbol{\alpha}))} \end{align}
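For what it's worth, the algebra above checks out numerically: a quick sketch comparing the change-of-variables form \ln P_{\mathbf{x}}(\text{e}^{\mathbf{y}}) + \sum_k y_k against the derived closed form \sum_k y_k \alpha_k - \ln B(\boldsymbol{\alpha}) (the \boldsymbol{\alpha} and \mathbf{x} values are arbitrary examples):

```python
import numpy as np
from math import lgamma

alpha = np.array([1.5, 2.0, 3.0])  # example concentration parameters
x = np.array([0.2, 0.3, 0.5])      # example simplex point
y = np.log(x)

# log B(alpha) = sum(lgamma(alpha_k)) - lgamma(sum(alpha))
log_B = sum(lgamma(a) for a in alpha) - lgamma(alpha.sum())

# Dirichlet log-density at x, then the change-of-variables correction sum(y)
log_p_x = ((alpha - 1) * np.log(x)).sum() - log_B
lhs = log_p_x + y.sum()

# Derived closed form: sum_k y_k * alpha_k - log B(alpha)
rhs = (y * alpha).sum() - log_B

print(np.isclose(lhs, rhs))  # True
```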

I’m not seeing where the -log_theta[N] term comes from here.