Log-Simplex Constraints

I’m working with extremely high-dimensional Dirichlet and Multinomial distributions, which requires large simplexes with many extremely small values. As a result, my modeling is far more stable and efficient in log space, and I’ve been working on implementing my own log-simplex-type constraint inspired by Martin Modrak’s softmax transform, here.

Happy to post my derivation if it helps, but when I re-derive the log-Jacobian adjustment using the log-softmax version of Martin’s approach, I end up with the following function that converts from unconstrained to log-simplex space:

vector logsoftmax_transform_jacobian(vector z) {
    // Computes the log-softmax of the vector z (with a 0 appended) and increments
    // the jacobian accumulator with the log absolute Jacobian determinant of the transformation.

    // Append '0' to the vector z
    int K = size(z) + 1;
    vector[K] z_extended = append_row(z, 0);

    // Normalize the extended vector on the log scale
    vector[K] y = z_extended - log_sum_exp(z_extended);

    // Log-Jacobian determinant correction
    jacobian += y[K];

    return y;
}
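
For concreteness, here’s a minimal sketch of how I’m calling it (the names and the data-block K are just illustrative, and it assumes a Stan version recent enough to support the jacobian += statement and _jacobian-suffixed functions):

functions {
  // logsoftmax_transform_jacobian() as defined above
}
data {
  int<lower=2> K;
}
parameters {
  vector[K - 1] z;
}
transformed parameters {
  // log_theta lies on the log simplex: exp(log_theta) sums to 1
  vector[K] log_theta = logsoftmax_transform_jacobian(z);
}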

As expected, sampling is much more efficient when I stay in the log space; however, I’m finding that the last value of my y vector, above, has severe convergence issues (rhat > 1.5).

Rather than a “roll your own” log-simplex transform, I’ve also been playing around with the transforms provided in this repository, linked to by this other forum post. When using the ExpandedSoftmax or ILR versions of the transform, the convergence issues disappear, so I’m quite confident that the convergence issues I’m seeing come from my log-softmax transform and not somewhere else in my code.

I’m also quite sure my derivation of the log-softmax Jacobian adjustment is correct (again, though, happy to post it if it helps); so, my question now is why does that approach not work while the ExpandedSoftmax or ILR approaches do?

Kind of a related question, too: By my calculations, the unnormalized PDF of a random variable whose exponentiation is Dirichlet-distributed (i.e., P(\text{log_theta} | \alpha), assuming constant \alpha) is just the dot product of that variable and the \alpha parameter vector after accounting for the Jacobian adjustment. Yet the definition of log_dirichlet_lpdf found in the same repo as the above transforms has an extra -log_theta[N] term. Where’s this coming from? I noticed in the earlier-linked forum post that, at least for the example listed, this can be ignored so long as you also ignore the log_x term in the log-simplex transform. Is this the case for all transforms listed in this repo? That is, if I’m using the inv_ilr_log_simplex_constrain_lp function defined here, can I also drop the target += log_x[N] term so long as I don’t have -log_theta[N] in my exponential-Dirichlet PDF?


The right person to answer might be @spinkney, as he just spearheaded an effort to change the default simplex transformation in Stan from the stick breaking transform to the ILR.

My (quite handwavey) understanding is that the behavior you observe is a known issue with stick breaking, where the transform can exert a lot of “pressure” on the final elements, since when they are being processed all the preceding elements are essentially fixed.


I also second a request for native support for log simplex/Dirichlet!


The issue is that by setting the last value to 0 all the other values are forced to vary more strongly to accommodate the simplex constraint. The sampler struggles when the curvature of the region varies strongly and different parameterizations induce stronger or weaker curvature for the sampler to explore. The expanded softmax and ILR approaches are meant to distribute the simplex constraint evenly across every value to make sampling easier.

The change of variables to the log space induces this extra log_x[N] term, and we put it in explicitly to note it. When converting the simplex distribution to a log-simplex distribution there will be an extra -log_theta[N]. Fortunately, this cancels with the +log_theta[N] term from the mapping of theta from unconstrained Euclidean space to the simplex.

In short, you can just leave this off in both the parameterization and the distribution knowing that it cancels out.


Thanks, @spinkney.

The issue is that by setting the last value to 0 all the other values are forced to vary more strongly to accommodate the simplex constraint. The sampler struggles when the curvature of the region varies strongly and different parameterizations induce stronger or weaker curvature for the sampler to explore. The expanded softmax and ILR approaches are meant to distribute the simplex constraint evenly across every value to make sampling easier.

That part makes sense.

When converting the simplex distribution to a log-simplex distribution there will be an extra -log_theta[N]. Fortunately, this cancels with the +log_theta[N] term from the mapping of theta from unconstrained Euclidean space to the simplex.

I obviously haven’t done the derivations for the ExpandedSoftmax or ILR approaches, so I’ll take your word for the Jacobian adjustment there :) I still don’t understand where the log_theta[N] is coming from for the log-simplex form of the Dirichlet distribution, though. Here’s my derivation of the Jacobian correction for it:

The transform and inverse-transform relative to Dirichlet-distributed simplex \mathbf{x} are as follows, respectively:

\begin{align} f(\mathbf{x}) &= \ln{\mathbf{x}} = \mathbf{y} \\ f^{-1}(\mathbf{y}) &= \text{e}^{\mathbf{y}} = \mathbf{x}. \end{align}

The Jacobian correction of the inverse function can be calculated as follows:

\begin{align} J_{f^{-1}} = \begin{bmatrix} \frac{\partial x_1}{\partial y_1} & \dotsm & \frac{\partial x_1}{\partial y_K} \\ \vdots & \ddots & \vdots \\ \frac{\partial x_K}{\partial y_1} & \dotsm & \frac{\partial x_K}{\partial y_K} \end{bmatrix} =\begin{bmatrix} \text{e}^{y_1} & 0 & \dotsm & 0 \\ 0 & \text{e}^{y_2} & \dotsm & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dotsm & \text{e}^{y_K} \end{bmatrix} \\ \end{align}

The determinant of a diagonal matrix is just the product of the diagonal elements, so the Jacobian adjustment is

\begin{align} \left|\text{det }J_{f^{-1}}(\mathbf{y})\right| = \prod_{k=1}^K\text{e}^{y_k} \end{align}

Putting everything together, the probability for the exponential-Dirichlet distribution is

\begin{align} P_{\mathbf{y}}(\mathbf{y} | \boldsymbol{\alpha}) = P_{\mathbf{x}}(\text{e}^\mathbf{y} | \boldsymbol{\alpha}) \prod_{k=1}^K\text{e}^{y_k}, \end{align}

which, on the log scale, gives us

\begin{align} \ln{\left(P_{\mathbf{y}}(\mathbf{y} | \boldsymbol{\alpha})\right)} &= \ln{\left(P_{\mathbf{x}}(\text{e}^\mathbf{y} | \boldsymbol{\alpha}) \prod_{k=1}^K\text{e}^{y_k}\right)} \\ &= \ln{\left(\frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^{K} (\text{e}^{y_k})^{\alpha_k - 1}\right)} + \ln{\left(\prod_{k=1}^K\text{e}^{y_k}\right)} \\ &= \sum_{k=1}^K\ln{\left(\text{e}^{y_k(\alpha_k - 1)}\right)} - \ln{(B(\boldsymbol{\alpha}))} + \sum_{k=1}^K \ln{\text{e}^{y_k}} \\ &= \sum_{k=1}^K y_k\alpha_k - y_k + y_k - \ln{(B(\boldsymbol{\alpha}))} \\ &= \sum_{k=1}^K y_k\alpha_k - \ln{(B(\boldsymbol{\alpha}))} \end{align}

I’m not seeing where the -log_theta[N] term comes from here.

You left out the simplex constraint. The map

\begin{align} f(\mathbf{x}) &= \ln{\mathbf{x}} = \mathbf{y} \\ f^{-1}(\mathbf{y}) &= \text{e}^{\mathbf{y}} = \mathbf{x} \end{align}

has an additional constraint, namely that \sum_i x_i = 1 and 0 \lt x_i \lt 1. The standard square Jacobian does not apply.

Let

x_j \;=\;e^{z_j},\quad j=1,\dots ,N-1, \qquad x_N \;=\;1-\sum_{j=1}^{N-1} e^{z_j}.

Define g:\mathbb R^{N-1}\to\mathbb R^{N} by g(\mathbf z)=\mathbf x.

The N\times(N-1) Jacobian is

J(\mathbf z)= \begin{bmatrix} \operatorname{diag}(x_1,\dots ,x_{N-1})\\[2pt] -x_1\; \dots\; -x_{N-1} \end{bmatrix}, \qquad x_j=e^{z_j}.

The volume adjustment is given by the square-root of the Gram determinant

|J|_{+}\;=\;\bigl\lvert J^{\!\top}J\bigr\rvert^{1/2} \;=\;\sqrt{N}\;\prod_{j=1}^{N-1}x_j \;=\;\sqrt{N}\;\exp\!\Bigl(\sum_{j=1}^{N-1}z_j\Bigr).
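
To spell out that determinant step: with D = \operatorname{diag}(x_1,\dots ,x_{N-1}) and \mathbf x_{-N} = (x_1,\dots ,x_{N-1})^\top, the matrix determinant lemma gives

\begin{align} J^{\top}J &= D^{2} + \mathbf x_{-N}\mathbf x_{-N}^{\top}, \\ \bigl\lvert J^{\top}J \bigr\rvert &= \bigl\lvert D^{2} \bigr\rvert \left(1 + \mathbf x_{-N}^{\top} D^{-2} \mathbf x_{-N}\right) = \Bigl(\prod_{j=1}^{N-1} x_j^{2}\Bigr)\bigl(1 + (N-1)\bigr) = N \prod_{j=1}^{N-1} x_j^{2}, \end{align}

whose square root is the \sqrt{N}\,\prod_{j=1}^{N-1}x_j above.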

From the Dirichlet we have

f_{\mathbf X}(\mathbf x) =\frac{\Gamma(\alpha_0)} {\prod_{i=1}^{N}\Gamma(\alpha_i)} \prod_{i=1}^{N}x_i^{\alpha_i-1}, \qquad \alpha_0=\sum_{i=1}^{N}\alpha_i .

This is written in terms of all N simplex coordinates. However, we know that there are only N - 1 “free” coordinates, which we can express as

\begin{align} x_i & =\exp(z_i)\;(i\le N-1) \\ x_N &=1-\sum_{i=1}^{N-1} e^{z_i} \end{align}

Plugging in we get

f_{\mathbf Z}(\mathbf z) = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{N}\Gamma(\alpha_i)} \;\sqrt{N}\; \left[\prod_{i=1}^{N-1}e^{(\alpha_i-1)z_i}\right] \left(1-\!\sum_{j=1}^{N-1}e^{z_j}\right)^{\alpha_N-1} \! \exp\!\Bigl(\sum_{i=1}^{N-1}z_i\Bigr)\\

Combining terms and taking the log gives us the log PDF

\boxed{\; \log f_{\mathbf Z}(\mathbf z) = \ldots +\tfrac12\log N +\sum_{i=1}^{N-1} \alpha_i z_i +(\alpha_N-1)\, \log\!\Bigl(1-\sum_{i=1}^{N-1}e^{z_i}\Bigr) \;\quad}

Then we highlight that

\underbrace{\tfrac12\log N}_{\text{constant}} \;+\; \underbrace{\log\!\Bigl(1-\sum_{i=1}^{N-1}e^{z_i}\Bigr) }_{\text{ = }\log x_N} \; .

Then we have

\begin{aligned} \log f_{\mathbf Z}(\mathbf z) &= \ldots +\tfrac12\log N +\sum_{i=1}^{N-1} \alpha_i z_i + \alpha_N \log\!\Bigl(1-\sum_{i=1}^{N-1}e^{z_i}\Bigr) - \log\!\Bigl(1-\sum_{i=1}^{N-1}e^{z_i}\Bigr) \\ &= \ldots +\tfrac12\log N +\sum_{i=1}^{N-1} \alpha_i z_i + \alpha_N \log x_N - \log x_N \; . \end{aligned}

The code for log_dirichlet_lpdf in that repo is

/**
 * Return the Dirichlet density for the specified log simplex.
 *
 * @param log_theta a vector on the log simplex (N rows)
 * @param alpha prior counts plus one (N rows)
 */
real log_dirichlet_lpdf(vector log_theta, vector alpha) {
  int N = rows(log_theta);
  if (N != rows(alpha))
    reject("Input must contain same number of elements as alpha");
  return dot_product(alpha, log_theta) - log_theta[N]
         + lgamma(sum(alpha)) - sum(lgamma(alpha));
}

We see that the dot_product term includes all N elements, and all that is left over is the -log_theta[N].
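
To make the correspondence with the boxed expression explicit (writing z_i = \log x_i and dropping the lgamma terms, which are just the log normalizing constant):

\begin{align} \texttt{dot\_product(alpha, log\_theta)} - \texttt{log\_theta[N]} &= \sum_{i=1}^{N} \alpha_i \log x_i - \log x_N \\ &= \sum_{i=1}^{N-1} \alpha_i z_i + (\alpha_N - 1)\log x_N, \end{align}

which matches the boxed log density up to the constant \tfrac12\log N.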


I was working under the assumption that the initial transform from unconstrained to constrained space handled the necessary adjustments for the simplex constraint, but this makes sense. Thank you!

Though, it does raise another question that I’ve not considered before: As you show, the log_theta[N] term appears because the Dirichlet distribution is defined only for simplices. However, many other distributions are also defined only over specific regions of space (for example, the exponential distribution is undefined for y < 0), yet we don’t make any analogous adjustments when transforming them. What’s fundamentally different here? Is it less about the constraints and more about the co-dependence of the Dirichlet variables?

Apologies for the rabbit hole. I’m just trying to round out my understanding here.

All this gets into measure theory that I can bumble around in, but I’m by no means an expert. Any mapping across different dimensions adds additional complexity. The simplex constraint must continue to be satisfied even after transforming to log space, so we need to take it into account when adjusting the probability distribution or measure. One way to put this is that the machinery you see here for the log simplex is just more apparent than in simpler cases. Let’s take your exponential case, for instance. Stan samples on the unconstrained Euclidean space \theta \in \mathbb R, which carries the ordinary Lebesgue measure. To get to the positive reals we take any bijective map from the real line to the positive reals. Stan uses the x = \exp(y) function, and the Jacobian is just 1 x 1: J = dx/dy = e^{y} = x.

The exponential log density is defined on \mathbb R_+

\log f(x \mid \lambda) = \log \bigl(\lambda\,e^{-\lambda x}\bigr) \;= \log \lambda - \lambda x

But since we’re sampling on the unconstrained Euclidean space, we need to adjust the density with the change of variables given by

\log \left | \frac{d x}{d y} \right | = \log \left | \frac{d }{d y} \exp(y) \right | = y = \log(x)

which must be added to the density to give

\log f(x \mid \lambda) = \log \lambda - \lambda x + y

What if we had the exponential density for \log(x)?

So that,

X \sim \operatorname{Exp}(\lambda), \quad Y = \log(X)

Add the change of variables

f_Y(y) = \lambda e^y e^{-\lambda e^y} \\ \log f_Y(y) = \log \lambda + y - \lambda e^y,

You see how the y needs to be there? Similarly for the log Dirichlet.
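
Here’s a minimal Stan sketch of that log-scale parameterization, just to make it concrete (lambda passed as data, names purely illustrative):

data {
  real<lower=0> lambda;
}
parameters {
  real y;  // y = log(x), unconstrained
}
model {
  // log f_Y(y) = log(lambda) + y - lambda * exp(y)
  target += log(lambda) + y - lambda * exp(y);
  // equivalently: target += exponential_lpdf(exp(y) | lambda) + y;
}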

I think the part that’s tripping me up is that it feels a bit like “double counting” to adjust the Jacobian for the simplex bounds both when transforming from the unconstrained vector space to log-simplex space and when adjusting the Dirichlet PDF for the change of variables. By the time we’ve transformed to the log-simplex space, we’ve already accounted for the bounds, so why do they appear again in the correction going from log-simplex to simplex?

Put a different way, in the exponential example you provided, the way I think about it is that the “+y” comes from changing variables from x to y and doesn’t have anything to do with the fact that x must be greater than 0. If we used a different transformation than e^y to go from unconstrained to constrained space (10^y, for example), we would end up with a different correction term, but the constraint that x be greater than 0 would nonetheless be satisfied. If the bounds were coming into play explicitly in the correction, shouldn’t we see some constant term in the Jacobian correction for them regardless of how we perform our transformation?

Put yet one more different way: To me, the simplex bounds are just part of the definition of the Dirichlet distribution; they’re implicit to it, just like the bound that x > 0 for the exponential. For instance, when we use an exponential distribution, even though it is mathematically valid to put in a negative value for x (I’ll get a number out, it just won’t have probabilistic meaning), we define the exponential distribution to not be supported there. Similarly, it is mathematically valid to put any vector with positive elements into a Dirichlet distribution (again, I’ll get a number, it just won’t be meaningful), but we define the Dirichlet distribution to not be supported there. The bounds are less about the space we’re operating in and more about how we define the distribution, so it doesn’t feel like they should be considered in a Jacobian correction unless we’re explicitly enforcing them by transforming the space in some way.

Fascinating discussion that I’m not sure I entirely follow. I have a couple of questions, because I also work with high-dimensional simplexes fairly often and have been really interested in the ILR transform.

If I follow your math correctly, why is the \frac{1}{2} \log N constant added to the Jacobian in inv_ilr_log_simplex_constrain_lp but dropped in log_dirichlet_lpdf?

Would the Jacobian adjustments in the inv_ilr_log_simplex_constrain_lp be needed if you put a prior on the N - 1 elements of the simplex directly rather than using a Dirichlet prior? For example, what if you wanted to do something like a multinomial-ILR directly?

data {
  int N;
  array[N] int x;
}

parameters {
  vector[N - 1] theta_raw;
}

transformed parameters {
  vector[N] log_p = inv_ilr_log_simplex_constrain_lp(theta_raw);
}

model {
  //---- prior ----\\
  theta_raw ~ std_normal();

  //---- poor man's multinomial_lupmf ----\\
  target += dot_product(to_vector(x), log_p);
}

I’ve wrapped x in to_vector() since I don’t think dot_product will take an integer array together with a vector directly, but I’m just trying to keep it simple.

Just a quick comment: for me confusion about Jacobian adjustments usually clears up when thinking about CDFs. For example, with random variables X, Y = f(X) and f an increasing bijection, you recover the univariate Jacobian adjustment like this:

F_Y(s) = \mbox{Pr}(Y \leq s) = \mbox{Pr}(f(X) \leq s) = \mbox{Pr}(X \leq f^{-1}(s)) = F_X(f^{-1}(s))

No magic so far!

So due to the chain rule, the density of Y has to be:

f_Y(x) = \frac{\partial F_Y(x)}{\partial x} = \frac{\partial F_X(f^{-1}(x))}{\partial x} = f_X(f^{-1}(x))\,\frac{\partial f^{-1}(x)}{\partial x}

(and you then take the log). So the Jacobian adjustment arises from taking the derivative of the CDF (which we can directly transform) to get a PDF.

The logic for multivariate PDFs/CDFs is a bit more tricky, but it is the same thing. The Jacobian is not magic; it’s just what happens when you take the derivative, due to the chain rule.
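
As a concrete tie-in to the exponential example earlier in the thread: with X \sim \operatorname{Exp}(\lambda) and Y = \log X, so f^{-1}(y) = e^{y},

F_Y(y) = F_X(e^{y}) = 1 - e^{-\lambda e^{y}}, \qquad f_Y(y) = \frac{\partial F_Y(y)}{\partial y} = \lambda e^{-\lambda e^{y}}\, e^{y},

which is exactly the density derived above, with the e^{y} factor appearing from the chain rule.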


The issue comes down to which base measure you’re working in. Yes, the Dirichlet and the Exponential distributions are defined on their constrained spaces. Since we start in the unconstrained Euclidean space, we have to adjust to go into these spaces. Similarly, when you go from these constrained spaces back to the Lebesgue measure (the base measure in Stan), you adjust again.

It’s an artifact of using the Gram determinant, and since it’s a constant it can safely be dropped in both places.

No, putting priors on the free params allows you to get away without doing the adjustment. In fact, you can just declare a sum-to-zero vector and then do a log softmax to get the inv-ILR simplex (see the sketch below). The sum-to-zero vector is now built in and its Jacobian is 0, so no penalty there, and we have a log_softmax() function built in as well.
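
A minimal sketch of that built-in route (assuming a Stan version where sum_to_zero_vector and log_softmax are available; the prior is just a placeholder):

data {
  int<lower=2> N;
}
parameters {
  sum_to_zero_vector[N] z;  // built-in zero-sum vector, no Jacobian penalty
}
transformed parameters {
  vector[N] log_p = log_softmax(z);  // a point on the log simplex
}
model {
  z ~ normal(0, 1);  // placeholder prior on the zero-sum coordinates
}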

The inv-ILR simplex will be the built-in parameterization for simplexes in the next release, 2.37. Add an issue to the math GitHub to request that the log output be built in; it’s actually not much work, so there’s a good chance it could get in after 2.37.


Thinking about it from the CDF point of view helps a lot with the intuition, so thank you!

If I’m understanding things correctly now, perhaps another intuitive way to think about why we consider the support bounds explicitly when modeling e^y as Dirichlet-distributed but not when modeling it as Exponential is that the transform implicitly captures the bounds of the exponential’s support (e^y is always greater than 0) but does not implicitly capture the bounds of the Dirichlet’s support (an exponentiated vector need not sum to 1). We thus have to explicitly consider the bounds in our transform.

Thanks for the insight. I’ve been leaving the Jacobian off in my implementations because I had thought this was the case. I was aware of the sum-to-zero vector + log softmax approach, but for my use case the inv_ilr_log_simplex_constrain_lp function (without the Jacobian) has been more useful. This is because I’m dealing with row-simplex matrices where each row has a potentially different length simplex, because I know some elements of a row are zero. These are effectively high-dimensional hidden Markov models where the transition matrix is (usually) upper triangular.

My solution has been to declare the raw parameters as a vector and assign normal priors to the elements (basically each row is multi-logit). So for an N x N upper triangular row-simplex matrix we have a size \binom{N}{2} vector of free parameters that gets sliced and log-ILR transformed.

That’s probably more detail than warranted for this thread, but it’s prelude to me saying that the ILR transform code and threads like these on the forum have been invaluable to my work. I really appreciate the work you’ve done in both development and explanation. Thank you!

If you or anyone is interested, see this in-development repo for more details on my use case: Columbia River Research Lab - Quantitative Fisheries Ecology Section / Temporally Stratified Space-for-Time CJS · GitLab


Have to second Dalton’s praise. I was thrilled to find the transform repository and log versions of Dirichlet. Super helpful for my work!


That’s great, and I love hearing about how people are using Stan. I try to give back to researchers with these contributions, so it’s always nice to hear about people’s successes.

On the inv_ilr_log_simplex function: in the next release of Stan, coming out soon, you can write it as below. This updates the suffix to _jacobian and uses the newly exposed constraint transform function sum_to_zero_constrain.

vector inv_ilr_log_simplex_constrain_jacobian(vector y) {
  int N = rows(y) + 1;
  vector[N] z = sum_to_zero_constrain(y);
  real r = log_sum_exp(z);
  return z - r;
}
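
And a minimal, self-contained usage sketch (purely illustrative; it assumes a Stan version with sum_to_zero_constrain exposed, and any prior on the raw vector is up to you):

functions {
  vector inv_ilr_log_simplex_constrain_jacobian(vector y) {
    int N = rows(y) + 1;
    vector[N] z = sum_to_zero_constrain(y);
    real r = log_sum_exp(z);
    return z - r;
  }
}
data {
  int<lower=2> N;
}
parameters {
  vector[N - 1] y_raw;
}
transformed parameters {
  vector[N] log_theta = inv_ilr_log_simplex_constrain_jacobian(y_raw);
}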