Jacobian of softmax transformation of an (n-1 degrees of freedom) unbounded parameter

@andrjohns I found this Jacobian formula (https://arxiv.org/pdf/1712.09936.pdf). Could you please confirm whether you think it is the same as the one you know?

I have too many doubts about the notation to really have a clear idea.

Yep it's the same. Noting first that the \odot symbol is used to represent elementwise multiplication, we can expand the operands on either side:

s\mathbf{1}^T = \begin{bmatrix}s_1 \\ s_2 \end{bmatrix} \cdot \begin{bmatrix}1 & 1 \end{bmatrix} = \begin{bmatrix}s_1 & s_1 \\ s_2 & s_2 \end{bmatrix}
(I - s\mathbf{1}^T)^T = \left(\begin{bmatrix}1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix}s_1 & s_1 \\ s_2 & s_2 \end{bmatrix}\right)^T
= \left(\begin{bmatrix}1-s_1 & -s_1 \\ -s_2 & 1-s_2 \end{bmatrix}\right)^T
= \begin{bmatrix}1-s_1 & -s_2 \\ -s_1 & 1-s_2 \end{bmatrix}

Then:

s\mathbf{1}^T \odot (I - s\mathbf{1}^T)^T = \begin{bmatrix}s_1 & s_1 \\ s_2 & s_2 \end{bmatrix} \odot \begin{bmatrix}1-s_1 & -s_2 \\ -s_1 & 1-s_2 \end{bmatrix}
= \begin{bmatrix}s_1-s_1^2 & -s_1s_2 \\ -s_2s_1 & s_2-s_2^2 \end{bmatrix}

This is the same as the formulas/methods above.
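As a quick numerical sanity check of this formula, here is a standalone Stan sketch (a hypothetical program and data; since it has no parameters it can be run with the fixed_param method) that builds the analytic Jacobian in the equivalent form \text{diag}(s) - s s^T and compares it against central finite differences:

functions {
  // analytic softmax Jacobian: J[i, j] = s[i] * (delta_ij - s[j]), i.e. diag(s) - s s'
  matrix softmax_jacobian(vector x) {
    int K = num_elements(x);
    vector[K] s = softmax(x);
    return diag_matrix(s) - s * s';
  }
}
data {
  int<lower=2> K;
  vector[K] x;
}
generated quantities {
  matrix[K, K] J_analytic = softmax_jacobian(x);
  matrix[K, K] J_fd;  // central finite-difference approximation
  {
    real h = 1e-6;
    for (j in 1:K) {
      vector[K] x_plus = x;
      vector[K] x_minus = x;
      vector[K] column;
      x_plus[j] += h;
      x_minus[j] -= h;
      column = (softmax(x_plus) - softmax(x_minus)) / (2 * h);
      for (i in 1:K) {
        J_fd[i, j] = column[i];
      }
    }
  }
}

For K = 2 both matrices match the 2 x 2 result above up to the finite-difference error.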


@betanalpha is it possible that the whole mistake comes from the fact that we are mapping a K-1 vector to a K vector? Although it is not the softmax that does that per se, I thought I would just make sure.

Our case I assume is number 2 (one-to-many; i.e. K-1 vector to K simplex).

One-to-many refers to the behavior of the transformation but still presumes that the dimensions of the input and output spaces are the same. The concept of a Jacobian determinant, and the corresponding transformation of density functions, is not well-defined when trying to map between spaces of different dimensions.

With a simplex constraint one has to map from K - 1 variables to K - 1 variables (for example starting with K variables but fixing one to a constant, say 0 or minus the sum of the others) or map K variables into K variables and then explicitly marginalize one of those excess variables to map down to K - 1 variables (what we do in Stan).

Hi! This is an old thread, but I'm trying to evaluate transforms this summer with Bob, and I was just making a DIY softmax with Stan; my code turns out to be similar to the one in this thread. I'm confused about what would be a good starting point for thinking about the Kth variable; for example, stick-breaking for the simplex uses 1 minus the sum of the others, I think. Are there advantages to explicitly marginalizing it?

One subtlety with these implementations is that even if the constrained space is symmetric, the constraining and unconstraining transformations need not be. For example with the stick-breaking construction of the simplex one has to choose an arbitrary ordering of the latent variables. This is also true for formally identified softmax constructions.

Typically a softmax-simplex is identified by fixing one of the variables to a constant value, usually 0, because that admits a nice probabilistic meaning in the right circumstances. In other words one introduces not K unconstrained variables but rather K - 1 variables and defines the first K - 1 states of the simplex as

y_{k} = \frac{ \exp(x_{k}) }{ \exp(0) + \sum_{k' = 1}^{K - 1} \exp(x_{k'}) }

and the last state as

y_{K} = \frac{ \exp(0) }{ \exp(0) + \sum_{k' = 1}^{K - 1} \exp(x_{k'}) }.

As in the stick-breaking case the meaning of ā€œfirstā€ and ā€œlastā€ here relies on the choice of an arbitrary ordering.

Without this state fixing the resulting simplex will be non-identified: an infinite number of input configurations will lead to the same output simplex configuration. In particular all of the input states can be translated by a fixed amount without changing the simplex variables,

\begin{align*} y_{k} &= \frac{ \exp(x_{k}) }{ \sum_{k' = 1}^{K } \exp(x_{k'}) } \\ &= 1 \cdot \frac{ \exp(x_{k}) }{ \sum_{k' = 1}^{K } \exp(x_{k'}) } \\ &= \frac{ \exp(\alpha)}{\exp(\alpha)} \cdot \frac{ \exp(x_{k}) }{ \sum_{k' = 1}^{K} \exp(x_{k'}) } \\ &= \frac{ \exp(x_{k} + \alpha) }{ \sum_{k' = 1}^{K} \exp(x_{k'} + \alpha) }. \end{align*}
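This invariance is easy to check numerically; a minimal sketch (a hypothetical standalone program, runnable with the fixed_param method) is:

data {
  int<lower=2> K;
  vector[K] x;
  real alpha;
}
generated quantities {
  // shifting every coordinate by the same constant leaves the softmax output
  // unchanged, so infinitely many inputs map to the same simplex point
  vector[K] y = softmax(x);
  vector[K] y_shifted = softmax(x + alpha);
  real max_abs_diff = max(fabs(y - y_shifted));  // ~0 up to floating point error
}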

Likelihood functions that constrain the y_{k} will then result in non-identified likelihood functions for the latent x_{k}, with terrible computational consequences.

At this point there isn't any probabilistic structure that would allow us to marginalize one of the x_{k} to try to avoid this problem (which is why fixing one of the x_{k} is most common). In order to consider marginalization one would need some explicit probabilistic structure.

For example if

\pi(x_{1}, \ldots, x_{K}) = \prod_{k = 1}^{K} \text{gamma}(x_{k}; 1, 1)

and

y_{k} = \frac{ x_{k} }{ \sum_{k' = 1}^{K} x_{k'} }

then \pi(y_{1}, \ldots, y_{K}) is uniform over the simplex. Integrating out one of the x_{k} is nontrivial, however, because that marginalization has to be propagated through the definition of the y_{k}. I have not seen this kind of approach implemented anywhere before.
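Just to make the construction itself concrete, here is a minimal Stan sketch (a hypothetical program, not a recommendation): it samples the x_{k} directly from independent gamma(1, 1) densities and normalizes them, so the pushforward distribution of y is uniform over the simplex; nothing is marginalized.

data {
  int<lower=2> K;
}
parameters {
  vector<lower=0>[K] x;
}
transformed parameters {
  // normalizing the positive variables yields a point on the simplex
  simplex[K] y = x / sum(x);
}
model {
  // product of independent gamma(1, 1) densities; its pushforward through
  // y = x / sum(x) is uniform over the simplex, i.e. Dirichlet(1, ..., 1)
  x ~ gamma(1, 1);
}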


Thanks for clearing this up, this is really helpful; I was missing the intuition about identifiability earlier.

I have not seen this kind of approach implemented anywhere before.

Did you mean that for the simplex/softmax/this particular transform, or in general? Since you mentioned that's what we do in Stan (I understand how the transform works from the docs, but I find the C++ somewhat difficult to follow, so I'm asking).

Let's be careful to separate out two similar but distinct concepts:

Transformations map points in one space to points in another, in this case an unconstrained space to a constrained space. For example in Stan we use the stick-breaking transformation to map a K simplex to K - 1 unconstrained real variables.

A pushforward distribution quantifies how a transformation warps a probability distribution over the input space (or representations of a probability distribution like samples or a probability density function) to give a corresponding probability distribution over the output space. For example the transformation

y_{k} = \frac{ x_{k} }{ \sum_{k' = 1}^{K} x_{k'} }

pushes forward the density function over the input space

\pi(x_{1}, \ldots, x_{K}) = \prod_{k = 1}^{K} \text{gamma}(x_{k}; 1, 1)

to a uniform probability density function over the output space,

\pi(y_{1}, \ldots, y_{K}) \propto \text{constant}.

Critically, transformations can be defined in the context of input and output spaces alone. In other words they can be applied to any Stan program. A pushforward distribution, however, has to be defined in the context of an input space, an output space, and a probability distribution on the input space to be well defined.

This is all to say that I have seen the y_{k} = x_{k} / \sum_{k' = 1}^{K} x_{k'} transformation used in statistical applications before, but only in the particular instance of pushing forward a product-of-gammas density function into a Dirichlet density function. In other words it's used just as a way of implementing a Dirichlet density function over a simplex, but not more general models over simplices. That said, it fundamentally suffers from the identifiability issue discussed above, so even though it's mathematically appropriate it's terrible in practice without some additional heuristics to temper the non-identifiability.

If we're talking about general statistics or probabilistic programming applications, however, then I haven't seen y_{k} = x_{k} / \sum_{k' = 1}^{K} x_{k'} used to implement simplices. All of the applications I have encountered have used the stick-breaking construction or constructions equivalent to it (such as the hypersphere transformation I introduced way back in the day, [1010.3436] Cruising The Simplex: Hamiltonian Monte Carlo and the Dirichlet Distribution).

Late to the party, but I'll add a specific answer, without having to call log_determinant (which is costly and introduces problems).

If we have \mathbf{v} \in \mathbb{R}^{K-1} and then prepend 0 as the first element, we obtain a K simplex \mathbf{x} \in [0,1]^K, \sum_{i=1}^K x_i = 1 via softmax as

s = 1 + \sum_{i=1}^{K-1} \exp (v_i), \quad x_1 = \frac{1}{s}, \quad x_k = \frac{\exp (v_{k - 1})}{s} \ \text{for} \ k = 2, \ldots, K.

We can then show that the Jacobian determinant is:

\frac{\exp \sum_{i=1}^{K - 1} v_i}{s^{K}}.

and so the log determinant is

\sum_{i=1}^{K - 1} v_i - K \log s.
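A compact way to see this (a sketch rather than the full derivation): take z = (x_2, \ldots, x_K)^T as the free coordinates, so that x_1 = 1 - \sum_{k=2}^{K} x_k is determined. Then

\frac{\partial x_{k+1}}{\partial v_{j}} = x_{k+1} \left( \delta_{kj} - x_{j+1} \right), \quad \text{i.e.} \quad J = \text{diag}(z) - z z^{T},

and the matrix determinant lemma gives

\det J = \left( \prod_{k=2}^{K} x_{k} \right) \left( 1 - \sum_{k=2}^{K} x_{k} \right) = \frac{\exp \sum_{i=1}^{K-1} v_{i}}{s^{K-1}} \cdot \frac{1}{s} = \frac{\exp \sum_{i=1}^{K-1} v_{i}}{s^{K}}.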

A full derivation can be found in Section 5 of https://arxiv.org/pdf/2211.02383.pdf, and an implementation of the transform as a Stan function is:

functions {
  vector simplex_constrain_softmax_lp(vector v) {
    int K = size(v) + 1;
    vector[K] v0 = append_row(0, v);
    // log absolute determinant of the softmax Jacobian
    target += sum(v) - K * log_sum_exp(v0);
    return softmax(v0);
  }
}
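A minimal usage sketch (a hypothetical model; the Dirichlet(1, ..., 1) target is just a placeholder to exercise the transform): because of the _lp suffix the function has to be called from the transformed parameters or model block so that its target increment is accumulated.

functions {
  // function from the post above, repeated so this sketch is self-contained
  vector simplex_constrain_softmax_lp(vector v) {
    int K = size(v) + 1;
    vector[K] v0 = append_row(0, v);
    target += sum(v) - K * log_sum_exp(v0);
    return softmax(v0);
  }
}
data {
  int<lower=2> K;
}
parameters {
  vector[K - 1] v;
}
transformed parameters {
  simplex[K] theta = simplex_constrain_softmax_lp(v);
}
model {
  // with the Jacobian adjustment inside the function, theta should be
  // distributed as Dirichlet(1, ..., 1), i.e. uniformly over the simplex
  theta ~ dirichlet(rep_vector(1.0, K));
}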