Jacobian of softmax transformation of an (n-1 degrees of freedom) unbounded parameter

If I’m not mistaken, you are referring to the fact that prop must have K-1 degrees of freedom. I have implemented the non-overparametrised model, but we still get non-convergence.

Is it possible we are making some trivial mistake in the implementation of the Jacobian?

functions{
	// Append a K-th element so that the returned vector sums to zero
	vector sum_to_zero(vector v){
		int K = rows(v)+1;
		vector[K] v_0;
		
		v_0[1:(K-1)] = v;
		v_0[K] = -sum(v_0[1:(K-1)]);
		
		return(v_0);
	}
}
transformed data{
  int K = 4;
  vector[K] alpha = [1,2,3,4]';
  matrix[K,K] ident = diag_matrix(rep_vector(1,K));
}
parameters{  vector[K-1] prop; }
transformed parameters {
  simplex[K] prop_soft = softmax(sum_to_zero(prop));
}
model{
  prop_soft ~ dirichlet(alpha);
  target += log_determinant(diag_matrix(prop_soft) - prop_soft*prop_soft');
}

After all, if I ignore the Jacobian the model “kind of” works (obviously with some divergences).

functions{
	vector sum_to_zero(vector v){
		int K = rows(v)+1;
		vector[K] v_0;
		
		v_0[1:(K-1)] = v;
		v_0[K] = -sum(v_0[1:(K-1)]);
		
		return(v_0);
	}
}
transformed data{
  int K = 4;
  vector[K] alpha = [1,2,3,4]';
}
parameters{  vector[K-1] prop; }
transformed parameters {  simplex[K] prop_soft = softmax(sum_to_zero(prop));  }
model{    prop_soft ~ dirichlet(alpha);    }


P.S. One of the goals of the unconstrained + Jacobian implementation is to allow this model

transformed data{    vector[4] alpha = [1,2,3,4]';     }
parameters{	         simplex[4] prop;                  }
model{               prop ~ dirichlet(alpha);          }

to be more amenable to variational Bayes, with the prop parameter being more symmetric/normal-like.


Something is definitely wrong in the first implementation – it looks like the adaptation is failing. It could be numerical problems. You’ll have to follow up with careful investigation of the diagnostics and Hamiltonian Monte Carlo statistics.


@andrjohns Actually, it’s strange that:

The first implementation you proposed looked like this:

jacobian[i,j] = prop[n,i] * (ident[i,j] - prop[n,j]);

In the second, the ident matrix disappeared:

diag_matrix(prop_soft) - prop_soft*prop_soft'

Is it possible that a typo happened?

No, they’re mathematically identical. The second form is just a more concise expression.

Consider the 2-element case as an example:

diag(s) - ss^T

Expands to:

\begin{bmatrix}s_1 & 0 \\ 0 & s_2 \end{bmatrix} - \begin{bmatrix}s_1^2 & s_1s_2 \\ s_2s_1 & s_2^2 \end{bmatrix}
= \begin{bmatrix}s_1 - s_1^2 & -s_1s_2 \\ -s_2s_1 & s_2 - s_2^2 \end{bmatrix}

Which is the same as:

\begin{bmatrix}s_1 (1-s_1) & s_1(0-s_2) \\s_2(0-s_1) & s_2(1 - s_2) \end{bmatrix}

Which is what the loop with an identity matrix was expressing.
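
To make the equivalence concrete, here is a minimal Stan sketch (the function names are mine, purely for illustration) showing that both forms build the same K x K matrix:

functions{
	// Loop form: J[i, j] = s[i] * (ident[i, j] - s[j])
	matrix softmax_jacobian_loop(vector s){
		int K = rows(s);
		matrix[K, K] ident = diag_matrix(rep_vector(1, K));
		matrix[K, K] J;
		for(i in 1:K) {
			for(j in 1:K) {
				J[i, j] = s[i] * (ident[i, j] - s[j]);
			}
		}
		return J;
	}
	// Concise matrix form: diag(s) - s * s'
	matrix softmax_jacobian_concise(vector s){
		return diag_matrix(s) - s * s';
	}
}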


With my limited knowledge of Jacobians, it is hard to even “poke” at the statement to see which part could cause such an abrupt failure.

After all, it’s just one line of code :)

Could you please point me to a reference where the above is described? I will try to investigate a bit better and maybe try Stack Exchange.

Sure, there are two references I used for this:

Thanks a lot,

This model “works” (with some divergences caused by extreme correlation of parameters, which I will try to address).

There was a possible mistake here: prop_soft*prop_soft' instead of prop_soft'*prop_soft.

functions{
	vector sum_to_zero(vector v){
		int K = rows(v)+1;
		vector[K] v_0;
		
		v_0[1:(K-1)] = v;
		v_0[K] = -sum(v_0[1:(K-1)]);
		
		return(v_0);
	}
}
transformed data{
  int K = 4;
  vector[K] alpha = [1,2,3,4]';
  matrix[K,K] ident = diag_matrix(rep_vector(1,K));
}
parameters{  vector[K-1] prop; }
transformed parameters {
  simplex[K] prop_soft = softmax(sum_to_zero(prop));
}
model{
  prop_soft ~ dirichlet(alpha);
  //target += log_determinant(diag_matrix(prop_soft) - prop_soft'*prop_soft);
}

There was a possible mistake here: prop_soft*prop_soft' instead of prop_soft'*prop_soft.

Unfortunately that’s not the case. prop_soft*prop_soft' gives a K x K matrix, whereas prop_soft'*prop_soft gives a 1 x 1 matrix:

The notation in the link above used s^Ts because they were treating the softmax as returning a row-vector.
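
A quick way to see the dimension difference in terms of Stan types (just a sketch; the function names are hypothetical):

functions{
	// vector * row_vector: a K x K outer product, which is what the Jacobian expression needs
	matrix outer_form(vector s){
		return s * s';
	}
	// row_vector * vector: a scalar, i.e. the 1 x 1 inner product
	real inner_form(vector s){
		return s' * s;
	}
}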


OK, what is puzzling me is that if I use this model with no Jacobian, and with the QR approach to an N-1 degrees-of-freedom vector (Test: Soft vs Hard sum-to-zero constrain + choosing the right prior for soft constrain):

functions{
// Precompute coefficients of the QR-based sum-to-zero transform
vector Q_sum_to_zero_QR(int N) {
    vector [2*N] Q_r;

    for(i in 1:N) {
      Q_r[i] = -sqrt((N-i)/(N-i+1.0));
      Q_r[i+N] = inv_sqrt((N-i) * (N-i+1));
    }
    return Q_r;
  }

  // Map an (N-1)-vector of free parameters to an N-vector that sums to zero
  vector sum_to_zero_QR(vector x_raw, vector Q_r) {
    int N = num_elements(x_raw) + 1;
    vector [N] x;
    real x_aux = 0;

    for(i in 1:N-1){
      x[i] = x_aux + x_raw[i] * Q_r[i];
      x_aux = x_aux + x_raw[i] * Q_r[i+N];
    }
    x[N] = x_aux;
    return x;
  }
}
transformed data{
  int K = 4;
  vector[K] alpha = [1,2,3,4]';
  matrix[K,K] ident = diag_matrix(rep_vector(1,K));
  
  vector[2*K] Q_r = Q_sum_to_zero_QR(K);
  real x_raw_sigma = inv_sqrt(1 - inv(K));
}
parameters{  vector[K-1] prop; }
transformed parameters {
  simplex[K] prop_soft = softmax(sum_to_zero_QR(prop, Q_r));
}
model{
  prop_soft ~ dirichlet(alpha);
  prop ~ normal(0, x_raw_sigma);
}

fit2 = stan(model_code=stancode, cores=4, iter = 10000, warmup = 1000)

I get the same marginal posterior of the simplex as from the source model

transformed data{    vector[4] alpha = [1,2,3,4]';     }
parameters{	         simplex[4] prop;                  }
model{               prop ~ dirichlet(alpha);          }

fit = stan(model_code = stancode, cores = 4, iter = 10000, warmup = 1000)

tidybayes::spread_draws(fit2, prop_soft[N]) %>% 
    select(N,prop_soft) %>% 
    arrange(N,prop_soft) %>%
    mutate(i = 1:n()) %>% 
    left_join(
        tidybayes::spread_draws(fit, prop[N]) %>% 
            select(N,prop) %>%
            arrange(N,prop) %>%
            mutate(i = 1:n()) 
    ) %>% 
    ggplot(aes(prop, prop_soft)) + 
    geom_point() + 
    facet_wrap(~N) + 
    geom_abline()

[figure: prop vs prop_soft scatterplots, one panel per simplex component]

pairs plot, softmax with no Jacobian:

[figure]

pairs plot, source model:

[figure]

Then, when I add the simple Jacobian, everything goes crazy.


Actually, I can see a bit of curvature compared to the comparison of two samplings of the source simplex model:

[figure]

However the curvature seems rather small.


Just a small study highlighting the consequences of omitting the Jacobian correction. Summary: I notice that for the model stated above, it really makes a difference for K = 2. But maybe I am missing some important nuances of the Jacobian correction.

For K = 2 (x-axis is the simplex from a native, non-transformed model):

[figure]

For K = 3 (x-axis is the simplex from a native, non-transformed model):

[figure]

For K = 4 (x-axis is the simplex from a native, non-transformed model):

[figure]

However, for K = 3 we can see curvature if alpha = [1,1,1]:

[figure]

Anyway, this was just a personal experiment that might be interesting for some.

@andrjohns I found this Jacobian formula (https://arxiv.org/pdf/1712.09936.pdf). Could you please confirm whether you think it is the same as the one you know?

I have too many doubts about the notation to really have a clear idea.

Yep it’s the same. Noting first that the \odot symbol is used to represent elementwise multiplication, we can expand the operands on either side:

s\mathbf{1}^T = \begin{bmatrix}s_1 \\ s_2 \end{bmatrix} \cdot \begin{bmatrix}1 & 1 \end{bmatrix} = \begin{bmatrix}s_1 & s_1 \\ s_2 & s_2 \end{bmatrix}
(I - s\mathbf{1}^T)^T = \left(\begin{bmatrix}1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix}s_1 & s_1 \\ s_2 & s_2 \end{bmatrix}\right)^T
= \left(\begin{bmatrix}1-s_1 & -s_1 \\ -s_2 & 1-s_2 \end{bmatrix}\right)^T
= \begin{bmatrix}1-s_1 & -s_2 \\ -s_1 & 1-s_2 \end{bmatrix}

Then:

s\mathbf{1}^T \odot (I - s\mathbf{1}^T)^T = \begin{bmatrix}s_1 & s_1 \\ s_2 & s_2 \end{bmatrix} \odot \begin{bmatrix}1-s_1 & -s_2 \\ -s_1 & 1-s_2 \end{bmatrix}
= \begin{bmatrix}s_1 - s_1^2 & -s_1s_2 \\ -s_2s_1 & s_2 - s_2^2 \end{bmatrix}

Which is the same as the formulas/methods above.
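
For completeness, a sketch of the paper’s elementwise form in Stan (the function name is mine; \odot becomes the elementwise operator .*):

functions{
	// (s 1^T) .* (I - s 1^T)^T, which equals diag(s) - s * s'
	matrix softmax_jacobian_elementwise(vector s){
		int K = rows(s);
		matrix[K, K] S = s * rep_row_vector(1, K);            // s 1^T: the column s repeated K times
		matrix[K, K] ident = diag_matrix(rep_vector(1, K));
		return S .* (ident - S)';
	}
}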


@betanalpha is it possible that the whole mistake comes from the fact that we are mapping a K-1 vector to a K vector? Although it is not the softmax that does that per se, I thought I would just make sure.

Our case, I assume, is number 2 (one-to-many; i.e., a K-1 vector to a K simplex).

One-to-many refers to the behavior of the transformation but still presumes that the dimensions of the input and output spaces are the same. The concept of a Jacobian determinant, and the corresponding transformation of density functions, is not well-defined when trying to map between spaces of different dimensions.

With a simplex constraint one has to map from K - 1 variables to K - 1 variables (for example starting with K variables but fixing one to a constant, say 0 or minus the sum of the others) or map K variables into K variables and then explicitly marginalize one of those excess variables to map down to K - 1 variables (what we do in Stan).

Hi! This is an old thread, but I’m trying to evaluate transforms this summer with Bob, and I was just making a DIY softmax with Stan; my code turns out to be similar to the one in this thread. I’m confused about what would be a good starting point for thinking about the Kth variable; stick-breaking for the simplex uses 1 minus the sum over the others, I think. Are there advantages to explicitly marginalizing it?

One subtlety with these implementations is that even if the constrained space is symmetric, the constraining and unconstraining transformations need not be. For example with the stick-breaking construction of the simplex one has to choose an arbitrary ordering of the latent variables. This is also true for formally-identified softmax constructions.

Typically a softmax-simplex is identified by fixing one of the variables to a fixed value, usually 0 because that admits a nice probabilistic meaning in the right circumstances. In other words one introduces not K unconstrained variables but rather K - 1 variables and defines the first K - 1 states of the simplex as

y_{k} = \frac{ \exp(x_{k}) }{ \exp(0) + \sum_{k' = 1}^{K - 1} \exp(x_{k'}) }

and the last state as

y_{K} = \frac{ \exp(0) }{ \exp(0) + \sum_{k' = 1}^{K - 1} \exp(x_{k'}) }.

As in the stick-breaking case the meaning of “first” and “last” here relies on the choice of an arbitrary ordering.
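
A minimal Stan sketch of this identified construction (the block layout and variable names are mine; K would be supplied as data, and any density placed on y still needs the appropriate Jacobian adjustment discussed earlier in this thread):

data{
	int<lower=2> K;
}
parameters{
	vector[K-1] x;    // only K - 1 free coordinates; the K-th is fixed to 0
}
transformed parameters{
	// y[k] = exp(x[k]) / (exp(0) + sum of exp(x)), and y[K] = exp(0) / (exp(0) + sum of exp(x))
	simplex[K] y = softmax(append_row(x, 0));
}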

Without this state fixing the resulting simplex will be non-identified – an infinite number of input configurations will lead to the same output simplex configuration. In particular all of the input states can be translated by a fixed amount without changing the simplex variables,

\begin{align*} y_{k} &= \frac{ \exp(x_{k}) }{ \sum_{k' = 1}^{K } \exp(x_{k'}) } \\ &= 1 \cdot \frac{ \exp(x_{k}) }{ \sum_{k' = 1}^{K } \exp(x_{k'}) } \\ &= \frac{ \exp(\alpha)}{\exp(\alpha)} \cdot \frac{ \exp(x_{k}) }{ \sum_{k' = 1}^{K} \exp(x_{k'}) } \\ &= \frac{ \exp(x_{k} + \alpha) }{ \sum_{k' = 1}^{K} \exp(x_{k'} + \alpha) }. \end{align*}

Likelihood functions that constrain the y_{k} will then result in non-identified likelihood functions for the latent x_{k}, and terrible computational consequences.

At this point there isn’t any probabilistic structure that would allow us to marginalize one of the x_{k} to try to avoid this problem (which is why fixing one of the x_{k} is most common). In order to consider marginalization one would need some explicit probabilistic structure.

For example if

\pi(x_{1}, \ldots, x_{K}) = \prod_{k = 1}^{K} \text{gamma}(x_{k}; 1, 1)

and

y_{k} = \frac{ x_{k} }{ \sum_{k' = 1}^{K} x_{k'} }

then \pi(y_{1}, \ldots, y_{K}) is uniform over the simplex. Integrating out one of the x_{k} is nontrivial, however, because that marginalization has to be propagated through the definition of the y_{k}. I have not seen this kind of approach implemented anywhere before.
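
For reference, a minimal Stan sketch of the (unmarginalized) gamma-ratio construction described above; this keeps all K latent variables rather than marginalizing one out, and the block layout and names are mine. With alpha set to a vector of ones it reproduces the uniform-simplex example, and more generally it pushes forward to a Dirichlet(alpha) simplex:

data{
	int<lower=2> K;
	vector<lower=0>[K] alpha;
}
parameters{
	vector<lower=0>[K] g;    // K positive latent variables, none marginalized out
}
transformed parameters{
	simplex[K] y = g / sum(g);
}
model{
	g ~ gamma(alpha, 1);     // y is then Dirichlet(alpha) distributed
}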


Thanks for clearing this up, this is really helpful; I was somewhat missing the intuition about identifiability earlier.

I have not seen this kind of approach implemented anywhere before.

Did you mean that for the simplex/softmax/this particular transform, or in general? Since you mentioned that’s what we do in Stan (I understand how the transform works from the docs, but I find the C++ somewhat difficult to follow, so I’m asking).

Let’s be careful to separate out two similar but distinct concepts:

Transformations map points in one space to points in another, in this case an unconstrained space to a constrained space. For example, in Stan we use the stick-breaking transformation to map a K-simplex to K - 1 unconstrained real variables.

A pushforward distribution quantifies how a transformation warps a probability distribution over the input space (or representations of a probability distribution like samples or a probability density function) to give a corresponding probability distribution over the output space. For example the transformation

y_{k} = \frac{ x_{k} }{ \sum_{k' = 1}^{K} x_{k'} }

pushes forward the density function over the input space

\pi(x_{1}, \ldots, x_{K}) = \prod_{k = 1}^{K} \text{gamma}(x_{k}; 1, 1)

to a uniform probability density function over the output space,

\pi(y_{1}, \ldots, y_{K}) \propto \text{constant}.

Critically, transformations can be defined in the context of input and output spaces alone. In other words they can be applied to any Stan program. A pushforward distribution, however, has to be defined in the context of an input space, an output space, and a probability distribution on the input space to be well defined.

This is all to say that I have seen the y_{k} = x_{k}/ \sum_{k' = 1}^{K} x_{k'} transformation used in statistical applications before, but only in the particular instance of pushing forward a product-of-gammas density function into a Dirichlet density function. In other words it’s used just as a way of implementing a Dirichlet density function over a simplex, but not more general models over simplices. That said, it fundamentally suffers from the identifiability issue discussed above, so even though it’s mathematically appropriate it’s terrible in practice without some additional heuristics to temper the non-identifiability.

If we’re talking about general statistics or probabilistic programming applications, however, then I haven’t seen y_{k} = x_{k}/ \sum_{k' = 1}^{K} x_{k'} used to implement simplices. All of the applications I have encountered have used the stick breaking construction or constructions equivalent to it (such as the hypersphere transformation I introduced way back in the day, [1010.3436] Cruising The Simplex: Hamiltonian Monte Carlo and the Dirichlet Distribution).