I’m working with extremely high-dimensional Dirichlet and multinomial distributions, which means working with large simplexes containing many very small values. As a result, my modeling is far more stable and efficient in log space, and I’ve been implementing my own log-simplex constraint inspired by Martin Modrak’s softmax transform, here.
Happy to post my derivation if it helps, but when I re-derive the log-Jacobian adjustment using the log-softmax version of Martin’s approach, I end up with the following function that converts from unconstrained to log-simplex space:
```stan
vector logsoftmax_transform_jacobian(vector z) {
  // Maps an unconstrained (K - 1)-vector z to a K-vector y on the log simplex
  // (i.e., exp(y) sums to 1) and increments the jacobian accumulator with the
  // log absolute determinant of the transform.
  int K = size(z) + 1;
  // Append 0 to z to pin down the redundant final coordinate
  vector[K] z_extended = append_row(z, 0);
  // Normalize on the log scale: y = log(softmax(z_extended))
  vector[K] y = z_extended - log_sum_exp(z_extended);
  // Log-Jacobian adjustment for the map z -> y[1:(K - 1)]
  jacobian += y[K];
  return y;
}
```
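For reference, the short version of that derivation: writing $\tilde z = (z_1, \dots, z_{K-1}, 0)$, $x = \operatorname{softmax}(\tilde z)$, and $y = \log x = \tilde z - \log \sum_{j=1}^{K} e^{\tilde z_j}$, and taking $y_{1:K-1}$ as the constrained coordinates,

$$
\frac{\partial y_k}{\partial z_j} = \delta_{kj} - x_j, \qquad 1 \le k, j \le K - 1 ,
$$

so the Jacobian matrix is $I_{K-1} - \mathbf{1}\, x_{1:K-1}^\top$ and, by the matrix determinant lemma,

$$
\det\!\left(I_{K-1} - \mathbf{1}\, x_{1:K-1}^\top\right) = 1 - \sum_{k=1}^{K-1} x_k = x_K ,
$$

giving the log-Jacobian adjustment $\log x_K = y_K$, i.e. the jacobian += y[K] above.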
As expected, sampling is much more efficient when I stay in log space; however, I’m finding that the last element of the y vector above has severe convergence issues (Rhat > 1.5).
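In case the surrounding context matters, here’s roughly how the transform sits in my model (a stripped-down sketch rather than my actual model; the function above goes in the functions block, and K and alpha stand in for my real data):

```stan
data {
  int<lower=2> K;
  vector<lower=0>[K] alpha;
}
parameters {
  vector[K - 1] z_raw;  // unconstrained
}
transformed parameters {
  // log-simplex: exp(log_theta) sums to 1; the jacobian += happens inside the function
  vector[K] log_theta = logsoftmax_transform_jacobian(z_raw);
}
model {
  // unnormalized exp-Dirichlet density on the log scale (see the question at the end)
  target += dot_product(alpha, log_theta);
}
```

In this sketch, log_theta[K] is the element with the bad Rhat.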
Rather than a “roll your own” log-simplex transform, I’ve also been playing around with the transforms provided in this repository, linked to from this other forum post. When I use the ExpandedSoftmax or ILR versions of the transform, the convergence issues disappear, so I’m quite confident the problem lies in my log-softmax transform and not somewhere else in my code.
I’m also quite sure my derivation of the log-softmax Jacobian adjustment is correct (the short version is above, and I’m happy to post the full thing if it helps), so my question is: why does that approach fail where the ExpandedSoftmax and ILR approaches succeed?
A somewhat related question, too: by my calculation, the unnormalized log-density of a random variable whose exponentiation is Dirichlet-distributed (i.e., $P(\text{log\_theta} \mid \alpha)$ for constant $\alpha$) is just the dot product of that variable with the $\alpha$ parameter vector, once the Jacobian adjustment is accounted for. Yet the definition of log_dirichlet_lpdf found in the same repo as the above transforms has an extra -log_theta[N] term. Where is this coming from? I noticed in the earlier-linked forum post that, at least for the example given, this term can be ignored so long as the log_x term in the log-simplex transform is also ignored. Is that the case for all of the transforms in this repo? That is, if I’m using the inv_ilr_log_simplex_constrain_lp function defined here, can I also drop the target += log_x[N] term so long as I don’t include -log_theta[N] in my exponential-Dirichlet PDF?
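To spell out the calculation I’m referring to (note that it treats the change of variables for all N components independently, which I suspect may be part of my confusion): with $\theta \sim \operatorname{Dirichlet}(\alpha)$ and $\phi = \log\theta$,

$$
\log p(\theta \mid \alpha) = \text{const} + \sum_{n=1}^{N} (\alpha_n - 1) \log \theta_n ,
$$

and each component contributes $\log\left|\mathrm{d}\theta_n / \mathrm{d}\phi_n\right| = \phi_n$ to the Jacobian adjustment, so

$$
\log p(\phi \mid \alpha) = \text{const} + \sum_{n=1}^{N} (\alpha_n - 1)\,\phi_n + \sum_{n=1}^{N} \phi_n = \text{const} + \alpha^\top \phi ,
$$

with no $-\phi_N$ (i.e., no -log_theta[N]) term anywhere.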