Log_dirichlet_rng(alpha) and multinomial_log_rng(log_theta)

Hi!

I’m working with a multinomial-Dirichlet model in which I transformed everything to the log form to address fitting issues. I have the _lpdf/_lpmf functions specified correctly, but had to specify the model in a way such that I can only keep their draws if I generate them with the _rng versions. How do you specify log_dirichlet_rng(alpha) and multinomial_log_rng(log_theta)?

This is what I have but I don’t know where to go from here, especially because I copied this from someone who figured out the math and don’t quite understand it myself. Thank you!

real multinomial_log_lpmf(array[] int y, vector log_theta) {
  int N = sum(y);
  int K = num_elements(log_theta);
  real lp = lgamma(N + 1);
  for (k in 1:K) {
    lp += log_theta[k] * y[k] - lgamma(y[k] + 1);
  }
  return lp;
}
real log_dirichlet_lpdf(vector log_theta, vector alpha) {
  int N = rows(log_theta);
  if (N != rows(alpha)) {
    reject("Input must contain same number of elements as alpha");
  }
  return dot_product(alpha, log_theta) - log_theta[N]
         + lgamma(sum(alpha)) - sum(lgamma(alpha));
}

The RNGs are easy to define as you don’t need to worry about changes of variables.

vector logdirichlet_rng(vector alpha) {
  return log(dirichlet_rng(alpha));
}

array[] int multinomial_log_rng(vector log_theta, int N) {
  // multinomial_rng needs the total count N and returns an integer array
  return multinomial_rng(exp(log_theta), N);
}
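Called from generated quantities, that looks roughly like this (a sketch: K, alpha, and N_new, the total count for the replicated draw, are placeholder names assumed to be declared elsewhere):

generated quantities {
  // posterior draw of a log simplex and replicated counts
  vector[K] log_theta_sim = logdirichlet_rng(alpha);
  array[K] int y_sim = multinomial_log_rng(log_theta_sim, N_new);
}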

If those RNGs wind up not being stable, you can do the same kinds of tricks, but you probably don’t care so much about overflow/underflow here because the results aren’t going to feed into anything else. If you need more stability in the logdirichlet_rng, you can probably code it via gammas and then convert everything to the log scale.
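For example, something like the following untested sketch: draw the gammas that define a Dirichlet, take logs, and normalize with log_sum_exp so the normalization never leaves the log scale (the function name is mine; for very small alpha the individual gamma draws can still underflow to zero, so this only helps so much).

vector logdirichlet_rng_via_gamma(vector alpha) {
  int K = rows(alpha);
  vector[K] log_g;
  for (k in 1:K) {
    // a Dirichlet(alpha) draw is a vector of Gamma(alpha[k], 1) draws divided by their sum
    log_g[k] = log(gamma_rng(alpha[k], 1));
  }
  // normalize on the log scale: log(g / sum(g)) = log_g - log_sum_exp(log_g)
  return log_g - log_sum_exp(log_g);
}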

But I just realized after trying to define a logdirichlet distribution that we can’t easily constrain the log of the simplex to be such that its exponentiation sums to 1. Let’s see what happens after working through the math.

For the N-dimensional Dirichlet on the probability simplex scale, we have for \theta \in \Delta^{N-1} (a simplex in N coordinates) and \alpha \in (0, \infty)^N,

\displaystyle \textrm{Dirichlet}(\theta \mid \alpha) = \frac{\Gamma(\text{sum}(\alpha))}{\prod_{n=1}^N \Gamma(\alpha_n)} \cdot \prod_{n=1}^N \theta_n^{\ \alpha_n - 1}.

Now suppose we want to do a change of variables. We start with our simplex-valued random variable \Theta and define a new random variable Z = \log \Theta. We can then use the change-of-variables formula to work out the density of Z using the inverse of \log(), namely \exp(), as

\displaystyle p_Z(z \mid \alpha) = \text{Dirichlet}(\exp(z) \mid \alpha) \cdot \left| J_{\exp}(z) \right|,

where J_{\exp}(z) is the Jacobian (derivative matrix) of the inverse transform \exp() evaluated at z, and the bars denote the absolute value of the determinant. Here we have to constrain z such that \exp(z) \in \Delta^{N-1} (i.e., it’s a simplex). But this part’s going to be hard to implement. Let’s put that aside for a second.

Because the Jacobian is diagonal, this works out to

\displaystyle \left| J_{\exp}(z) \right| = \prod_{n=1}^N \left| \frac{\text{d}}{\text{d} z_n} \exp(z_n)\right| = \prod_{n=1}^N \exp(z_n),

so that on the linear scale, we have

\displaystyle p_Z(z \mid \alpha) = \frac{\Gamma(\text{sum}(\alpha))}{\prod_{n=1}^N \Gamma(\alpha_n)} \cdot \prod_{n=1}^N \exp(z_n)^{\ \alpha_n - 1} \cdot \prod_{n=1}^N \exp(z_n).

Now we can just take a log and get

\displaystyle \log p_Z(z \mid \alpha) = \log \Gamma(\text{sum}(\alpha)) - \sum_{n=1}^N \log \Gamma(\alpha_n) + \sum_{n=1}^N (\alpha_n - 1) \cdot z_n + \sum_{n=1}^N z_n.

We can now cancel the -z_n against the +z_n to be left with this:

\displaystyle \log p_Z(z \mid \alpha) = \log \Gamma(\text{sum}(\alpha)) - \sum_{n=1}^N \log \Gamma(\alpha_n) + \sum_{n=1}^N \alpha_n \cdot z_n.

I tried this in the following Stan program, which does not work. It will define the right density function, but not the right constrained parameter. Also, it looks like you had a stray -log_theta[N] term in your density definition (or I messed up some algebra somewhere). Working out the correct density in Stan code, you get the following, but it won’t sample properly.

INCORRECT STAN PROGRAM

functions {
  real logdirichlet_lpdf(vector log_theta, vector alpha) {
    return dot_product(alpha, log_theta)
      + lgamma(sum(alpha)) - sum(lgamma(alpha));
  }
}
data {
  int<lower=0> N;
  vector<lower=0>[N] alpha;
}
parameters {
  vector<upper=0>[N] log_theta;  // UPPER = 0 NOT SUFFICIENT!!!
}
model {
  log_theta ~ logdirichlet(alpha);
}
generated quantities {
  vector[N] theta = exp(log_theta);  // should be a simplex
  real sum_theta = exp(log_sum_exp(log_theta));  // should be 1 every iteration
}

This isn’t going to work because we haven’t constrained log_theta so that sum(exp(log_theta)) = 1 as we would get if log_theta is the logarithm of a simplex. To make this work, you’d need to work out an unconstraining transform for a log simplex. It could probably be done, but it’s a lot of work. You need to take one of the simplex transforms, such as our stick-breaking procedure, and keep the whole thing on the log scale and work out the transformed Jacobian adjustments.

That’s funny—I was going to ping @spinkney, who is much better at this kind of math than I am. I believe he has some transforms working with Jacobian adjustments on the log scale.

I worked it out last night on paper. You can stay completely on the log scale with the stick-breaking transform we use. Assume x \in \mathbb{R}^N. We’ll define y = f(x) such that y represents the first N values of an (N + 1)-dimensional simplex, with the full simplex being given by taking the last element to be one minus the sum of the first N elements.

\displaystyle \begin{array}{rcl} y_1 & = & \text{logit}^{-1}(x_1) \\ y_2 & = & (1 - y_1) \cdot \text{logit}^{-1}(x_2) \\ y_3 & = & (1 - y_1 - y_2) \cdot \text{logit}^{-1}(x_3) \\ & \vdots & \\ y_N & = & (1 - y_1 - \cdots - y_{N-1}) \cdot \text{logit}^{-1}(x_{N}) \end{array}

and then

y_{N + 1} = (1 - y_1 - \cdots - y_{N})

for the last element. f has a triangular Jacobian, so it’s just

\displaystyle \left| J_f \right| = \prod_{n=1}^N \frac{\partial}{\partial x_n} y_n

or on the log scale,

\displaystyle \log \left| J_f \right| = \sum_{n=1}^N \log \frac{\partial}{\partial x_n} y_n.

Now we can instead transform using g(x) = \log f(x). Each diagonal term of the Jacobian picks up an extra factor of 1 / y_n (so an extra -\log y_n on the log scale), which I’ll leave as an exercise, but the thing to note is that the computation goes as follows.

\displaystyle \begin{array}{rcl} \log y_n & = & \log \left( (1 - y_1 - \cdots - y_{n-1}) \cdot \text{logit}^{-1}(x_n) \right) \\ & = & \textrm{log1p}(-y_1 - \cdots - y_{n-1}) + \log \text{logit}^{-1}(x_n) \\ & = & \textrm{log1p}(-y_1 - \cdots - y_{n-1}) + \log \dfrac{1}{1 + \exp(-x_n)} \\ & = & \textrm{log1p}(-y_1 - \cdots - y_{n-1}) - \log (1 + \exp(-x_n)) \\ & = & \textrm{log1p}(-y_1 - \cdots - y_{n-1}) - \text{logSumExp}(0, -x_n) \end{array}

The operations here are all arithmetically stable. The Jacobian adjustment is just the sum of the log derivatives of the individual transforms, which is just the derivative of the second term, because x_n doesn’t show up in the first.

\displaystyle \begin{array}{lcl} \log \dfrac{\text{d}}{\text{d}x} \text{logit}^{-1}(x) & = & \log \left( \text{logit}^{-1}(x) \cdot (1 - \text{logit}^{-1}(x)) \right) \\ & = & \log \left( \text{logit}^{-1}(x) \right) + \textrm{log1p}\!\left(-\text{logit}^{-1}(x)\right) \\ & = & -\text{logSumExp}(0, -x) + \left( -x - \text{logSumExp}(0, -x) \right) \\ & = & -x - 2 \cdot \text{logSumExp}(0, -x) \end{array}

I would double check my algebra! There’s a lot of sign flipping going on.
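To make that concrete, here is a minimal, untested Stan sketch of the log-scale stick-breaking map. The function names are mine; I accumulate the remaining stick with log_diff_exp rather than log1p of a running linear-scale sum, and the Jacobian function is my own working of the exercise above, so double check it before relying on it.

functions {
  // Map unconstrained x in R^N to the log of an (N + 1)-dimensional simplex
  // via stick breaking, staying on the log scale throughout.
  vector log_simplex_constrain(vector x) {
    int N = rows(x);
    vector[N + 1] log_y;
    real log_stick = 0;  // log of the remaining stick length, log(1) = 0
    for (n in 1:N) {
      log_y[n] = log_stick + log_inv_logit(x[n]);
      // remaining stick: log(exp(log_stick) - exp(log_y[n]))
      log_stick = log_diff_exp(log_stick, log_y[n]);
    }
    log_y[N + 1] = log_stick;
    return log_y;
  }

  // With log_y[n] = log_stick + log_inv_logit(x[n]), the diagonal of the
  // Jacobian of x -> log_y[1:N] is d/dx log_inv_logit(x) = inv_logit(-x),
  // so the log absolute determinant is the sum of log_inv_logit(-x).
  real log_simplex_constrain_lj(vector x) {
    return sum(log_inv_logit(-x));
  }
}

Treat this as a sketch of the transform itself; wiring it up to the log-Dirichlet density and the target += Jacobian term in an actual model would still need to be checked.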

Thank you, this is all above and beyond! Your algebra all seems correct to me, although someone else is probably more likely than I am to find a mistake, if there is one.