Dirichlet-Multinomial for Transition Matrix Estimation

It depends. If you want to add the priors, you can calculate their contribution in Stan and add it to your log_likelihood parameter:

generated quantities {
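  // prior terms, accumulated here so they can be added to the log-likelihood in R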
  real lp = lkj_corr_cholesky_lpdf(L_rows | 4);
  matrix[N, N] row_corr = multiply_lower_tri_self_transpose(L_rows);
  
  lp += multi_normal_cholesky_lpdf(P_vec | mu_zero, chol_kronecker_prod(L_cols, L_rows));
}

then in R add lp to the log_likelihood parameter you have.

I was playing around with this a bit more and got the multivariate logit-normal to work. It’s actually quite easy once you realize that this is just a slight reparameterization of the multivariate normal. All we need is the transform and the appropriate jacobian adjustment. So here we go…

I’m going to rip this from the wiki page; the only “new” thing I did was extend it to a matrix-variate outcome:

The probability density function is:

f_X( \mathbf{x}; \boldsymbol{\mu} , \boldsymbol{\Sigma} ) = \frac{1}{ | (2 \pi)^{D-1} \boldsymbol{\Sigma} |^\frac{1}{2} } \, \frac{1}{ \prod\limits_{i=1}^D x_i } \, e^{- \frac{1}{2} \left\{ \log \left( \frac{ \mathbf{x}_{-D} }{ x_D } \right) - \boldsymbol{\mu} \right\}^\top \boldsymbol{\Sigma}^{-1} \left\{ \log \left( \frac{ \mathbf{x}_{-D} }{ x_D } \right) - \boldsymbol{\mu} \right\} } \quad , \quad \mathbf{x} \in \mathcal{S}^D \;\; ,

where \mathbf{x}_{-D} denotes a vector of the first (D-1) components of \mathbf{x} and \mathcal{S}^D denotes the simplex of D-dimensional probability vectors. This follows from applying the additive logistic transformation to map a multivariate normal random variable \mathbf{y} \sim \mathcal{N} \left( \boldsymbol{\mu} , \boldsymbol{\Sigma} \right) \; , \; \mathbf{y} \in \mathbb{R}^{D-1} to the simplex:

\mathbf{x} = \left[ \frac{ e^{ y_1 } }{ 1 + \sum_{i=1}^{D-1} e^{ y_i } } , \dots , \frac{ e^{ y_{D-1} } }{ 1 + \sum_{i=1}^{D-1} e^{ y_i } } , \frac{ 1 }{ 1 + \sum_{i=1}^{D-1} e^{ y_i } } \right]^\top

The unique inverse mapping is given by:

\mathbf{y} = \left[ \log \left( \frac{ x_1 }{ x_D } \right) , \dots , \log \left( \frac{ x_{D-1} }{ x_D } \right) \right]^\top.
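
To make the transform pair concrete, here is a minimal Stan sketch of the additive log-ratio map and its inverse (the function names alr and alr_inverse are mine, not part of the model below):

functions {
  // inverse additive log-ratio: map y in R^(D-1) onto the D-simplex
  vector alr_inverse(vector y) {
    int Dm1 = num_elements(y);
    vector[Dm1 + 1] x;
    real denom = 1 + sum(exp(y));
    x[1 : Dm1] = exp(y) / denom;
    x[Dm1 + 1] = 1.0 / denom;
    return x;
  }
  // additive log-ratio: map a D-simplex x back to y in R^(D-1)
  vector alr(vector x) {
    int D = num_elements(x);
    return log(x[1 : D - 1] / x[D]);
  }
}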

Instead of a vector we have a matrix. Taking a look at the matrix-variate normal pdf

p(\mathbf{X}\mid\mathbf{M}, \mathbf{U}, \mathbf{V}) = \frac{\exp\left( -\frac{1}{2} \, \mathrm{tr}\left[ \mathbf{V}^{-1} (\mathbf{X} - \mathbf{M})^{T} \mathbf{U}^{-1} (\mathbf{X} - \mathbf{M}) \right] \right)}{(2\pi)^{np/2} |\mathbf{V}|^{n/2} |\mathbf{U}|^{p/2}}

we see that if we transform either the rows or the columns with the above transform and add the Jacobian adjustment, it turns out to be a matrix-variate logit-normal on one dimension of the matrix. Here’s where I’m not 100% sure, but I think that because the adjustment happened to the values on one dimension we don’t have to worry about the other dimension; maybe I’m wrong. Anyway, the new pdf looks like

p(\mathbf{X}\mid\mathbf{M}, \mathbf{U}, \mathbf{V}) = \frac{\exp\left( -\frac{1}{2} \, \mathrm{tr}\left[ \mathbf{V}^{-1} (f(\mathbf{X}) - \mathbf{M})^{T} \mathbf{U}^{-1} (f(\mathbf{X}) - \mathbf{M}) \right] \right)}{(2\pi)^{n(D - 1)/2} |\mathbf{V}|^{n/2} |\mathbf{U}|^{p/2} \prod\limits_{i=1}^{ nD } x_i}

where \mathbf{X} is an N \times D matrix given by

\mathbf{X} = \begin{bmatrix} \frac{ e^{ y_{1,1} } }{ 1 + \sum_{i=1}^{D-1} e^{ y_{1,i} } } & \dots & \frac{ e^{ y_{1, D-1} } }{ 1 + \sum_{i=1}^{D-1} e^{ y_{1,i} } } & \frac{ 1 }{ 1 + \sum_{i=1}^{D-1} e^{ y_{1,i} } } \\ \frac{ e^{ y_{2,1} } }{ 1 + \sum_{i=1}^{D-1} e^{ y_{2,i} } } & \dots & \frac{ e^{ y_{2, D-1} } }{ 1 + \sum_{i=1}^{D-1} e^{ y_{2,i} } } & \frac{ 1 }{ 1 + \sum_{i=1}^{D-1} e^{ y_{2,i} } } \\ \vdots & \ddots & \vdots & \vdots \\ \frac{ e^{ y_{n,1} } }{ 1 + \sum_{i=1}^{D-1} e^{ y_{n,i} } } & \dots & \frac{ e^{ y_{n, D-1} } }{ 1 + \sum_{i=1}^{D-1} e^{ y_{n,i} } } & \frac{ 1 }{ 1 + \sum_{i=1}^{D-1} e^{ y_{n,i} } } \end{bmatrix}

and f(\mathbf{X}) is an N \times (D - 1) matrix given by

f(\mathbf{X}) = \begin{bmatrix} \log\big(\frac{ x_{1,1} }{ x_{1, D}}\big) & \dots & \log\big(\frac{ x_{1,D - 1} }{ x_{1, D}}\big) \\ \log\big(\frac{ x_{2,1} }{ x_{2, D}}\big) & \dots & \log\big(\frac{ x_{2,D - 1} }{ x_{2, D}}\big) \\ \vdots & \ddots & \vdots \\ \log\big(\frac{ x_{n,1} }{ x_{n, D}}\big) & \dots & \log\big(\frac{ x_{n,D - 1} }{ x_{n, D}}\big) \\ \end{bmatrix}.

The absolute determinant of the Jacobian works out to \prod_i x_i^{-1}, and each x_i is strictly positive because of the simplex constraint, so the log absolute determinant is

\text{log-abs-jac} = -\sum_i \log x_i
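
As a quick sanity check, with D = 3 and \mathbf{x} = (0.2, 0.3, 0.5) this gives -(\log 0.2 + \log 0.3 + \log 0.5) \approx 3.51; in the Stan code below it shows up as the jac -= sum(log(P[n])) accumulation.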

What all this means in practice is that we can reduce the number of parameters, because our Cholesky matrix is now one dimension smaller, which is nice, and we can keep the simplex specification. The matrix-variate normal can be written as a MVN by \mathrm{vec}(\mathbf{X}) \sim \mathcal{N}_{np}(\mathrm{vec}(\mathbf{M}), \mathbf{V} \otimes \mathbf{U}), where \otimes is the Kronecker product; since the Cholesky factor of a Kronecker product of covariance matrices is the Kronecker product of their Cholesky factors, we can build the Cholesky factor of \mathbf{V} \otimes \mathbf{U} directly from the factors of \mathbf{V} and \mathbf{U} (this is what chol_kronecker_prod below does). It might be faster to code the density above directly, depending on the size of n and D, but in Stan the MVN Cholesky parameterization has analytic derivatives, which speed up the computation considerably versus a hand-written pdf. Anyway, the Stan code is

functions {
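  // Cholesky factor of the Kronecker product V ⊗ U, built from A = chol(V) and B = chol(U):
  // chol(V ⊗ U) = chol(V) ⊗ chol(U), since the Kronecker product of lower-triangular
  // factors is itself lower triangular.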
  matrix chol_kronecker_prod(matrix A, matrix B) {
    int m = rows(A);
    int p = rows(B);
    int N = m * p;
    matrix[N, N] C = rep_matrix(0., N, N);
    
    for (i in 1 : m) {
      for (j in 1 : i) {
        if (fabs(A[i, j]) > 1e-12) {
          int row_start = (i - 1) * p + 1;
          int row_end = (i - 1) * p + p;
          int col_start = (j - 1) * p + 1;
          int col_end = (j - 1) * p + p;
          C[row_start : row_end, col_start : col_end] = A[i, j] * B;
        }
      }
    }
    return C;
  }
  
}
data {
  // Prior Parameters
  int<lower=1> K; // COLS. for Dirichlet & txn mat
  int<lower=1> N; // ROWS. for txn mat
  // Experimental Parameters
  array[N, K] int T; // input RAW txn matrix, the data
}
transformed data {
  matrix[K - 1, K - 1] L_cols = diag_matrix(rep_vector(1., K - 1));
//  vector[N * (K - 1) ] mu_zero = rep_vector(0., N * (K - 1) );
}
parameters {
  cholesky_factor_corr[N] L_rows;
  array[N] simplex[K] P;
  vector[N * (K - 1) ] mu;
}
transformed parameters {
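  // Additive log-ratio transform of each row-simplex, stacked into one long vector;
  // jac accumulates the log absolute determinant of the Jacobian adjustment.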
  array[1] vector[N * (K - 1) ] P_vec;
  real jac = 0;
  
  {
    int counter = 0;
    for (n in 1 : N) {
      jac -= sum(log(P[n]));
      P_vec[1, counter + 1 : counter + K - 1] = log(P[n, 1:K - 1] / P[n, K]); 
      counter += K - 1;
    }
  }
}
model {
  matrix[N * (K - 1) , N * (K - 1) ] L = chol_kronecker_prod(L_cols, L_rows);
  
  mu ~ std_normal();
  L_rows ~ lkj_corr_cholesky(4);
  target += jac;
  for (i in 1 : N) {
    T[i] ~ multinomial(P[i]); // likelihood
  }
  
  // matrix-variate logit-normal prior on the rows of P, in vectorized (Kronecker) form
  P_vec ~ multi_normal_cholesky(mu, L);
}
generated quantities {
  real lp = lkj_corr_cholesky_lpdf(L_rows | 4);
  matrix[N, N] row_corr = multiply_lower_tri_self_transpose(L_rows);
  
  lp += std_normal_lpdf(mu);
  lp += jac;
  lp += multi_normal_cholesky_lpdf(P_vec | mu, chol_kronecker_prod(L_cols, L_rows));
  for ( i in 1 : N) 
    lp += multinomial_lpmf(T[i] | P[i]);
}

I found that also fitting \mu gave a better fit, and I was able to reduce adapt_delta back to 0.8.


Thank you for this. Just had a chance to study this solution in more detail. There are some parts I still don’t quite get:

1- In the transformed matrix-valued normal pdf, why is there still |\boldsymbol{U}|^{p/2}? Shouldn’t the power be (D-1)/2 instead?

2- By dimensions in “we see that if we transform one of either the rows or cols with the above transform and add the jacobian adjustment it will turn out to be a matrix-variate logit-normal on one dimension of the matrix. Here’s where I’m not 100% sure but I think because the adjustment happened to the value on one dimension we don’t have to worry about the other dimension” do you mean column space and row space?

3- Does the Stan code take a single matrix as the input, or am I reading it incorrectly? The data is array[N, K] int T;, which is a single N x K matrix. If that’s the case then I need to extend it to take many vectorised matrices as input in order to properly estimate the parameters, especially \boldsymbol{U}, which controls the covariance of the rows.

1- In the transformed matrix-valued normal pdf, why is there still |\mathbf{U}|^{p/2}? Shouldn’t the power be (D-1)/2 instead?

Yes, I was mixing p and D - 1. I’ll update the math, but the code is right; just take p = D - 1, or

p(\mathbf{X}\mid\mathbf{M}, \mathbf{U}, \mathbf{V}) = \frac{\exp\left( -\frac{1}{2} \, \mathrm{tr}\left[ \mathbf{V}^{-1} (\mathbf{X} - \mathbf{M})^{T} \mathbf{U}^{-1} (\mathbf{X} - \mathbf{M}) \right] \right)}{(2\pi)^{n(D-1)/2} |\mathbf{V}|^{n/2} |\mathbf{U}|^{\frac{D-1}{2}}}

2- By dimensions in “ we see that if we transform one of either the rows or cols with the above transform and add the jacobian adjustment it will turn out to be a matrix-variate logit-normal on one dimension of the matrix. Here’s where I’m not 100% sure but I think because the adjustment happened to the value on one dimension we don’t have to worry about the other dimension ” do you mean column space and row space?

Yes

3- Does the Stan code take a single matrix as the input, or am I reading it incorrectly? The data is array[N, K] int T;, which is a single N x K matrix. If that’s the case then I need to extend it to take many vectorised matrices as input in order to properly estimate the parameters, especially \mathbf{U}, which controls the covariance of the rows.

You’re right on here, and this is what’s interesting and possibly where you want a different model or a different way to estimate this correlation. You have only one matrix, while the matrix-variate normal model assumes that your data-generating process would produce many N \times K random matrices.
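
If you can get replicated matrices, one rough way to extend the model above is to add a replicate index to the data and parameters. Here is a sketch under that assumption (M and the reshaped T are hypothetical names for the replicate count and the stacked count matrices; the generated quantities block is omitted):

functions {
  // same helper as above: Cholesky factor of a Kronecker product from the factors' Cholesky factors
  matrix chol_kronecker_prod(matrix A, matrix B) {
    int m = rows(A);
    int p = rows(B);
    int N = m * p;
    matrix[N, N] C = rep_matrix(0., N, N);
    for (i in 1 : m) {
      for (j in 1 : i) {
        if (fabs(A[i, j]) > 1e-12) {
          int row_start = (i - 1) * p + 1;
          int row_end = (i - 1) * p + p;
          int col_start = (j - 1) * p + 1;
          int col_end = (j - 1) * p + p;
          C[row_start : row_end, col_start : col_end] = A[i, j] * B;
        }
      }
    }
    return C;
  }
}
data {
  int<lower=1> M; // number of replicate transition matrices (assumed available)
  int<lower=1> K; // COLS
  int<lower=1> N; // ROWS
  array[M, N, K] int T; // M raw N x K count matrices
}
transformed data {
  matrix[K - 1, K - 1] L_cols = diag_matrix(rep_vector(1., K - 1));
}
parameters {
  cholesky_factor_corr[N] L_rows;
  array[M, N] simplex[K] P; // one set of row-simplexes per replicate
  vector[N * (K - 1)] mu;
}
transformed parameters {
  array[M] vector[N * (K - 1)] P_vec;
  real jac = 0;
  for (m in 1 : M) {
    int counter = 0;
    for (n in 1 : N) {
      jac -= sum(log(P[m, n]));
      P_vec[m, counter + 1 : counter + K - 1] = log(P[m, n, 1 : K - 1] / P[m, n, K]);
      counter += K - 1;
    }
  }
}
model {
  matrix[N * (K - 1), N * (K - 1)] L = chol_kronecker_prod(L_cols, L_rows);
  mu ~ std_normal();
  L_rows ~ lkj_corr_cholesky(4);
  target += jac;
  for (m in 1 : M) {
    for (i in 1 : N) {
      T[m, i] ~ multinomial(P[m, i]); // likelihood per replicate
    }
  }
  P_vec ~ multi_normal_cholesky(mu, L); // vectorized over the M replicates
}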
