Dirichlet-Multinomial for Transition Matrix Estimation

Dear Stan users,
I’m attempting to model a 3-by-3 transition matrix (discrete-time Markov chain) using a Dirichlet-Multinomial conjugate pair. My data is a “raw” transition matrix: instead of a given entry representing the probability of a transition, it represents the count of such transitions. As such, the likelihood is inherently trinomial (not categorical, because order doesn’t matter thanks to the Markov property). However, the likelihood calculation and the generation of a posterior Dirichlet (a K=3 simplex) need to occur three times per iteration, once for each row of my matrix.
So, I need a way to generate a Dirichlet prior (that doesn’t depend on my raw transition matrix), while still having my multinomial likelihood concentrate the information in the data into the single “theta” parameter that the multinomial sampling statement affords (14.1 in the manual). I’m not sure in which block to declare my Dirichlet prior (generated quantities or model?), or in which block to define my single multinomial parameter. I’m also very new to matrix representation in Stan, so I apologize for the presence of syntactical errors. Any tips would be greatly appreciated by this incipient Stan user!
Cheers


data {
	// Prior Parameters
   int<lower = 1> K; 		// COLS. for Dirichlet & txn mat
   int<lower = 1> N; 		// ROWS. for txn mat
   real<lower = 0> alpha; 	// for Dirichlet

        // Experimental Parameters
   matrix[N,K] T; 		// input RAW txn matrix, the data
 }

generated quantities { 
   vector[K] theta = dirichlet_rng(rep_vector(alpha, K));  // prior
 }

parameters {
  matrix[N,K] P;         // parameter of interest
}

model {
     theta ~ dirichlet(alpha);	// prior?
    for(i in 1:N) {
    P[i,] ~ multinomial T[i,];        // likelihood
    }
 }

Sorry, your question fell through the cracks a bit. I don’t immediately see what the problem with your model is, but if I understand it correctly, you want P to be a transition matrix. In that case, I think you should define it as an array of simplices (as that enforces that the transition probabilities sum to 1). Additionally, T holds counts, so it should probably be int. Here is a rough sketch (not tested) of how I would code the model:

data {
	// Prior Parameters
   int<lower = 1> K; 		// COLS. for Dirichlet & txn mat
   int<lower = 1> N; 		// ROWS. for txn mat
   real<lower = 0> alpha; 	// for Dirichlet

        // Experimental Parameters
   int T[N,K]; 		// input RAW txn matrix, the data
 }

parameters {
   simplex[K] P[N]; 
}

model {
  for(i in 1:N) {
    P[i] ~ dirichlet(rep_vector(alpha, K)); // Prior, Dirichlet with all arguments the same
    T[i] ~ multinomial(P[i]);        // likelihood
  }
}
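
As an aside, since the Dirichlet is conjugate to the multinomial and the rows are independent, the posterior for each row is available in closed form as Dirichlet(alpha + counts), so you could even draw it directly in generated quantities. A rough, untested sketch:

generated quantities {
  // conjugate closed form: row i of P given T is Dirichlet(alpha + T[i])
  vector[K] P_conj[N];
  for (i in 1:N)
    P_conj[i] = dirichlet_rng(rep_vector(alpha, K) + to_vector(T[i]));
}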

Hope that clarifies some of the stuff and best of luck with your model!


I want to do a similar thing, but your model assumes independence between the rows of the matrix. Do you know how to account for dependency between rows? Is there a multivariate Dirichlet distribution defined in Stan that can generate multiple probability vectors while accounting for their covariance?

One way is to use a copula to measure the dependence, but you’d need to write that out yourself, along with the CDF of the Dirichlet, which I believe involves the 1d integrator (see CDF of Dirichlet Distribution). Note that this will be really slow if N is even moderately large (30+). There are N simplexes, so you have an N \times N correlation matrix. If you assume that the dependence is captured by a Gaussian copula, then you can specify a cholesky_factor_corr (and put LKJ priors on it if you choose) to capture this dependency. The pseudo-code is below; you will have to code up the CDF of the Dirichlet yourself.

functions {
  real multi_normal_copula_lpdf(matrix U, matrix L) {
    int N = rows(U);
    int J = cols(U);
    // inv_Phi has already been applied outside, so U holds the normal scores
    // matrix[N, J] Q = to_matrix( inv_Phi(to_vector(U)), N, J );
    matrix[J, J] Gammainv = chol2inv(L);
    return -N * sum(log(diagonal(L))) - 0.5 * sum( add_diag(Gammainv, -1.0) .* crossprod(U) );
  }
  // CDF of the Dirichlet; needs the 1d integrator, see
  // https://stats.stackexchange.com/a/544771
  real dirichlet_cdf(vector y, vector alpha) {
    // to be implemented
  }
}
data {
	// Prior Parameters
   int<lower = 1> K; 		// COLS. for Dirichlet & txn mat
   int<lower = 1> N; 		// ROWS. for txn mat
   real<lower = 0> alpha; 	// for Dirichlet

        // Experimental Parameters
   int T[N,K]; 		// input RAW txn matrix, the data
 }

parameters {
  cholesky_factor_corr[N] L;
  simplex[K] P[N]; 
}
model {
  matrix[1, N] y;

  for (i in 1:N) {
    P[i] ~ dirichlet(rep_vector(alpha, K)); // prior, Dirichlet with all concentrations equal
    T[i] ~ multinomial(P[i]);               // likelihood
    y[1, i] = inv_Phi(dirichlet_cdf(P[i] | rep_vector(alpha, K)));
  }

  y ~ multi_normal_copula(L);
}

Also, if your transition matrix is doubly stochastic (rows and columns each summing to 1), then you can build that in following Doubly Stochastic Matrices in Stan.


Thank you for the detailed answer. Just a quick question (sorry if it’s trivial, I am still learning Stan): if I want to calculate the likelihood of newly observed stochastic matrices under the fitted model, should I do it in the generated quantities block? Is there an example showing how to do that?

E.g., do you have a known array of size N, where each element is a simplex of length K, and you’d assume the same correlation structure? Or do you have another T matrix and you need to fit the array of simplices?

If it’s the former, then you could calculate the likelihood values anywhere. If you do it in generated quantities, you already have everything: you just declare a new real value and increment it by what’s in the model block.

...
generated quantities {
  real lp = 0;
  matrix[1, N] y_; // assumes P_given is declared in data

  for (i in 1:N) {
    lp += dirichlet_lpdf(P_given[i] | rep_vector(alpha, K)); // prior
    lp += multinomial_lpmf(T[i] | P_given[i]);               // likelihood
    y_[1, i] = inv_Phi(dirichlet_cdf(P_given[i] | rep_vector(alpha, K)));
  }

  lp += multi_normal_copula_lpdf(y_ | L);
}

You can remove the prior if you don’t want that.


Thanks. Yes, the former: I just want to see in which density quantile a newly observed matrix falls, using the previously fitted distribution. I was trying to code the Dirichlet CDF using the link you mentioned, but I don’t think it’s a 1d integration; according to the link, the integral is over the (k-1)-dimensional space, and I have no idea how to do multidimensional integration in Stan. But I found another link, r - Implementation of Dirichlet cdf? - Cross Validated, which gives a simulation approximation to the CDF. Do you think it’s possible to use the simulation solution in Stan for the CDF?

Ok, you may want to try the multivariate generalization of the logit-normal: Logit-normal distribution - Wikipedia. You don’t need the CDF, but you would need to code up the _lpdf in Stan.
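
On the simulation idea: _rng functions are only available in transformed data and generated quantities, not in the model block, so a Monte Carlo CDF couldn’t feed the copula density during sampling. For post-hoc checks, something like this rough sketch could work (x, the point of evaluation, would be new data; K and alpha as before):

generated quantities {
  // Monte Carlo approximation of the Dirichlet CDF at a point x (a K-simplex passed as data)
  real dir_cdf_approx;
  {
    int S = 10000; // number of simulation draws
    int count = 0;
    for (s in 1:S) {
      vector[K] u = dirichlet_rng(rep_vector(alpha, K));
      int below = 1;
      // count draws that are componentwise <= x
      for (k in 1:K)
        if (u[k] > x[k]) below = 0;
      count += below;
    }
    dir_cdf_approx = count * 1.0 / S;
  }
}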

I think it looks like the following. I haven’t tested it, and I couldn’t find an implementation of the multivariate logistic normal to test the density output against.

functions {
  real multi_logistic_normal_cholesky_lpdf(array[] vector y, vector mu,
                                           matrix L) {
    int N = dims(y)[1];
    int Dm1 = dims(y)[2] - 1;
    real norm_const = -0.5 * N * Dm1 * log(2 * pi());
    real inv_sqrt_det_log = N * sum(log(diagonal(L)));
    matrix[N, Dm1] x;
    matrix[Dm1, N] L_inv_x_minus_mu;
    
    for (n in 1 : N) {
      // additive log-ratio transform; log-Jacobian folded into norm_const
      x[n] = (log(y[n, 1 : Dm1] / y[n, Dm1 + 1]) - mu)';
      norm_const -= sum(log(y[n]));
    }
    
    L_inv_x_minus_mu = mdivide_left_tri_low(L, x');
    
    return norm_const - inv_sqrt_det_log
           - 0.5 * sum(columns_dot_self(L_inv_x_minus_mu));
  }
}
data {
  // Prior Parameters
  int<lower=1> K; // COLS. for Dirichlet & txn mat
  int<lower=1> N; // ROWS. for txn mat
  real<lower=0> alpha; // for Dirichlet
  
  // Experimental Parameters
  array[N, K] int T; // input RAW txn matrix, the data
  array[N] vector[K] P_given;
}
transformed data {
  // see "Relationship with the Dirichlet distribution"
  // at https://en.wikipedia.org/wiki/Logit-normal_distribution
  // if you want different alpha's then use this:
  // vector[K - 1] mu;
  // for (k in 1:K - 1)
  //   mu[k] = digamma(alpha[k]) - digamma(alpha[K]);
  // since all alpha's are equal here:
  vector[K - 1] mu = rep_vector(0., K - 1);
}
parameters {
  cholesky_factor_corr[K - 1] L;
  array[N] simplex[K] P;
}
model {
  P ~ multi_logistic_normal_cholesky(mu, L);
  for (i in 1 : N) {
    T[i] ~ multinomial(P[i]); // likelihood
  }
}
generated quantities {
  real lp = multi_logistic_normal_cholesky_lpdf(P_given | mu, L);
  
  for (i in 1 : N) 
    lp += multinomial_lpmf(T[i] | P_given[i]); // likelihood
}

Ah, scratch that: this is just going to give the correlation within the K - 1 elements, not across the N :(

It’s really fiddly, at least for me. Do you think it’s possible to capture the dependency between rows of the matrix with a hierarchical model, using hyperparameters on the parameters of the Dirichlet, instead of the copula method?

Do you have some test data?

I have, but I am not sure how to share it on here. Here is just a chunk of it:

rbind(c( 79,  56,  81,  67,  79, 38, 28,  63,  83, 58,  65, 49,  42,  56,  81,  56),
      c( 90,  53,  67,  63,  55, 36, 27,  55,  87, 38,  48, 29,  40,  47,  60,  67),
      c(217, 104, 128, 149, 146, 64, 27,  78, 126, 67, 104, 79, 108,  81, 118, 100),
      c(151,  78,  91, 128, 115, 52, 25,  71, 102, 52,  99, 78,  80,  81, 116,  90),
      c(280, 148, 189, 194, 199, 94, 35, 111, 215, 85, 137, 79, 117, 112, 155, 123),
      c(165,  72, 139, 114, 123, 67, 36,  68, 154, 76, 107, 60,  47,  79, 116,  71))

Each row represents a 4x4 matrix of raw count data. Note that each matrix is unrolled row-wise in this structure, so columns 1 to 4 represent the first row of the square matrix, 5:8 the 2nd row, 9:12 the 3rd, and 13:16 the 4th row.
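
In Stan, unpacking each flattened row back into a 4x4 count matrix could look like this (a rough sketch with made-up names; M is the number of matrices, 6 in the chunk above):

data {
  int<lower=1> M;          // number of flattened matrices
  array[M, 16] int T_flat; // each row: one 4x4 matrix unrolled row-wise
}
transformed data {
  array[M, 4, 4] int T_sq; // re-assembled square matrices
  for (m in 1:M)
    for (r in 1:4)
      for (c in 1:4)
        T_sq[m, r, c] = T_flat[m, (r - 1) * 4 + c];
}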

Can you test this? I’m using the fact that a matrix-variate normal can be written as a multivariate normal, but with the logit-normal representation. Since you don’t assume any correlation across the columns, I make an identity matrix for the column correlation; the row correlation is estimated. These need to be combined into one big correlation matrix via the Kronecker product, and all of this goes into the logit-normal.
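
For reference, the identity I’m leaning on (as I understand it): if X follows a matrix normal with row covariance U and column covariance V, then \text{vec}(X) \sim N(\text{vec}(M), V \otimes U), and Cholesky factors respect the product, \text{chol}(A \otimes B) = \text{chol}(A) \otimes \text{chol}(B), which is what chol_kronecker_prod below builds from the two factors.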

See if the L_rows cholesky correlation matrix makes sense for your problem.

I’m not sure this actually works. The elements within each row could be logit-normal, but the elements across rows are not logit-normal; they are something else (constrained to be between 0 and 1, with the added constraint of summing to 1 across the columns). At the end of the day, I’m not sure I can make much more progress without a lot of extra effort (if it’s even possible). Maybe someone else has a better idea or knows how to update the distribution with the additional constraint.

functions {
  matrix chol_kronecker_prod(matrix A, matrix B) {
    int m = rows(A);
    int p = rows(B);
    int N = m * p;
    matrix[N, N] C = rep_matrix(0., N, N);

    for (i in 1:m) {
      for (j in 1:i) {
        if (fabs(A[i, j]) > 1e-12) {
          int row_start = (i - 1) * p + 1;
          int row_end = (i - 1) * p + p;
          int col_start = (j - 1) * p + 1;
          int col_end = (j - 1) * p + p;
          C[row_start:row_end, col_start:col_end] = A[i, j] * B;
        }
      }
    }
    return C;
  }
  real multi_logistic_normal_cholesky_lpdf(array[] vector y, mu vector, matrix L) {
    int N = dims(y)[1];
    int Dm1 = dims(y)[2] - 1;
    real norm_const = -0.5 * N * Dm1 * log(2 * pi());
    real inv_sqrt_det_log = N * sum(log(diagonal(L)));
    matrix[N, Dm1] x;
    matrix[Dm1, N] L_inv_x_minus_mu;

    for (n in 1:N) {
      x[n] = (log(y[n, 1:Dm1] / y[n, Dm1 + 1]) - mu)';
      norm_const -= sum(log(y[n]));
    }

    L_inv_x_minus_mu = mdivide_left_tri_low(L, x');

    return norm_const - inv_sqrt_det_log - 0.5 * sum(columns_dot_self(L_inv_x_minus_mu));
  }

}
data {
	// Prior Parameters
   int<lower = 1> K; 		// COLS. for Dirichlet & txn mat
   int<lower = 1> N; 		// ROWS. for txn mat
   real<lower = 0> alpha; 	// for Dirichlet

        // Experimental Parameters
   int T[N, K]; 		// input RAW txn matrix, the data
 }
transformed data {
  matrix[K, K] L_cols = diag_matrix(rep_vector(1., K));
  vector[N * K] mu_zero = rep_vector(0., N * K);
}
parameters {
   cholesky_factor_corr[N] L_rows;
   simplex[K] P[N]; 
}
model {
  array[1] vector[N * K] P_vec;
  // row factor first so the ordering matches P_vec, which is packed row-major below
  matrix[N * K, N * K] L = chol_kronecker_prod(L_rows, L_cols);

  for (i in 1:N) {
    P_vec[1, (i - 1) * K + 1 : i * K] = P[i]; // pack the simplices into one long vector
    T[i] ~ multinomial(P[i]); // likelihood
  }

  P_vec ~ multi_logistic_normal_cholesky(mu_zero, L);
}

This error:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
 error in 'model39ae22e567be_dirichlet_complex' at line 21, column 45
  -------------------------------------------------
    19:   return C;
    20: }
    21:     real multi_logistic_normal_cholesky_lpdf(array[] y vector, mu vector, matrix L) {
                                                    ^
    22:     int N = dims(y)[1];
  -------------------------------------------------

PARSER EXPECTED: <argument declaration or close paren ) to end argument declarations>
Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model 'dirichlet_complex' due to the above error.

I haven’t tested it, so you’ll have to debug a bit. That just means I swapped vector and y (and the same swap applies to mu). Make it

 real multi_logistic_normal_cholesky_lpdf(array[] vector y, vector mu, matrix L)

I gotta head out; see my update to the post above, as this may not actually work.

Gosh, I just kept thinking about this and it’s possibly a bit easier than I thought. Just softmax the multi_normal. I get the following correlation:

> row_corr <- matrix(out$summary("row_corr")$mean, 6, 6)
> row_corr
           [,1]       [,2]       [,3]       [,4]        [,5]        [,6]
[1,] 1.00000000 0.19018326 0.04584484 0.25498198  0.17800901  0.14520437
[2,] 0.19018326 1.00000000 0.04175253 0.14225909  0.01818469  0.19861080
[3,] 0.04584484 0.04175253 1.00000000 0.11565893  0.14527326  0.06202513
[4,] 0.25498198 0.14225909 0.11565893 1.00000000  0.12278280  0.09870329
[5,] 0.17800901 0.01818469 0.14527326 0.12278280  1.00000000 -0.01277196
[6,] 0.14520437 0.19861080 0.06202513 0.09870329 -0.01277196  1.00000000

The Stan code is:

functions {
  matrix chol_kronecker_prod(matrix A, matrix B) {
    int m = rows(A);
    int p = rows(B);
    int N = m * p;
    matrix[N, N] C = rep_matrix(0., N, N);

    for (i in 1:m) {
      for (j in 1:i) {
        if (fabs(A[i, j]) > 1e-12) {
          int row_start = (i - 1) * p + 1;
          int row_end = (i - 1) * p + p;
          int col_start = (j - 1) * p + 1;
          int col_end = (j - 1) * p + p;
          C[row_start:row_end, col_start:col_end] = A[i, j] * B;
        }
      }
    }
    return C;
  }
}
data {
	// Prior Parameters
   int<lower = 1> K; 		// COLS. for Dirichlet & txn mat
   int<lower = 1> N; 		// ROWS. for txn mat
        // Experimental Parameters
   int T[N, K]; 		// input RAW txn matrix, the data
 }
transformed data {
  matrix[K, K] L_cols = diag_matrix(rep_vector(1., K));
  vector[N * K] mu_zero = rep_vector(0., N * K);
}
parameters {
   cholesky_factor_corr[N] L_rows;
   array[1] vector[N * K] P_vec;
}
transformed parameters {
  array[N] simplex[K] P; 
  
  {
    int counter = 0;
    for (n in 1 : N) {
      P[n] = softmax(P_vec[1, counter + 1 : counter + K]);
      counter += K;
    }
  }
}
model {
  // row factor first so the ordering matches P_vec, which is unpacked row-major above
  matrix[N * K, N * K] L = chol_kronecker_prod(L_rows, L_cols);

  L_rows ~ lkj_corr_cholesky(4);

  for (i in 1:N) 
    T[i] ~ multinomial(P[i]); // likelihood

  P_vec ~ multi_normal_cholesky(mu_zero, L);
}
generated quantities {
  matrix[N, N] row_corr = multiply_lower_tri_self_transpose(L_rows);
}

Thank you for the effort. I suppose all of those functions and the Cholesky decomposition are just for efficient computation; otherwise the model itself is simply putting an MVN prior on the elements of the matrix to capture their dependency via the covariance matrix, and then normalizing them row-wise using the softmax to get the multinomial probabilities. So after fitting the model, and using the posterior P vectors, the likelihood of a new matrix (T_new) can be calculated as:

x <- numeric(nrow(T_new))
for (i in 1:nrow(T_new)) {
  # P here: the posterior mean (or a single draw) of the fitted probabilities, as a matrix
  x[i] <- dmultinom(T_new[i, ], prob = P[i, ], log = TRUE)
}
log_likelihood <- sum(x)

Please correct me if I got something wrong.

I just had it all from messing around with the logit-normal; this is just more generalized. If you have reason to believe that both the rows and columns have a dependency structure, you can use this; otherwise a regular MVN for either the rows or the columns works.
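
And to score the new matrix inside Stan rather than in R, the same sum could go in generated quantities, assuming T_new is passed in as data. A rough sketch:

data {
  // ... as before, plus:
  array[N, K] int T_new; // new matrix to score
}
generated quantities {
  // log-likelihood of the new counts under the fitted row probabilities
  real log_lik_new = 0;
  for (i in 1:N)
    log_lik_new += multinomial_lpmf(T_new[i] | P[i]);
}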