It seems to me that the Stan code for the LDA model in the manual does not deal with label switching. Does it make sense to use an ordered simplex? I.e.,
data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  positive_ordered[K] theta_first;  // topic dist for the 1st doc
  simplex[K] theta_ex_first[M-1];   // topic dists for the remaining docs
  simplex[V] phi[K];                // word dist for topic k
}
transformed parameters {
  simplex[K] theta_first_transform = theta_first / sum(theta_first);
  simplex[K] theta[M];              // combined topic dists for all docs
  theta[1] = theta_first_transform;
  theta[2:M] = theta_ex_first;
}
model {
  for (k in 1:K)
    theta_first[k] ~ gamma(alpha[k], 1);  // normalized gammas give a Dirichlet on the simplex
  for (m in 1:(M-1))
    theta_ex_first[m] ~ dirichlet(alpha); // prior
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);             // prior
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);         // likelihood
  }
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    log_lik[n] = log_sum_exp(gamma);
  }
}
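As a quick sanity check on the prior used for theta_first above: independent Gamma(alpha_k, 1) draws, normalized by their sum, are distributed Dirichlet(alpha). A minimal numerical sketch (names here are illustrative, not part of the Stan program):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])          # toy topic prior, K = 3
g = rng.gamma(shape=alpha, scale=1.0, size=(100_000, 3))
theta = g / g.sum(axis=1, keepdims=True)   # normalize each draw onto the simplex

# Dirichlet(alpha) has mean alpha / sum(alpha) = [0.2, 0.3, 0.5]
print(np.round(theta.mean(axis=0), 2))
```

The positive_ordered constraint then simply restricts these draws to the sorted region, which is what breaks the label-switching symmetry for the first document.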
It does not fix all the problems, as I still see divergences and large R-hats in some simulations. But at least it reduces the non-identifiability from label switching.
In a simple simulation with a few hundred documents, the average R-hat over all parameters across 200 parallel chains is 1.6 in the non-ordered model and 1.2 in the ordered one (by the way, computing R-hat for even a moderately large number of chains is extremely slow). Notably, the loo elpd is almost identical in the two models, so label switching does not affect predictive performance. Indeed, the average R-hat for log_lik is nearly 1 in both cases, which suggests the parameter distribution has multiple modes that are equally predictive, and that the multimodality is only partly due to label switching.
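The pattern above can be reproduced in miniature with a toy split-R-hat (the usual between/within-chain variance ratio; this is a simplified sketch, not the rank-normalized version, and all names are made up for illustration). Chains stuck in label-swapped modes inflate R-hat for the topic parameters, while a permutation-invariant quantity like log_lik mixes fine:

```python
import numpy as np

def split_rhat(chains):
    """chains: (n_chains, n_draws) array; returns a basic split R-hat."""
    half = chains.shape[1] // 2
    splits = np.concatenate([chains[:, :half], chains[:, half:2 * half]])
    n = splits.shape[1]
    b = n * splits.mean(axis=1).var(ddof=1)   # between-chain variance
    w = splits.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * w + b / n
    return np.sqrt(var_plus / w)

rng = np.random.default_rng(1)
# Two chains stuck in label-swapped modes: theta_1 centered at 0.2 vs 0.8
theta_chains = np.stack([rng.normal(0.2, 0.05, 1000),
                         rng.normal(0.8, 0.05, 1000)])
# A permutation-invariant quantity looks the same in both chains
loglik_chains = rng.normal(-5.0, 0.3, size=(2, 1000))

print(split_rhat(theta_chains))   # far above 1
print(split_rhat(loglik_chains))  # close to 1
```

This is consistent with the observation that log_lik has R-hat near 1 even when the parameters do not: any mode-hopping or relabeling that leaves the predictive density unchanged is invisible to R-hat on log_lik.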