Label switching in the latent Dirichlet allocation model in the manual

It seems to me that the Stan code for the LDA model in the manual does not deal with label switching. Does it make sense to use an ordered simplex? That is:

data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  positive_ordered[K] theta_first;  // unnormalized topic dist for doc 1
  simplex[K] theta_ex_first[M-1];   // topic dists for docs 2 through M
  simplex[V] phi[K];     // word dist for topic k
}
transformed parameters {
  simplex[K] theta_first_transform = theta_first / sum(theta_first);
  simplex[K] theta[M];  // topic dists for all docs, combined
  theta[1] = theta_first_transform;
  theta[2:M] = theta_ex_first;
}
model {
  for (k in 1:K)
    theta_first[k] ~ gamma(alpha[k], 1);  // independent gammas; normalized vector is Dirichlet(alpha) on the ordered region
  for (m in 1:(M-1))
    theta_ex_first[m] ~ dirichlet(alpha);  // prior
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);     // prior
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);  // marginal log likelihood for word n
  }
}

generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    log_lik[n] = log_sum_exp(gamma);
  }
}
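
For what it's worth, the prior on theta_first relies on the standard gamma–Dirichlet construction: if

$$x_k \sim \textrm{Gamma}(\alpha_k, 1) \quad \text{independently for } k = 1, \dots, K,$$

then the normalized vector $x / \sum_k x_k$ is $\textrm{Dirichlet}(\alpha)$ distributed, independently of the sum $\sum_k x_k \sim \textrm{Gamma}(\sum_k \alpha_k, 1)$. Declaring theta_first as positive_ordered simply truncates this Dirichlet to the region $\theta_1 \le \dots \le \theta_K$; for a symmetric alpha this region carries exactly one of the $K!$ equivalent labelings, which is what breaks the label symmetry for the first document.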

It does not fix all the problems, as I still see divergences and large R-hats in some simulations; presumably ordering the topic proportions of just one document only weakly identifies the labels, since nothing directly orders phi or the other documents' theta. But at least it reduces the non-identifiability from label switching.

In a simple simulation with a few hundred documents, the average R-hat over all parameters across 200 parallel chains is 1.6 in the non-ordered model and 1.2 in the ordered one (as an aside, computing R-hat for this many chains is extremely slow). Notably, the loo elpd is almost identical in the two models, so label switching does not hurt predictive performance. Indeed, the average R-hat for log_lik is close to 1 in both cases, which suggests that the parameter posterior has multiple modes that are equally predictive, and that these are only partly due to label switching.
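
This makes sense given the permutation invariance of the mixture likelihood: for any permutation $\sigma$ of the topic labels,

$$p(w_n \mid \theta, \phi) = \sum_{k=1}^{K} \theta_{doc[n],k}\, \phi_{k,w_n} = \sum_{k=1}^{K} \theta_{doc[n],\sigma(k)}\, \phi_{\sigma(k),w_n},$$

so chains stuck in different label-switched modes produce identical log_lik values (hence R-hat near 1 for log_lik) while disagreeing on the individual theta and phi components.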
