It seems to me that the Stan code for the LDA model in the manual does not deal with label switching. Does it make sense to use an ordered simplex? I.e.,
data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  positive_ordered[K] theta_first;  // topic dist for the 1st doc
  simplex[K] theta_ex_first[M-1];   // topic dists for the remaining docs
  simplex[V] phi[K];                // word dist for topic k
}
transformed parameters {
  simplex[K] theta_first_transform = theta_first / sum(theta_first);
  simplex[K] theta[M];              // combined topic dists for all docs
  theta[1] = theta_first_transform;
  theta[2:M] = theta_ex_first;
}
model {
  for (k in 1:K)
    theta_first[k] ~ gamma(alpha[k], 1);  // normalized gammas give a Dirichlet on the simplex
  for (m in 1:(M-1))
    theta_ex_first[m] ~ dirichlet(alpha); // prior
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);             // prior
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);         // likelihood
  }
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    log_lik[n] = log_sum_exp(gamma);
  }
}
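As a quick sanity check on the prior used for theta_first above: independent Gamma(alpha_k, 1) draws, normalized by their sum, are distributed Dirichlet(alpha). A minimal numerical sketch (names here are illustrative, not part of the Stan program):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])          # toy topic prior, K = 3
g = rng.gamma(shape=alpha, scale=1.0, size=(100_000, 3))
theta = g / g.sum(axis=1, keepdims=True)   # normalize each draw onto the simplex

# Dirichlet(alpha) has mean alpha / sum(alpha) = [0.2, 0.3, 0.5]
print(np.round(theta.mean(axis=0), 2))
```

The positive_ordered constraint then simply restricts these draws to the sorted region, which is what breaks the label-switching symmetry for the first document.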
It does not fix all the problems, as I still see divergences and large R-hats in some simulations. But at least it reduces the non-identifiability from label switching.
In a simple simulation with a few hundred documents, the average R-hat over all parameters across 200 parallel chains is 1.6 in the non-ordered model and 1.2 in the ordered one (by the way, computing R-hat for even a moderately large number of chains is extremely slow). Notably, the loo elpd is almost identical in the two models, so label switching does not affect predictive performance. Indeed, the average R-hat for log_lik is nearly 1 in both cases, which suggests the parameter distribution has multiple modes that are equally predictive, and that the multimodality is only partly due to label switching.
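The pattern above can be reproduced in miniature with a toy split-R-hat (the usual between/within-chain variance ratio; this is a simplified sketch, not the rank-normalized version, and all names are made up for illustration). Chains stuck in label-swapped modes inflate R-hat for the topic parameters, while a permutation-invariant quantity like log_lik mixes fine:

```python
import numpy as np

def split_rhat(chains):
    """chains: (n_chains, n_draws) array; returns a basic split R-hat."""
    half = chains.shape[1] // 2
    splits = np.concatenate([chains[:, :half], chains[:, half:2 * half]])
    n = splits.shape[1]
    b = n * splits.mean(axis=1).var(ddof=1)   # between-chain variance
    w = splits.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * w + b / n
    return np.sqrt(var_plus / w)

rng = np.random.default_rng(1)
# Two chains stuck in label-swapped modes: theta_1 centered at 0.2 vs 0.8
theta_chains = np.stack([rng.normal(0.2, 0.05, 1000),
                         rng.normal(0.8, 0.05, 1000)])
# A permutation-invariant quantity looks the same in both chains
loglik_chains = rng.normal(-5.0, 0.3, size=(2, 1000))

print(split_rhat(theta_chains))   # far above 1
print(split_rhat(loglik_chains))  # close to 1
```

This is consistent with the observation that log_lik has R-hat near 1 even when the parameters do not: any mode-hopping or relabeling that leaves the predictive density unchanged is invisible to R-hat on log_lik.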