Hi everyone,
I am new to Stan.
I am running LDA on about 200 short documents, each with about 10-30 words.
When I fit the LDA model to this corpus, every topic in every document ends up with a proportion of around 33%, with very little variation. And when I print out the top words for each topic, the topics look very similar to each other, especially two of them. Is there any way to diagnose why this might be happening?
And how might I improve the fit, other than reducing the number of topics from 3 to 2?
Is there any way to see the perplexity, coherence, or likelihood of an LDA fit?
data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta[M];  // topic distribution for doc m
  simplex[V] phi[K];    // word distribution for topic k
}
model {
  for (m in 1:M)
    theta[m] ~ dirichlet(alpha);  // prior on per-document topic proportions
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);     // prior on per-topic word distributions
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);  // likelihood: marginalize over the topic assignment of word n
  }
}
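
For the likelihood part, I was thinking of adding a generated quantities block like the sketch below (the log_lik name is just my own), which repeats the marginalization from the model block to accumulate the total log-likelihood of the observed tokens in each draw. I assume perplexity could then be computed afterwards from the posterior draws as exp(-log_lik / N), but I am not sure whether this is the right approach.

generated quantities {
  real log_lik = 0;  // total log-likelihood of all observed tokens for this draw
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    log_lik += log_sum_exp(gamma);  // same per-token marginalization as in the model block
  }
}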