LDA: topics do not separate

Hi everyone,

I am new to Stan.

I am running LDA on about 200 short documents, each with about 10-30 words.
When I fit the LDA model to the corpus, the topic distribution in every document comes out close to uniform: each of the three topics gets around 33%, with little variation. And when I print out the top words for each topic, the topics look very similar to each other, especially two of them. Is there any way to diagnose why this may be happening?

And how might I improve the fit, other than reducing the number of topics from 3 to 2?

Also, is there any way to compute the perplexity, coherence, or likelihood of an LDA fit?

data {
  int<lower=2> K;                      // num topics
  int<lower=2> V;                      // num words
  int<lower=1> M;                      // num docs
  int<lower=1> N;                      // total word instances
  array[N] int<lower=1, upper=V> w;    // word n
  array[N] int<lower=1, upper=M> doc;  // doc ID for word n
  vector<lower=0>[K] alpha;            // topic prior
  vector<lower=0>[V] beta;             // word prior
}
parameters {
  simplex[K] theta[M];   // topic dist for doc m
  simplex[V] phi[K];     // word dist for topic k
}
model {
  for (m in 1:M)
    theta[m] ~ dirichlet(alpha);  // prior on doc-topic distributions
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);     // prior on topic-word distributions
  for (n in 1:N) {
    vector[K] gamma;
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);  // marginalize out the topic of word n
  }
}
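
For the likelihood part of my question, would something like the generated quantities block below be the right way to get it out of the fit? This is just my untested guess, mirroring the loop in the model block (log_lik is a name I made up, not anything Stan requires); I think perplexity could then be computed from the draws as exp(-sum(log_lik) / N).

generated quantities {
  vector[N] log_lik;  // pointwise log-likelihood of each word instance
  for (n in 1:N) {
    vector[K] lp;
    for (k in 1:K)
      lp[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    log_lik[n] = log_sum_exp(lp);  // same marginalization as the model block
  }
}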

Hi, this is rather outside my expertise, but I noticed @Bob_Carpenter has been involved in a few LDA-related topics here, so hopefully he's not too busy and can provide some insight.

My overall impression from skimming the forums, however, is that LDAs are hard :-(

I work with LDAs in ecology (basically taking research sites and turning them into a corpus). If you can post the whole R code, I can take a look at it. When I suspect my LDA setup is off, I sometimes run it through the topicmodels package in R, and through a Python implementation as well, just to double-check.


Oh, what are your priors set to?
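
Dirichlet concentrations below 1 tend to favor sparse simplexes, which can help topics separate on short documents. Since alpha and beta come into your program as data, you could just pass in smaller values, or hard-code them with something like this untested sketch (the 0.1 and 0.01 are illustrative guesses, not tuned values):

transformed data {
  // illustrative sparse priors: concentrations below 1 push each
  // document toward fewer topics and each topic toward fewer words
  vector[K] alpha_sparse = rep_vector(0.1, K);
  vector[V] beta_sparse = rep_vector(0.01, V);
}

You'd then use dirichlet(alpha_sparse) and dirichlet(beta_sparse) in the model block.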