LDA: topics do not separate

Hi everyone,

I am new to Stan.

I am running LDA on about 200 short documents, each with about 10-30 words.
When I fit the LDA model to the corpus, the topic distribution in every document comes out close to uniform: each of the three topics gets around 33%, with little variation. And when I print out the top words for each topic, the topics look very similar to each other, especially two of them. Is there any way to diagnose why this may be happening?

And how might I improve the fit, other than reducing the number of topics from 3 to 2?

Also, is there any way to compute the perplexity, coherence, or likelihood of an LDA fit?

data {
  int<lower=2> K;                      // num topics
  int<lower=2> V;                      // num words
  int<lower=1> M;                      // num docs
  int<lower=1> N;                      // total word instances
  array[N] int<lower=1, upper=V> w;    // word n
  array[N] int<lower=1, upper=M> doc;  // doc ID for word n
  vector<lower=0>[K] alpha;            // topic prior
  vector<lower=0>[V] beta;             // word prior
}
parameters {
  simplex[K] theta[M];   // topic dist for doc m
  simplex[V] phi[K];     // word dist for topic k
}
model {
  for (m in 1:M)
    theta[m] ~ dirichlet(alpha);  // prior on doc-topic distributions
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);     // prior on topic-word distributions
  for (n in 1:N) {
    vector[K] gamma;
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);  // marginalize out the topic of word n
  }
}
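
For the likelihood part of my question, would something like the generated quantities block below be the right way to get it out of the fit? This is just my untested guess, mirroring the loop in the model block (log_lik is a name I made up, not anything Stan requires); I think perplexity could then be computed from the draws as exp(-sum(log_lik) / N).

generated quantities {
  vector[N] log_lik;  // pointwise log-likelihood of each word instance
  for (n in 1:N) {
    vector[K] lp;
    for (k in 1:K)
      lp[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    log_lik[n] = log_sum_exp(lp);  // same marginalization as the model block
  }
}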

Hi, this is rather outside my expertise, but I noticed @Bob_Carpenter has been involved in a few LDA-related topics here, so hopefully he's not too busy and can provide some insight.

My overall impression from skimming the forums, however, is that LDAs are hard :-(

I work with LDAs in ecology (basically taking research sites and turning them into a corpus). If you can post the whole R code, I can take a look at it. When I suspect my LDA setup is off, I sometimes run it through the topicmodels package in R, and through a Python implementation as well, just to double-check.


Oh, what are your priors set to?
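
Dirichlet concentrations below 1 tend to favor sparse simplexes, which can help topics separate on short documents. Since alpha and beta come into your program as data, you could just pass in smaller values, or hard-code them with something like this untested sketch (the 0.1 and 0.01 are illustrative guesses, not tuned values):

transformed data {
  // illustrative sparse priors: concentrations below 1 push each
  // document toward fewer topics and each topic toward fewer words
  vector[K] alpha_sparse = rep_vector(0.1, K);
  vector[V] beta_sparse = rep_vector(0.01, V);
}

You'd then use dirichlet(alpha_sparse) and dirichlet(beta_sparse) in the model block.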