Trying to add a covariate in LDA topic model

Youjiajia · December 19, 2019, 7:16am

I want to include a covariate into a LDA model with meta for each documents. For example, the age of the person writing the document.

The generative process I am thinking about is:
for each topic k:
draw \lambda_k \sim Normal(0,1)
draw \phi_k \sim Dir(\beta)
for each document d:
for each topic k, \alpha_{dk} = exp(X_d^T \lambda_k)
draw \theta_d \sim Dir(\alpha_d)
for each word i,
draw z_i \sim Cat(\theta_d)
draw w_i \sim Cat(\phi_{z_i})

This is a simplified version of ( Topic Models Conditioned on Arbitrary Features) https://mimno.infosci.cornell.edu/papers/dmr-uai.pdf
My code is below (my model runs, but I feel I am putting the \theta_m variable in the wrong place). I want to know if people in different ages speaks differently. So, I am thinking the model written this way seems fine. But I do not know where to declare \theta, if I declare it in the model part, Stan gives me an error saying that there is no simplex. But if I put in the parameters part, then I will have both \theta and \lambda. This does not look right to me. Would greatly appreciate any insight!!!

data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  vector<lower=1>[M] G;         // gender information
 // vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
  real sigma;                   // simplify
}
parameters {
  simplex[V] phi[K];     // word dist for topic k  :827
  simplex[K] theta[M];  // topic dist
  real lambda[K]; // !!! not sure here what happens if have both parameters
}
model {
  for (k in 1:K){
    lambda[k] ~ normal(0, sigma); // here only gender info, so it is a scalor
    phi[k] ~ dirichlet(beta);    
  }
  
  for (m in 1:M){
    vector[K] alpha;
    for (k in 1:K){
       alpha[k] = exp(G[m] * lambda[k]);
    }
    theta[m] ~ dirichlet(alpha);  
  }
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);  
    }
}

martinmodrak · January 3, 2020, 12:28pm

Hi,
sorry for taking so long to respond. Also, I have no background in LDA models, so below are just my few guesses.

I am not sure I understand your concerns. Your data generating process involves both \theta and \lambda as parameters, so your model should generally do so as well. Such a model should work, but might be slower if d is large. You could however probably simplify the model and make it more efficient by integrating out \theta - if I read your model correctly, you should have z_i \sim \text{DirichletMultinomial} (\alpha_d).

However, please check my reasoning there is a bunch of stuff I do not understand about your model, for example, what is the role of z_i - it doesn’t seem to be handled in the model. Is it observable? If it is unknown you cannot directly handle it in Stan as parameter since it is dicsrete, but instead you would need to marginalize it out which might be tricky and very slow.

Hope that helps!

Topic		Replies	Views
Topic model with an outcome variable Modeling	1	511	January 3, 2020
Running LDA in Stan Modeling cmdstan , techniques , specification , example-models	2	436	July 12, 2023
Stan manual LDA model with trivial data reporting all divergent samples Modeling	1	934	February 6, 2018
Latent Dirichlet Allocation model Modeling	1	1228	August 14, 2017
LDA: topics do not separate Modeling	3	624	January 3, 2020

Trying to add a covariate in LDA topic model

Related topics