I am interested in relating LDA to an outcome variable (building the outcome into the generative model, rather than using LDA outputs as features for a downstream classification/regression task).
The generative process I am thinking of is:

for each topic k:
    draw \phi_k \sim Dir(\beta)
for each document d:
    draw \theta_d \sim Dir(\alpha)
    for each word i in document d:
        draw z_i \sim Cat(\theta_d)
        draw w_i \sim Cat(\phi_{z_i})
    draw outcome y_d \sim bernoulli_logit(a + \lambda^T \theta_d)
I get an error in the last line of the model block. It also seems that the last loop, over the documents 1 to M, should be unnecessary if the sampling statement can be vectorized. For the text data, I followed the long format from the naive Bayes section of the Stan User's Guide: https://mc-stan.org/docs/2_18/stan-users-guide/naive-bayes-classification-and-clustering.html
The relevant variables for the program are N, the total number of words in all the documents; the word array w; and the document identity array doc.
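To make that format concrete, here is a toy example of what I pass in (hypothetical corpus): with a vocabulary of size V = 3 and two documents d1 = (3, 1, 1) and d2 = (2, 3), the data would be

```
N   = 5
w   = {3, 1, 1, 2, 3}   // word tokens, concatenated across documents
doc = {1, 1, 1, 2, 2}   // document each token belongs to
```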
So, I am not entirely sure how I should modify the last line of my model.
My code is the following:
data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  int<lower=0,upper=1> y[M];    // binary outcome for doc m
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta[M];  // topic dist for doc m
  simplex[V] phi[K];    // word dist for topic k
  real a;               // outcome intercept
  real lambda[K];       // outcome coefficients
}
model {
  for (m in 1:M)
    theta[m] ~ dirichlet(alpha);  // prior
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);     // prior
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);  // marginalized word likelihood
  }
  for (m in 1:M)
    y[m] ~ bernoulli_logit(a + lambda * theta[m]);  // issue here
}
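My current guess at a fix (I am not sure whether this is right or idiomatic) is that `*` is not defined between a `real[]` array and a `vector`, so `lambda` should be declared as a `row_vector[K]`; then `lambda * theta[m]` is a row-vector–vector product yielding a scalar. Collecting the per-document logits into a vector would also let the sampling statement be vectorized, removing the explicit loop:

```stan
parameters {
  simplex[K] theta[M];   // topic dist for doc m
  simplex[V] phi[K];     // word dist for topic k
  real a;                // outcome intercept
  row_vector[K] lambda;  // row_vector, so lambda * theta[m] is a scalar
}
model {
  vector[M] eta;  // per-document logits
  // ... priors and marginalized word likelihood as above ...
  for (m in 1:M)
    eta[m] = a + lambda * theta[m];
  y ~ bernoulli_logit(eta);  // vectorized outcome likelihood
}
```

An alternative, if I keep `vector[K] lambda`, would presumably be `dot_product(lambda, theta[m])` inside the loop. Does either of these look correct?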