Topic model with an outcome variable

I am interested in relating LDA to an outcome variable, building the outcome into the generative model rather than using the LDA outputs as features for a separate classification or regression task.

The generative process I am thinking of is:
for each topic k:
    draw \phi_k \sim Dir(\beta)
for each document d:
    draw \theta_d \sim Dir(\alpha)
    for each word i:
        draw z_i \sim Cat(\theta_d)
        draw w_i \sim Cat(\phi_{z_i})
    draw outcome y_d \sim bernoulli_logit(a + \lambda * \theta_d)

I get an error in the last line of the model part.

It also seems that the last loop in my model block, the one going over 1 to M, is unnecessary. For the text data, however, I followed the format in the Stan User's Guide (https://mc-stan.org/docs/2_18/stan-users-guide/naive-bayes-classification-and-clustering.html), which says:

The relevant variables for the program are N, the total number of words in all the documents, the word array w, and the document identity array doc.

So, I am not entirely sure how I should modify the last line of my model.

My code is the following:

data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  int y[M]; // outcome variable 
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta[M];   // topic dist for doc m:   169
  simplex[V] phi[K];     // word dist for topic k  :827
  real a;
  real lambda[K];
}
model {
  for (m in 1:M)
    theta[m] ~ dirichlet(alpha);  // prior
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);     // prior
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);  // likelihood;
  }
  for (m in 1:N)
    y[m] ~  bernoulli_logit(a + lambda * theta[m]); // issue here
}

Please always post the full error message; it makes things easier for us to investigate. In any case, this looks like a type mismatch. Stan is picky about vector and matrix arithmetic and distinguishes between row and column vectors. You declared lambda as an array (real lambda[K]), and the * operator is not defined between an array and a vector. At first glance it seems you want lambda to be a row_vector; then lambda * theta[m] is a row vector times a column vector, which is a dot product and returns a real. Some more info can be found at https://mc-stan.org/docs/2_19/reference-manual/arithmetic-expressions-section.html
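
For what it's worth, here is a minimal sketch of how the program might look with that change. It is untested against your data and rests on a few assumptions of mine: I declared lambda as a vector and used dot_product (declaring it as a row_vector and keeping lambda * theta[m] should work equally well), I changed the last loop to run over 1:M rather than 1:N because y has only M elements, and I added 0/1 bounds on y since bernoulli_logit expects a binary outcome. The array declarations keep the older syntax from your post; recent Stan releases write these as array[N] int w and so on.

```stan
data {
  int<lower=2> K;                  // num topics
  int<lower=2> V;                  // num words
  int<lower=1> M;                  // num docs
  int<lower=1> N;                  // total word instances
  int<lower=1,upper=V> w[N];       // word n
  int<lower=1,upper=M> doc[N];     // doc ID for word n
  int<lower=0,upper=1> y[M];       // binary outcome for doc m (bounds are my assumption)
  vector<lower=0>[K] alpha;        // topic prior
  vector<lower=0>[V] beta;         // word prior
}
parameters {
  simplex[K] theta[M];             // topic dist for doc m
  simplex[V] phi[K];               // word dist for topic k
  real a;                          // intercept
  vector[K] lambda;                // topic coefficients: a vector, not real[K]
}
model {
  for (m in 1:M)
    theta[m] ~ dirichlet(alpha);   // prior
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);      // prior
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);  // word likelihood, summing over the topic of word n
  }
  for (m in 1:M)                   // one outcome per document, so loop over M, not N
    y[m] ~ bernoulli_logit(a + dot_product(lambda, theta[m]));
}
```

One thing to keep in mind: each theta[m] sums to 1, so adding a constant to a and subtracting it from every element of lambda gives the same linear predictor. You will probably want priors on a and lambda, or to drop the intercept, so that the outcome part is identified.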

Best of luck with your model!
