I am interested in relating LDA to an outcome variable (building the outcome into the generative model, rather than using LDA outputs as features for a downstream classification/regression task).
The generative process I am thinking of is:

for each topic k:
    draw \phi_k \sim Dir(\beta)
for each document d:
    draw \theta_d \sim Dir(\alpha)
    for each word i in document d:
        draw z_i \sim Cat(\theta_d)
        draw w_i \sim Cat(\phi_{z_i})
    draw outcome y_d \sim bernoulli_logit(a + \lambda^T \theta_d)
I get an error in the last line of the model block. It also seems that the last loop, over the documents 1 to M, should be unnecessary if the sampling statement can be vectorized. For the text data, I followed the long format from the naive Bayes section of the Stan User's Guide: https://mc-stan.org/docs/2_18/stan-users-guide/naive-bayes-classification-and-clustering.html
The relevant variables for the program are N, the total number of words in all the documents; the word array w; and the document identity array doc.
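To make that format concrete, here is a toy example of what I pass in (hypothetical corpus): with a vocabulary of size V = 3 and two documents d1 = (3, 1, 1) and d2 = (2, 3), the data would be

```
N   = 5
w   = {3, 1, 1, 2, 3}   // word tokens, concatenated across documents
doc = {1, 1, 1, 2, 2}   // document each token belongs to
```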
So, I am not entirely sure how I should modify the last line of my model.
My code is the following:
data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  int<lower=0,upper=1> y[M];    // binary outcome for doc m
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta[M];  // topic dist for doc m
  simplex[V] phi[K];    // word dist for topic k
  real a;               // outcome intercept
  real lambda[K];       // outcome coefficients
}
model {
  for (m in 1:M)
    theta[m] ~ dirichlet(alpha);  // prior
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);     // prior
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);  // marginalized word likelihood
  }
  for (m in 1:M)
    y[m] ~ bernoulli_logit(a + lambda * theta[m]);  // issue here
}
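My current guess at a fix (I am not sure whether this is right or idiomatic) is that `*` is not defined between a `real[]` array and a `vector`, so `lambda` should be declared as a `row_vector[K]`; then `lambda * theta[m]` is a row-vector–vector product yielding a scalar. Collecting the per-document logits into a vector would also let the sampling statement be vectorized, removing the explicit loop:

```stan
parameters {
  simplex[K] theta[M];   // topic dist for doc m
  simplex[V] phi[K];     // word dist for topic k
  real a;                // outcome intercept
  row_vector[K] lambda;  // row_vector, so lambda * theta[m] is a scalar
}
model {
  vector[M] eta;  // per-document logits
  // ... priors and marginalized word likelihood as above ...
  for (m in 1:M)
    eta[m] = a + lambda * theta[m];
  y ~ bernoulli_logit(eta);  // vectorized outcome likelihood
}
```

An alternative, if I keep `vector[K] lambda`, would presumably be `dot_product(lambda, theta[m])` inside the loop. Does either of these look correct?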