Hello Everyone,
I am trying to run LDA using cmdstan, to do this, I am facing two problems:
1- what is the format of the data that is accepted by Stan to perform LDA. In other words how the input should be?
2- After running stan, I believe that the output will be samples from the posterior. The goal of LDA is to assign a topic for each word. How to move obtain topics for each word from those posterior samples.
The model I am using is the one presented in the documentation:
data {
int<lower=2> K; // num topics
int<lower=2> V; // num words
int<lower=1> M; // num docs
int<lower=1> N; // total word instances
int<lower=1,upper=V> w[N]; // word n
int<lower=1,upper=M> doc[N]; // doc ID for word n
vector<lower=0>[K] alpha; // topic prior
vector<lower=0>[V] beta; // word prior
}
parameters {
simplex[K] theta[M]; // topic dist for doc m
simplex[V] phi[K]; // word dist for topic k
}
model {
for (m in 1:M)
theta[m] ~ dirichlet(alpha); // prior
for (k in 1:K)
phi[k] ~ dirichlet(beta); // prior
for (n in 1:N) {
real gamma[K];
for (k in 1:K)
gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
target += log_sum_exp(gamma); // likelihood;
}
}