I want to include a covariate into a LDA model with meta for each documents. For example, the age of the person writing the document.
The generative process I am thinking about is:
for each topic k:
draw \lambda_k \sim Normal(0,1)
draw \phi_k \sim Dir(\beta)
for each document d:
for each topic k, \alpha_{dk} = exp(X_d^T \lambda_k)
draw \theta_d \sim Dir(\alpha_d)
for each word i,
draw z_i \sim Cat(\theta_d)
draw w_i \sim Cat(\phi_{z_i})
This is a simplified version of ( Topic Models Conditioned on Arbitrary Features) https://mimno.infosci.cornell.edu/papers/dmr-uai.pdf
My code is below (my model runs, but I feel I am putting the \theta_m variable in the wrong place). I want to know if people in different ages speaks differently. So, I am thinking the model written this way seems fine. But I do not know where to declare \theta, if I declare it in the model part, Stan gives me an error saying that there is no simplex. But if I put in the parameters part, then I will have both \theta and \lambda. This does not look right to me. Would greatly appreciate any insight!!!
data {
int<lower=2> K; // num topics
int<lower=2> V; // num words
int<lower=1> M; // num docs
int<lower=1> N; // total word instances
int<lower=1,upper=V> w[N]; // word n
int<lower=1,upper=M> doc[N]; // doc ID for word n
vector<lower=1>[M] G; // gender information
// vector<lower=0>[K] alpha; // topic prior
vector<lower=0>[V] beta; // word prior
real sigma; // simplify
}
parameters {
simplex[V] phi[K]; // word dist for topic k :827
simplex[K] theta[M]; // topic dist
real lambda[K]; // !!! not sure here what happens if have both parameters
}
model {
for (k in 1:K){
lambda[k] ~ normal(0, sigma); // here only gender info, so it is a scalor
phi[k] ~ dirichlet(beta);
}
for (m in 1:M){
vector[K] alpha;
for (k in 1:K){
alpha[k] = exp(G[m] * lambda[k]);
}
theta[m] ~ dirichlet(alpha);
}
for (n in 1:N) {
real gamma[K];
for (k in 1:K)
gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
target += log_sum_exp(gamma);
}
}