Hi all
The Stan user manual provides guidance on a few ways of classifying unlabeled data after training a model with labeled data. It describes the setup this way, building on the labeled training model code, which the manual provides:
A second document collection is declared as data, but without the category labels, leading to new variables M2, N2, w2, doc2. The number of categories and number of words, as well as the hyperparameters, are shared and only declared once. Similarly, there is only one set of parameters. Then the model contains a single set of statements for the prior, a set of statements for the labeled data, and a set of statements for the unlabeled data.
I believe I did that part right… my model code is below:
data {
  // training data
  int<lower=1> K;                   // num topics
  int<lower=1> V;                   // num words
  int<lower=0> M;                   // num docs
  int<lower=0> N;                   // total word instances
  int<lower=1,upper=K> z[M];        // topic for doc m
  int<lower=1,upper=V> w[N];        // word n
  int<lower=1,upper=M> doc[N];      // doc ID for word n
  // unlabeled data
  int<lower=0> N2;                  // total word instances
  int<lower=0> M2;                  // num docs
  int<lower=1,upper=V> w2[N2];      // word n
  int<lower=1,upper=M2> doc2[N2];   // doc ID for word n
  // hyperparameters
  vector<lower=0>[K] alpha;         // topic prior
  vector<lower=0>[V] beta;          // word prior
}
parameters {
  simplex[K] theta;                 // topic prevalence
  simplex[V] phi[K];                // word dist for topic k
}
model {
  real gamma[M2, K];
  // priors
  theta ~ dirichlet(alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);
  // likelihood, including latent category
  for (m in 1:M)
    z[m] ~ categorical(theta);
  for (n in 1:N)
    w[n] ~ categorical(phi[z[doc[n]]]);
  // unlabeled data
  for (m in 1:M2)
    for (k in 1:K)
      gamma[m, k] = categorical_lpmf(k | theta);
  for (n in 1:N2)
    for (k in 1:K)
      gamma[doc2[n], k] = gamma[doc2[n], k] + categorical_lpmf(w2[n] | phi[k]);
  for (m in 1:M2)
    target += log_sum_exp(gamma[m]);
}
I made a very simple data sample to supplement what’s in the training example:
# docs
M2 <- 4
# total words
N2 <- 40
# unlabeled word sample
w2 <- c(1L, 7L, 9L, 5L, 5L, 9L, 5L, 5L, 4L, 8L,
        1L, 9L, 7L, 6L, 7L, 7L, 7L, 9L, 6L, 8L,
        5L, 1L, 5L, 1L, 4L, 9L, 9L, 7L, 2L, 7L,
        5L, 5L, 6L, 5L, 6L, 1L, 5L, 6L, 3L, 6L)
doc2 <- c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L)
What I’m stuck on is the other part in the manual, which describes how to shift things around to get posterior probabilities:
If the variable gamma were declared and defined in the transformed parameter block,
its sampled values would be saved by Stan. The normalized posterior probabilities
could also be defined as generated quantities
It also says:
An alternative to full Bayesian inference involves estimating a model using labeled
data, then applying it to unlabeled data without updating the parameter estimates
based on the unlabeled data. This behavior can be implemented by moving the definition of gamma for the unlabeled documents to the generated quantities block. Because
the variables no longer contribute to the log probability, they no longer jointly contribute to the estimation of the model parameters
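Here is my rough sketch of the first (full Bayesian) option: gamma moved to the transformed parameters block so its values get saved, the log_sum_exp term kept in the model block so the unlabeled docs still inform theta and phi, and the normalized probabilities computed in generated quantities. The softmax/to_vector step and the name p_z2 are my own guesses, not taken from the manual:

data {
  // same data as the model above: labeled docs, unlabeled docs, hyperparameters
  int<lower=1> K;
  int<lower=1> V;
  int<lower=0> M;
  int<lower=0> N;
  int<lower=1,upper=K> z[M];
  int<lower=1,upper=V> w[N];
  int<lower=1,upper=M> doc[N];
  int<lower=0> N2;
  int<lower=0> M2;
  int<lower=1,upper=V> w2[N2];
  int<lower=1,upper=M2> doc2[N2];
  vector<lower=0>[K] alpha;
  vector<lower=0>[V] beta;
}
parameters {
  simplex[K] theta;
  simplex[V] phi[K];
}
transformed parameters {
  // log score of category k for each unlabeled doc; saved because it is
  // declared here rather than as a local variable in the model block
  real gamma[M2, K];
  for (m in 1:M2)
    for (k in 1:K)
      gamma[m, k] = categorical_lpmf(k | theta);
  for (n in 1:N2)
    for (k in 1:K)
      gamma[doc2[n], k] = gamma[doc2[n], k] + categorical_lpmf(w2[n] | phi[k]);
}
model {
  theta ~ dirichlet(alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);
  for (m in 1:M)
    z[m] ~ categorical(theta);
  for (n in 1:N)
    w[n] ~ categorical(phi[z[doc[n]]]);
  // unlabeled docs still contribute: sum over the latent category on the log scale
  for (m in 1:M2)
    target += log_sum_exp(gamma[m]);
}
generated quantities {
  simplex[K] p_z2[M2];   // normalized posterior category probabilities (my name)
  for (m in 1:M2)
    p_z2[m] = softmax(to_vector(gamma[m]));
}

If that is roughly right, each saved p_z2[m] should be the posterior category distribution for unlabeled doc m.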
Any help putting this all together, either way, is appreciated. I’ve posted my attempt at the latter below and will continue to dig in and update… I’m doing this as an exercise and realize Naive Bayes has its limitations.
This model throws an assignment error (left-hand vs. right-hand side mismatch) when I try to compute the log_sum_exp term. I’m not sure I even need that line, though? I don’t understand what it adds in the context of Naive Bayes.
data {
  // training data
  int<lower=1> K;                   // num topics
  int<lower=1> V;                   // num words
  int<lower=0> M;                   // num docs
  int<lower=0> N;                   // total word instances
  int<lower=1,upper=K> z[M];        // topic for doc m
  int<lower=1,upper=V> w[N];        // word n
  int<lower=1,upper=M> doc[N];      // doc ID for word n
  // unlabeled data
  int<lower=0> N2;                  // total word instances
  int<lower=0> M2;                  // num docs
  int<lower=1,upper=V> w2[N2];      // word n
  int<lower=1,upper=M2> doc2[N2];   // doc ID for word n
  // hyperparameters
  vector<lower=0>[K] alpha;         // topic prior
  vector<lower=0>[V] beta;          // word prior
}
parameters {
  simplex[K] theta;                 // topic prevalence
  simplex[V] phi[K];                // word dist for topic k
}
model {
  // priors
  theta ~ dirichlet(alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);
  // likelihood, including latent category
  for (m in 1:M)
    z[m] ~ categorical(theta);
  for (n in 1:N)
    w[n] ~ categorical(phi[z[doc[n]]]);
}
generated quantities {
  real gamma[M2, K];
  for (m in 1:M2)
    for (k in 1:K)
      gamma[m, k] = categorical_lpmf(k | theta);
  for (n in 1:N2)
    for (k in 1:K)
      gamma[doc2[n], k] += categorical_lpmf(w2[n] | phi[k]);
  // Should I return a vector with the index value k of the posterior mode for each m? (rough guess below)
}
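For the question in that last comment, my guess is that the normalized probabilities (and, if wanted, the most probable category) can be computed right in generated quantities, i.e. replacing the block above with something like this; the names p_z2 and z2_mode are mine:

generated quantities {
  real gamma[M2, K];
  simplex[K] p_z2[M2];                // normalized posterior probs per unlabeled doc (my name)
  int<lower=1, upper=K> z2_mode[M2];  // index of the most probable category (my name)
  for (m in 1:M2)
    for (k in 1:K)
      gamma[m, k] = categorical_lpmf(k | theta);
  for (n in 1:N2)
    for (k in 1:K)
      gamma[doc2[n], k] += categorical_lpmf(w2[n] | phi[k]);
  for (m in 1:M2) {
    p_z2[m] = softmax(to_vector(gamma[m]));  // exponentiate and renormalize the log scores
    z2_mode[m] = 1;
    for (k in 2:K)
      if (p_z2[m][k] > p_z2[m][z2_mode[m]])
        z2_mode[m] = k;
  }
}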
Tagging @Bob_Carpenter, who has taught this material.