Naive Bayes Text Classification and LOO


#1

Hi all,

I am pretty new to rstan and I want to perform a Naive Bayes classification first.
Setting up the classifier was straightforward, since I could use the code provided in the documentation.
Here is the code:
data {
  // training data
  int<lower=1> K;               // num topics
  int<lower=1> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  int<lower=1,upper=K> z[M];    // topic for doc m
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  // hyperparameters
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta;   // topic prevalence
  simplex[V] phi[K];  // word dist for topic k
}
model {
  theta ~ dirichlet(alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);
  for (m in 1:M)
    z[m] ~ categorical(theta);
  for (n in 1:N)
    w[n] ~ categorical(phi[z[doc[n]]]);
}

However, I am not sure how to compute log_lik correctly in the generated quantities block so that I can run loo or waic.
Could anyone help me with that?

Furthermore, is it possible to calculate the accuracy of the model?
I haven’t found a way to do that so far.

Thanks for your help!
Sven


#2

Using the following generated quantities block throws an error message.
generated quantities {
  vector[M] log_lik;
  for (m in 1:M)
    log_lik[m] = categorical_lpmf(m | theta);
}

R-Code:
# Use = (not <-) inside list() so that each element is named.
nbdata <- list(
  K = 2,
  V = 23,
  M = 5644,
  N = 129812,
  z = c(topicvector),
  w = c(dataTest),
  doc = c(docvector),
  alpha = c(1, 1),
  beta = rep(1/23, 23)
)

Error Message:
[1] "Error in sampler$call_sampler(args_list[[i]]) : "
[2] " Exception thrown at line 30: categorical_log: Number of categories is 3, but must be in the interval [1, 2]"

There is no third category in the data, nor in the topicvector.


#3

There’s a discussion of naive Bayes in the manual. Before jumping into stats, I was doing natural language processing and wrote a blog post on why naive Bayes isn’t usually Bayesian and how to make it Bayesian.

Don’t worry—it’ll be Bayesian if you fit it with MCMC in Stan.

The line here is wrong for reasons other than lack of indenting (use triple grave accents (backticks) to set off code and use indentation):

parameters {
  simplex[K] theta;

generated quantities {
  vector[M] log_lik;
  for (m in 1:M)
    log_lik[m] = categorical_lpmf(m | theta);

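You should be able to see the problem. theta has K elements, but the outcome you pass to categorical_lpmf is the loop index m, which runs up to M. As soon as m exceeds K, you get the error above (category 3 with K = 2).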
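
One minimal way to make that line valid, assuming the intent was to score each document's observed topic label against the prevalence vector, is to pass z[m] (a value in 1:K) instead of the loop index m. This is only a sketch: it scores the labels and ignores the words entirely, so it may well not be the quantity you want for loo.

```stan
generated quantities {
  // Sketch: log probability of each document's observed topic label
  // under theta. The outcome is z[m], which lies in 1:K, not the index m.
  vector[M] log_lik;
  for (m in 1:M)
    log_lik[m] = categorical_lpmf(z[m] | theta);
}
```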


#4

What is the “thing” you want to predict and thus leave out in loo? A word? In that case, I think you should have in generated quantities
  for (n in 1:N)
    log_lik[n] = categorical_lpmf(w[n] | phi[z[doc[n]]]);

Aki
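
Written out as a complete block, the word-level suggestion above might look like the following sketch. The declaration vector[N] log_lik is an assumption added here, since the two-line snippet does not show one; it gives one pointwise term per word instance.

```stan
generated quantities {
  // One pointwise term per word instance, conditioning on the known
  // topic of the document that the word belongs to.
  vector[N] log_lik;
  for (n in 1:N)
    log_lik[n] = categorical_lpmf(w[n] | phi[z[doc[n]]]);
}
```

With N = 129812 word instances this stores a large vector per draw, which is worth keeping in mind before extracting it in R.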


#5

Aki—you want to assume you know the true label/tag/category/code z[doc[n]] and condition on that? I don’t see how that makes sense for loo.

What I’d be inclined to do is one of the following:

  1. Evaluate language model: Take the corpus of M documents, and for each one, estimate its probability conditioned on the other M - 1 documents. This treats z[doc[n]] as latent.

  2. Evaluate classifier: Take the corpus of M documents and the known labels for them, and for each one, estimate the probability the model assigns to the true label conditioned on the text and the other M - 1 documents.
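
The per-document quantities behind both options can be sketched in generated quantities as below. The names lp, log_lik_doc, and log_lik_label are hypothetical and not from the thread. The sketch assumes the model above: for document m, lp[m, k] accumulates log theta[k] plus the log probability of the document's words under topic k, so log_sum_exp over k gives the document's probability with the topic marginalized out (option 1), and normalizing gives the probability of the observed label given the text (option 2). How these terms should be passed to loo depends on exactly what is held out, which this sketch does not settle.

```stan
generated quantities {
  vector[M] log_lik_doc;    // option 1: log p(words of doc m), topic marginalized
  vector[M] log_lik_label;  // option 2: log p(z[m] | words of doc m)
  {
    // lp[m, k] = log theta[k] + sum over words n in doc m of log phi[k, w[n]]
    matrix[M, K] lp = rep_matrix(log(theta)', M);
    for (n in 1:N)
      for (k in 1:K)
        lp[doc[n], k] += log(phi[k, w[n]]);
    for (m in 1:M) {
      real lse = log_sum_exp(lp[m]);
      log_lik_doc[m] = lse;
      log_lik_label[m] = lp[m, z[m]] - lse;
    }
  }
}
```

The inner braces keep lp local, so the M by K matrix is not saved with every draw.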

Usually people in natural language focus on (2) or simple 0/1 loss variants for naive Bayes because naive Bayes isn’t a very effective language model compared to simple Markov models.
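
A 0/1 loss variant, which also speaks to the accuracy question in the first post, could be sketched as follows. The names accuracy, correct, best, and lp are hypothetical. Note the assumption: this scores the same documents the model was fit to, so it is in-sample accuracy per posterior draw, not a held-out estimate.

```stan
generated quantities {
  real accuracy;  // fraction of docs whose most probable topic equals z[m]
  {
    // lp[m, k] = log theta[k] + sum over words n in doc m of log phi[k, w[n]]
    matrix[M, K] lp = rep_matrix(log(theta)', M);
    int correct = 0;
    for (n in 1:N)
      for (k in 1:K)
        lp[doc[n], k] += log(phi[k, w[n]]);
    for (m in 1:M) {
      int best = 1;  // index of the highest-scoring topic so far
      for (k in 2:K)
        if (lp[m, k] > lp[m, best])
          best = k;
      if (best == z[m])
        correct += 1;
    }
    accuracy = correct * 1.0 / M;  // multiply by 1.0 to avoid integer division
  }
}
```

Averaging accuracy over draws gives a posterior mean; for held-out accuracy the test documents would need to be passed in as separate data.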


#6

Thanks, Bob. These make sense.

Aki