Naive Bayes Text Classification and LOO

Sven_Hormann · May 15, 2017, 9:11pm

Hi all,

I am pretty new to rstan and want to start with a naive Bayes classifier.
The classifier itself is already set up, since I could use the code provided in the documentation.
Here is the code:
data {
// training data
int<lower=1> K; // num topics
int<lower=1> V; // num words
int<lower=1> M; // num docs
int<lower=1> N; // total word instances
int<lower=1,upper=K> z[M]; // topic for doc m
int<lower=1,upper=V> w[N]; // word n
int<lower=1,upper=M> doc[N]; // doc ID for word n
// hyperparameters
vector<lower=0>[K] alpha; // topic prior
vector<lower=0>[V] beta; // word prior
}
parameters {
simplex[K] theta; // topic prevalence
simplex[V] phi[K]; // word dist for topic k
}
model {
theta ~ dirichlet(alpha);
for (k in 1:K)
phi[k] ~ dirichlet(beta);
for (m in 1:M)
z[m] ~ categorical(theta);
for (n in 1:N)
w[n] ~ categorical(phi[z[doc[n]]]);
}

However, I am not sure how to compute log_lik correctly in the generated quantities block so that I can run loo or waic.
Could anyone help me with that?

Furthermore, is it possible to calculate the accuracy of the model?
I haven’t found a way to do that so far.

Thanks for your help!
Sven

Sven_Hormann · May 16, 2017, 5:33pm

Trying the following generated quantities block throws an error:
generated quantities {
vector[M] log_lik;
for (m in 1:M)
log_lik[m] = categorical_lpmf(m | theta);
}

R-Code:
nbdata <- list(
  K = 2,
  V = 23,
  M = 5644,
  N = 129812,
  # use `=` so the list elements are named; `<-` inside list() leaves them unnamed
  z = c(topicvector),
  w = c(dataTest),
  doc = c(docvector),
  alpha = c(1, 1),
  # symmetric word prior: 23 entries of 1/23
  beta = rep(1/23, 23)
)

Error Message:
[1] "Error in sampler$call_sampler(args_list[[i]]) : "
[2] " Exception thrown at line 30: categorical_log: Number of categories is 3, but must be in the interval [1, 2]"

There is no third category in the data, nor in the topicvector.

Bob_Carpenter · May 16, 2017, 6:02pm

There’s a discussion of naive Bayes in the manual. Before jumping into stats, I was doing natural language processing and wrote a blog post on why naive Bayes isn’t usually Bayesian and how to make it Bayesian:

Don’t worry—it’ll be Bayesian if you fit it with MCMC in Stan.

The line here is wrong for reasons other than lack of indenting (use triple grave accents (backticks) to set off code and use indentation):

parameters {
  simplex[K] theta;

generated quantities {
  vector[M] log_lik;
  for (m in 1:M)
    log_lik[m] = categorical_lpmf(m | theta);

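You should be able to see the problem. theta is of size K, but you’re looping over size M and passing the loop index m as the outcome. With K = 2, the call fails as soon as m = 3, which is the "Number of categories is 3" in the error message.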
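
A minimal sketch of one way to make that snippet run, assuming the intent was to score each document's known label z[m] under theta rather than the loop index m. This only removes the range error. It ignores the words entirely, so it is not necessarily the pointwise quantity you want for loo.

```stan
generated quantities {
  // One term per document: log probability of its known topic label
  // z[m] under the topic prevalence theta (words are not used here).
  vector[M] log_lik;
  for (m in 1:M)
    // z[m] is declared in 1:K, matching the K-element simplex theta
    log_lik[m] = categorical_lpmf(z[m] | theta);
}
```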

avehtari · May 16, 2017, 8:23pm

What is the “thing” you want to predict and thus leave out in loo? A word? In that case, I think you should have the following in generated quantities:
for (n in 1:N)
  log_lik[n] = categorical_lpmf(w[n] | phi[z[doc[n]]]);

Aki

Bob_Carpenter · May 16, 2017, 11:07pm

Aki—you want to assume you know the true label/tag/category/code z[doc[n]] and condition on that? I don’t see how that makes sense for loo.

What I’d be inclined to do is one of the following:

1. Evaluate language model: Take the corpus of N documents, and for each one, estimate its probability conditioned on the other N - 1 documents. This assumes a latent z[doc[n]].
2. Evaluate classifier: Take the corpus of N documents and the known labels for them, and for each one, estimate the probability it assigns to the true label conditioned on the text and other N - 1 documents.

Usually people in natural language focus on (2) or simple 0/1 loss variants for naive Bayes because naive Bayes isn’t a very effective language model compared to simple Markov models.
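
The per-document quantities described above could be sketched roughly as follows, using the data and parameter names from the model earlier in the thread (documents are indexed m in 1:M there). The names log_lik, log_p_label, and hit are hypothetical, chosen only for illustration. The sketch assumes the unit being left out is a whole document. All values are computed in-sample from the fitted posterior, so they are inputs to an approximate leave-one-out computation, not exact held-out results.

```stan
generated quantities {
  // Document m's full contribution to the fitted likelihood:
  // log p(z[m], words of doc m | theta, phi)
  vector[M] log_lik;
  // Classifier score: log p(z[m] | words of doc m, theta, phi)
  vector[M] log_p_label;
  // 0/1 variant: 1 if the true label has the highest score in this draw
  int<lower=0, upper=1> hit[M];
  {
    // lp[m][k] = log theta[k] + sum of log phi[k, w[n]] over words n in doc m
    vector[K] lp[M];
    for (m in 1:M)
      lp[m] = log(theta);
    for (n in 1:N)
      for (k in 1:K)
        lp[doc[n], k] = lp[doc[n], k] + log(phi[k, w[n]]);
    for (m in 1:M) {
      log_lik[m] = lp[m][z[m]];
      // log_sum_exp(lp[m]) is the document's probability with z[m]
      // marginalized out, i.e. the language-model view
      log_p_label[m] = lp[m][z[m]] - log_sum_exp(lp[m]);
      hit[m] = lp[m][z[m]] >= max(lp[m]);
    }
  }
}
```

Averaging hit over documents and posterior draws gives an in-sample accuracy estimate, and exp(log_p_label[m]) is the probability assigned to the true label. The array declarations follow the syntax used in the thread. Recent Stan versions would write `array[M] vector[K] lp;` and `array[M] int<lower=0, upper=1> hit;` instead.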

avehtari · May 17, 2017, 7:38am

Thanks, Bob. These make sense.

Aki
