Naive Bayes: combining unlabeled and labeled data

specification

#1

Hi all

The Stan user manual provides guidance on a few ways of classifying unlabeled data after training a model on labeled data. It describes the approach this way, starting from the labeled-data model code (which the manual provides):

> A second document collection is declared as data, but without the category labels, leading to new variables M2, N2, w2, doc2. The number of categories and number of words, as well as the hyperparameters, are shared and only declared once. Similarly, there is only one set of parameters. Then the model contains a single set of statements for the prior, a set of statements for the labeled data, and a set of statements for the unlabeled data.

I believe I did that part right. My model code is below:

    data {
      // training data
      int<lower=1> K;               // num topics
      int<lower=1> V;               // num words
      int<lower=0> M;               // num docs
      int<lower=0> N;               // total word instances
      int<lower=1,upper=K> z[M];    // topic for doc m
      int<lower=1,upper=V> w[N];    // word n
      int<lower=1,upper=M> doc[N];  // doc ID for word n
      
      // unlabeled data
      int<lower=0> N2;               // total word instances
      int<lower=0> M2;               // num docs
      int<lower=1,upper=V> w2[N2];    // word n
      int<lower=1,upper=M2> doc2[N2];  // doc ID for word n
      
      // hyperparameters
      vector<lower=0>[K] alpha;     // topic prior
      vector<lower=0>[V] beta;      // word prior
    }
    parameters {
      simplex[K] theta;   // topic prevalence
      simplex[V] phi[K];  // word dist for topic k
    }
    model {
      real gamma[M2, K];
      // priors
      theta ~ dirichlet(alpha);
      for (k in 1:K)  
        phi[k] ~ dirichlet(beta);
      // likelihood, including latent category
      for (m in 1:M)
        z[m] ~ categorical(theta);
      for (n in 1:N)
        w[n] ~ categorical(phi[z[doc[n]]]);
      // unlabeled data
      for (m in 1:M2)
        for (k in 1:K)
          gamma[m, k] = categorical_lpmf(k | theta);
      for (n in 1:N2)
        for (k in 1:K)
          gamma[doc2[n], k] = gamma[doc2[n], k] + categorical_lpmf(w2[n] | phi[k]);
      for (m in 1:M2)
        target += log_sum_exp(gamma[m]);
    }

I made a very simple data sample to supplement what’s in the training example:

    # docs
    M2 <- 4
    # total words
    N2 <- 40

    # unlabeled word sample
    w2 <- c(1L, 7L, 9L, 5L, 5L, 9L, 5L, 5L, 4L,
            8L, 1L, 9L, 7L, 6L, 7L, 7L, 7L, 9L, 6L, 8L,
            5L, 1L, 5L, 1L, 4L, 9L, 9L, 7L, 2L, 7L,
            5L, 5L, 6L, 5L, 6L, 1L, 5L, 6L, 3L, 6L)
    doc2 <- c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
              2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
              3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
              4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L)

What I’m stuck on is the other part in the manual, which describes how to shift things around to get posterior probabilities:

> If the variable gamma were declared and defined in the transformed parameter block, its sampled values would be saved by Stan. The normalized posterior probabilities could also be defined as generated quantities.
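Concretely, I think the rearrangement the manual describes would look something like the sketch below (applied to my model above, keeping the log_sum_exp lines in the model block so the unlabeled data still updates the parameters; p_cat is just a name I made up for the normalized probabilities):

    transformed parameters {
      real gamma[M2, K];  // unnormalized log posterior over categories per doc
      for (m in 1:M2)
        for (k in 1:K)
          gamma[m, k] = categorical_lpmf(k | theta);
      for (n in 1:N2)
        for (k in 1:K)
          gamma[doc2[n], k] = gamma[doc2[n], k] + categorical_lpmf(w2[n] | phi[k]);
    }
    model {
      // ... priors and labeled-data likelihood as before ...
      for (m in 1:M2)
        target += log_sum_exp(gamma[m]);  // marginalize out the latent category
    }
    generated quantities {
      simplex[K] p_cat[M2];  // normalized posterior category probabilities
      for (m in 1:M2)
        p_cat[m] = softmax(to_vector(gamma[m]));
    }

Because gamma now lives in transformed parameters, its values get saved with each draw, and softmax normalizes the per-document log scores into probabilities.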

It also says:

> An alternative to full Bayesian inference involves estimating a model using labeled data, then applying it to unlabeled data without updating the parameter estimates based on the unlabeled data. This behavior can be implemented by moving the definition of gamma for the unlabeled documents to the generated quantities block. Because the variables no longer contribute to the log probability, they no longer jointly contribute to the estimation of the model parameters.

Any help putting this all together, either way, is appreciated. I’ve posted my attempt at the latter below and will continue to dig in and update. I’m doing this as an exercise and realize Naive Bayes has its limitations.

This model throws a right-hand/left-hand side assignment error when trying to compute the log_sum_exp. I’m not sure I need that line, though. I don’t understand what it adds in the context of Naive Bayes.

    data {
      // training data
      int<lower=1> K;               // num topics
      int<lower=1> V;               // num words
      int<lower=0> M;               // num docs
      int<lower=0> N;               // total word instances
      int<lower=1,upper=K> z[M];    // topic for doc m
      int<lower=1,upper=V> w[N];    // word n
      int<lower=1,upper=M> doc[N];  // doc ID for word n
      
      // unlabeled data
      int<lower=0> N2;               // total word instances
      int<lower=0> M2;               // num docs
      int<lower=1,upper=V> w2[N2];    // word n
      int<lower=1,upper=M2> doc2[N2];  // doc ID for word n
      
      // hyperparameters
      vector<lower=0>[K] alpha;     // topic prior
      vector<lower=0>[V] beta;      // word prior
    }
    parameters {
      simplex[K] theta;   // topic prevalence
      simplex[V] phi[K];  // word dist for topic k
    }
    model {
      // priors
      theta ~ dirichlet(alpha);
      for (k in 1:K)  
        phi[k] ~ dirichlet(beta);
      // likelihood, including latent category
      for (m in 1:M)
        z[m] ~ categorical(theta);
      for (n in 1:N)
        w[n] ~ categorical(phi[z[doc[n]]]);
    }
    generated quantities {
      real gamma[M2, K];
      for (m in 1:M2)
        for (k in 1:K)
          gamma[m, k] = categorical_lpmf(k | theta);
      for (n in 1:N2)
        for (k in 1:K)
          gamma[doc2[n], k] += categorical_lpmf(w2[n] | phi[k]);
      // Should I return a vector with the index value k of the posterior mode for each m?

    }

@Bob_Carpenter, who has taught this.


#2

Hi, sorry I don’t understand the problem deeply, but target += log_sum_exp(gamma[m]); won’t work in generated quantities: there, the target (posterior density) is already fixed. I think the line should be missing, as you are “applying it to unlabeled data without updating the parameter estimates based on the unlabeled data”, i.e. it seems that not involving gamma in the target density is intentional. In the full model, that line adds each unlabeled document’s log marginal likelihood (summing over the latent category), which is what lets the unlabeled data inform theta and phi; once you drop it, gamma is purely a generated quantity.


#3

@martinmodrak

> Hi, sorry I don’t understand the problem deeply, but target += log_sum_exp(gamma[m]); won’t work in generated quantities - there, the target (posterior density) is already fixed. I think the line should be missing as you are applying it to unlabeled data without updating the parameter estimates based on the unlabeled data, i.e. it seems that not involving gamma in the target density is intentional.

Hi, yes, that seems right. This was my first Stan model beyond simple linear/hierarchical modeling, so I’m just learning the ropes here.

It seems like I can actually just leave off that last line if I don’t care about P(gamma[m,]), and can instead look at the distribution of gamma[m, k] to determine the posterior mode for each document.

Alternatively, if I wanted P(gamma[m,]), I believe I can generate it with:

    for (m in 1:M2)
        gammas[m] = log_sum_exp(gamma[m]);

where gammas is defined at the top of the block as: real gammas[M2];
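As for returning the posterior mode directly: since the argmax of the unnormalized log scores is the same as the argmax of the normalized probabilities, something like the sketch below might work (assuming gamma is visible in generated quantities, e.g. after moving it to transformed parameters; z2 is just a name I made up):

    generated quantities {
      int<lower=1, upper=K> z2[M2];  // most probable category per doc, this draw
      for (m in 1:M2) {
        int best = 1;
        for (k in 2:K)
          if (gamma[m, best] < gamma[m, k])
            best = k;
        z2[m] = best;
      }
    }

Across draws, the frequencies of z2[m] would then show how stable the modal category is for document m under posterior uncertainty in theta and phi.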