Naive classifier implementation

Question

How can I code, in Stan, a naive classifier model with both categorical and continuous predictors?

Data

The task is to predict penguin species given bill length and flipper length. Three species of penguins are observed; bill and flipper length are both continuous.

data("penguins_bayes",package="bayesrules")
head(penguins_bayes)

Naive classifier in R

I was able to implement a naive conditional independence model in R:

library(dplyr)  # for %>%, select(), group_by(), summarise()

df_2p <- penguins_bayes %>% select(species, bill_length_mm, flipper_length_mm)
py <- table(df_2p$species) / nrow(df_2p)
theta_x1 <- df_2p %>% group_by(species) %>% summarise(
  Mean = mean(bill_length_mm, na.rm = TRUE),
  SD = sd(bill_length_mm, na.rm = TRUE)
)
theta_x2 <- df_2p %>% group_by(species) %>% summarise(
  Mean = mean(flipper_length_mm, na.rm = TRUE),
  SD = sd(flipper_length_mm, na.rm = TRUE)
)
# observed a penguin with bill of 50 mm and flipper of 195 mm
# what's the probability it belongs to each of the three species?
our_penguin <- data.frame(bill_length_mm = 50, flipper_length_mm = 195)
# for example, the probability our penguin is the third species
py[3] * dnorm(our_penguin$bill_length_mm[1],theta_x1$Mean[3],theta_x1$SD[3]) *
  dnorm(our_penguin$flipper_length_mm[1],theta_x2$Mean[3],theta_x2$SD[3]) /
  (
    py[1] * dnorm(our_penguin$bill_length_mm[1],theta_x1$Mean[1],theta_x1$SD[1]) *
      dnorm(our_penguin$flipper_length_mm[1],theta_x2$Mean[1],theta_x2$SD[1]) +
    py[2] * dnorm(our_penguin$bill_length_mm[1],theta_x1$Mean[2],theta_x1$SD[2]) *
      dnorm(our_penguin$flipper_length_mm[1],theta_x2$Mean[2],theta_x2$SD[2]) +
    py[3] * dnorm(our_penguin$bill_length_mm[1],theta_x1$Mean[3],theta_x1$SD[3]) *
      dnorm(our_penguin$flipper_length_mm[1],theta_x2$Mean[3],theta_x2$SD[3])
  )  
# the e1071 package has a naiveBayes() function for naive Bayes classification
naive_model_2p <- e1071::naiveBayes(species ~ bill_length_mm + flipper_length_mm,
                                    data = df_2p)
predict(naive_model_2p, newdata = our_penguin, type = "raw")
# my result is consistent with e1071

Implementation in Stan

The above implementation assumes a normal distribution for each continuous predictor conditional on each level of y, with the mean and standard deviation taken to be the sample mean and standard deviation.
I would like more flexible specifications, and also to integrate categorical predictors.
To start, I want to replicate the same assumption in Stan.

data {
  int<lower=0> N;                          // number of observations
  int D_y;                                 // number of levels of categorical y
  array[N] int<lower=1,upper=D_y> y;       // y values
  vector[D_y] p_y;                         // marginal probability of y
  int D_x_con;                             // number of continuous predictors x
  matrix[N,D_x_con] x_con;                 // continuous x
}
parameters {
  // assume a normal distribution on x, conditional on each level of y
  array[D_y] vector[D_x_con] mu_x;              // mean
  array[D_y] vector<lower=0>[D_x_con] sigma_x;  // standard deviation
  simplex[D_y] theta_y;                         // categorical probability
}
model {
  array[D_y] vector[D_x_con] Lpx_y;        // log conditional probability
  vector[D_y] Lpxy;                        // marginal log likelihood
  for(n in 1:N){
    int dy;                                // conditioning on y level
    dy = y[n];
    for(dx in 1:D_x_con){
      Lpx_y[dy,dx] = normal_lpdf(x_con[n,dx] | mu_x[dy,dx], sigma_x[dy,dx]);    // calculate conditional probability
    // Lpx_y[dy] = normal_lpdf(x_con[n] | mu_x[dy], sigma_x[dy]);
    }
    Lpxy[dy] = log(p_y[dy]) + sum(Lpx_y[dy]);                                   // naive assumption of conditional independence
    theta_y[dy] = exp(Lpxy[dy] - sum(Lpxy));                                    // calculate posterior probability
  }
  y ~ categorical(theta_y);                                               
}

Syntax error

A simplex declaration of theta_y cannot be made in the model block;
but if theta_y is declared in the parameters block, then it is a global variable,
and the computation at line 26 of the program above (theta_y[dy] = exp(...)) cannot be assigned to a global variable from a previous block.
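
A minimal sketch of the two failure modes (my illustration, not compiler output):

// neither variant compiles:
model {
  simplex[D_y] theta_y;   // constrained local declaration: not allowed in the model block
}
// ...and with theta_y declared in the parameters block instead,
// the model block cannot assign to it:
//   theta_y[dy] = exp(Lpxy[dy] - sum(Lpxy));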

This looks like a likelihood function, not a Bayesian model. The distribution here is computed completely deterministically from the data and the selected parameters, but the parameters have no priors over them, so there is nothing to sample. In your model block, you’ll want:

model {
  mu_x ~ distribution();
  sigma_x ~ distribution();
  target += categorical_lpmf(y | function(D_y, D_x_con, N, y, x_con, mu_x, sigma_x, p_y));
}

But with the prior distributions you want instead of distribution(), where the function, which you should rename to something unique, reads in all of the necessary data and parameters. Then in your functions block:

functions {
//function to calculate theta_y
simplex function(D_y,D_x_con,N,y,x_con,mu_x,sigma_x,p_y) {
  simplex[D_y] theta_y;                    // categorical probability
  array[D_y] vector[D_x_con] Lpx_y;        // log conditional probability
  vector[D_y] Lpxy;                        // marginal log likelihood
  for(n in 1:N){
    int dy;                                // conditioning on y level
    dy = y[n];
    for(dx in 1:D_x_con){
      Lpx_y[dy,dx] = normal_lpdf(x_con[n,dx] | mu_x[dy,dx], sigma_x[dy,dx]);    // calculate conditional probability
    // Lpx_y[dy] = normal_lpdf(x_con[n] | mu_x[dy], sigma_x[dy]);
    }
    Lpxy[dy] = log(p_y[dy]) + sum(Lpx_y[dy]);                                   // naive assumption of conditional independence
    theta_y[dy] = exp(Lpxy[dy] - sum(Lpxy));                                    // calculate posterior probability
  }
  return theta_y;
 }
}

But with the unique function name you selected, and replacing all of the global variables with local ones. Your parameters block should then look like:

parameters {
  // assume a normal distribution on x, conditional on each level of y
  array[D_y] vector[D_x_con] mu_x;              // mean
  array[D_y] vector<lower=0>[D_x_con] sigma_x;  // standard deviation
}

I haven’t tested it, but, assuming the model itself works, this is what I think you need to do. Also keep in mind that there may be much more computationally efficient ways to implement this, but I haven’t shown any of that here.

Edit: I should point out that what you’re asking the sampler to do here is find the posterior distribution of mu_x and sigma_x, the parameters of interest that you’re hoping to get estimates of back from the model.

Thanks for the idea! Indeed, I forgot to put priors on the parameters mu_x and sigma_x. With some priors, the model block should look like:

model {
  for(dy in 1:D_y){
    mu_x[dy] ~ std_normal();
    sigma_x[dy] ~ gamma(2,0.1);
  }
  // likelihood functions
  ...                                              
}
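
For completeness, here is one way the elided likelihood line could be written, calling the naive_con function defined below (a sketch I have not tested):

  // categorical likelihood with class probabilities computed by naive_con (defined below)
  target += categorical_lpmf(y | naive_con(D_y, D_x_con, N, y, x_con, p_y, mu_x, sigma_x));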

For the likelihood function, the only obstacle is the declaration of the unit simplex. A simplex is by definition a vector with constraints, and it is indeed coded as such in Stan. The constraints are not explicitly stated in the declaration, but they exist and are validated afterwards.
Therefore, I recall what I heard from Andrew earlier: [Parser requires integer length for vector after supplying integer length - #2 by andrjohns]

Constraints aren’t used in the model block

The same problem persists even if the likelihood function is moved to the functions block. To the best of my knowledge, declaration with constraints is allowed in the data, parameters, and generated quantities blocks.
Indeed, when the declaration of theta_y (and of the function's return type, if the likelihood function is moved to the functions block) is changed from simplex to vector, the code now compiles. Yet this voids the constraints on theta_y.

functions{
  vector naive_con(int D_y, int D_x_con, int N, array[] int y, matrix x_con, vector p_y, array[] vector mu_x, array[] vector sigma_x) {
      vector[D_y] theta_y;                    // categorical probability
      array[D_y] vector[D_x_con] Lpx_y;        // log conditional probability
      vector[D_y] Lpxy;                        // marginal log likelihood
      for(n in 1:N){
        int dy;                                // conditioning on y level
        dy = y[n];
        for(dx in 1:D_x_con){
          Lpx_y[dy,dx] = normal_lpdf(x_con[n,dx] | mu_x[dy,dx], sigma_x[dy,dx]);    // calculate conditional probability
          // Lpx_y[dy] = normal_lpdf(x_con[n] | mu_x[dy], sigma_x[dy]);
        }
        Lpxy[dy] = log(p_y[dy]) + sum(Lpx_y[dy]);                                   // naive assumption of conditional independence
        theta_y[dy] = exp(Lpxy[dy] - sum(Lpxy));                                    // calculate posterior probability
      }
      return theta_y;
  }
}

I am trying to come up with some post-declaration methods to constrain theta_y.
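
One candidate, as a sketch (my suggestion, untested): Stan’s built-in softmax returns a vector that is non-negative and sums to one by construction, so the exponentiate-and-normalize step can be collapsed into a single call. Note that softmax normalizes by log_sum_exp(Lpxy), which is the correct normalizer on the log scale, rather than by sum(Lpxy):

      // inside naive_con, replacing the per-element exp step:
      theta_y = softmax(Lpxy);    // = exp(Lpxy - log_sum_exp(Lpxy)), sums to 1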

I’m not sure what you mean by a naive classifier. Do you mean naive Bayes? Ironically, naive Bayes isn’t usually implemented with Bayesian inference in the machine learning world.

It’s usually better to build a logistic regression, for which there is a lot of guidance in the regression chapter of our User’s Guide.
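
For reference, here is a minimal multi-logit (categorical) regression along the lines of the User’s Guide. This is a sketch: D, x, and the normal(0, 5) prior are my illustrative choices, not from the original question.

data {
  int<lower=2> K;                    // number of categories
  int<lower=0> N;                    // number of observations
  int<lower=1> D;                    // number of predictors
  array[N] int<lower=1, upper=K> y;  // outcomes
  matrix[N, D] x;                    // predictor matrix
}
parameters {
  matrix[D, K] beta;                 // one column of coefficients per category
}
model {
  matrix[N, K] x_beta = x * beta;    // linear predictor per observation and category
  to_vector(beta) ~ normal(0, 5);    // illustrative weakly informative prior
  for (n in 1:N) {
    y[n] ~ categorical_logit(x_beta[n]');  // softmax link
  }
}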

For pure naive Bayes, what you need is an independent generative model for all of the components of an observation. Working without tuples in Stan, you need something like this if you have, let’s say 2 discrete covariates (aka features), one on a 1–5 scale and one binary, and 2 continuous predictors, one bounded below by zero. I’m going to make up arbitrary distributions for all of these—the trick is choosing good ones. We have K copies of the parameters for each of our covariates.

data {
  int<lower=2> K;                    // number of categories
  int<lower=0> N;                    // number of training items
  array[N] real x1;                  // covariates
  array[N] real<lower=0> x2;
  array[N] int<lower=1, upper=5> u1;
  array[N] int<lower=0, upper=1> u2;
  array[N] int<lower=1, upper=K> y;  // observations
}
parameters {
  simplex[K] theta;                       // population category distribution
  array[K] real mu1;                      // first covariate params
  array[K] real<lower=0> sigma1;
  array[K] real<lower=0> lambda2;         // second covariate param
  array[K] simplex[5] alpha3;             // third covariate param
  array[K] real<lower=0, upper=1> beta4;  // fourth covariate param
}
model {
  for (n in 1:N) {
     vector[K] lp = log(theta);  // unnormalized log Pr(Y[n] = k | x[n], u[n])
     for (k in 1:K) {
       lp[k] += normal_lpdf(x1[n] | mu1[k], sigma1[k]);  // first real covariate prob
       lp[k] += exponential_lpdf(x2[n] | lambda2[k]);    // second real covariate log pdf
       lp[k] += categorical_lpmf(u1[n] | alpha3[k]);     // first discrete covariate log pmf
       lp[k] += bernoulli_lpmf(u2[n] | beta4[k]);        // second discrete covariate log pmf
     }     
     y[n] ~ categorical_logit(lp);   // eq.  y[n] ~ categorical(exp(lp - log_sum_exp(lp)))
  }
}

I haven’t tried compiling this code or simulating data, but it’s the right math. In Stan, all the parameters implicitly have a uniform prior. This is improper for the parameters with unbounded support (all but theta, alpha3, and beta4). Priors for the remaining parameters can be added to the model block.
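
For instance, one illustrative set of priors (my choices, not from the original post), added inside the model block:

  mu1 ~ normal(0, 5);          // weakly informative location prior
  sigma1 ~ lognormal(0, 1);    // keeps scales positive and finite
  lambda2 ~ gamma(2, 0.1);     // illustrative rate prior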

With Stan, you can fit this model with maximum likelihood using optimization, and you can add a penalty function (like a prior, but conceptually different) for penalized maximum likelihood.


There are no constraints allowed in either function arguments or local variables. It is up to the caller to ensure only valid values are passed.

The constraints work differently for parameters. For data, transformed data, transformed parameters, and generated quantities, the constraints are indeed validated at the end of the block, and the values are rejected if the constraints are not satisfied (this is a bad way to do rejection sampling!). For parameters, the constraints are used to transform unconstrained parameters so that they satisfy the constraints. Thus parameter constraints will always be satisfied, with the exception of numerical precision issues.
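
A small sketch of the two behaviors, under my reading of the documentation:

parameters {
  real<lower=0> sigma;            // transformed: sampled unconstrained on the log scale,
                                  // so the constraint always holds
}
transformed parameters {
  real<lower=0> tau = 2 * sigma;  // validated at the end of the block;
                                  // a violating value would reject the draw
}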
