Question
Code in stan a naive classifier model with both categorical and continuous predictors
Data
Predicting penguin species given bill and flipper length. Three species of penguins observed. Bill and flipper length are continuous.
data("penguins_bayes",package="bayesrules")
head(penguins_bayes)
Naive classifier in R
I was able to implement a naive conditional independence model in R:
df_2p <- penguins_bayes %>% select(species,bill_length_mm,flipper_length_mm)
py <- table(df_2p$species) / nrow(df_2p)
theta_x1 <- df_2p %>% group_by(species) %>% summarise(
Mean = mean(bill_length_mm,na.rm = T),
SD= sd(bill_length_mm,na.rm = T)
)
theta_x2 <- df_2p %>% group_by(species) %>%
summarise(
Mean = mean(flipper_length_mm,na.rm = T),
SD = sd(flipper_length_mm,na.rm = T)
)
# observed a penguin with bill of 50 mm and flipper of 195 mm
# what's the probability it belongs to each of the three species?
our_penguin <- data.frame(bill_length_mm = 50, flipper_length_mm = 195)
# for example, the probability our penguin is the third species
py[3] * dnorm(our_penguin$bill_length_mm[1],theta_x1$Mean[3],theta_x1$SD[3]) *
dnorm(our_penguin$flipper_length_mm[1],theta_x2$Mean[3],theta_x2$SD[3]) /
(
py[1] * dnorm(our_penguin$bill_length_mm[1],theta_x1$Mean[1],theta_x1$SD[1]) *
dnorm(our_penguin$flipper_length_mm[1],theta_x2$Mean[1],theta_x2$SD[1]) +
py[2] * dnorm(our_penguin$bill_length_mm[1],theta_x1$Mean[2],theta_x1$SD[2]) *
dnorm(our_penguin$flipper_length_mm[1],theta_x2$Mean[2],theta_x2$SD[2]) +
py[3] * dnorm(our_penguin$bill_length_mm[1],theta_x1$Mean[3],theta_x1$SD[3]) *
dnorm(our_penguin$flipper_length_mm[1],theta_x2$Mean[3],theta_x2$SD[3])
)
# the e1071 package has a naive_bayes function to compute naive classification
naive_model_2p <- e1071::naiveBayes(species ~ bill_length_mm + flipper_length_mm,
data = df_2p)
predict(naive_model_2p,newdata=our_penguin,type = "raw")
# my result is consistnt with e1071
Implementation in Stan
The above implementation assumes a normal distribution for continous predictors conditional on each level of y, whose mean and standard deviation are just sample mean and standard deviation.
I would like to have more flexible specifications and also integrate categorical predictors.
To start, I want to replicate the same assumption in Stan.
data {
int<lower=0> N; // # of dobservations
int D_y; // dimension of categorical y
array[N] int<lower=1,upper=D_y> y; // y values
vector[D_y] p_y; // marginal probability of y
int D_x_con; // number of continous predictors x
matrix[N,D_x_con] x_con; // continuous x
}
parameters {
# assume normal distrbution on x, conditional on each level of y
array[D_y] vector[D_x_con] mu_x; // mean
array[D_y] vector[D_x_con] sigma_x; // standard deviation
simplex[D_y] theta_y; // categorical probability
}
model {
array[D_y] vector[D_x_con] Lpx_y; // log conditional probability
vector[D_y] Lpxy; // marginal log likelihood
for(n in 1:N){
int dy; // conditioning on y level
dy = y[n];
for(dx in 1:D_x_con){
Lpx_y[dy,dx] = normal_lpdf(x_con[n,dx] | mu_x[dy,dx], sigma_x[dy,dx]); // calculate conditional probability
// Lpx_y[dy] = normal_lpdf(x_con[n] | mu_x[dy], sigma_x[dy]);
}
Lpxy[dy] = log(p_y[dy]) + sum(Lpx_y[dy]); // naive assumption of conditional independence
theta_y[dy] = exp(Lpxy[dy] - sum(Lpxy)); // calculate posterior probability
}
y ~ categorical(theta_y);
}
Syntax error
simplex declaration of theta_y cannot be made in the model block;
but if theta_y declared in the parameter block, then it is a global variable.
Computation at line 26 cannot be assigned to global variable in previous block.