Model to calculate Misclassification error based on test data

Hi Everyone,

In the following Stan model, I fit a logistic regression to the training data. I then want to calculate the misclassification error on the test data, so I first obtain the predicted probabilities for the test data.

data {
  int<lower=1> N1;                     // number of training observations
  int<lower=1> N2;                     // number of test observations
  int<lower=1> K1;                     // number of predictors
  array[N1] int<lower=0, upper=1> yt;  // response of training data (before Stan 2.26: int<lower=0,upper=1> yt[N1];)
  matrix[N1, K1] x1;                   // training data matrix
  matrix[N2, K1] x1h;                  // test data matrix
}

parameters {
  real alpha1;
  vector[K1] beta1;
}

model {
  beta1 ~ normal(0, 100);
  alpha1 ~ normal(0, 100);
  yt ~ bernoulli_logit_glm(x1, alpha1, beta1);
}

generated quantities {
  // inverse logit of the linear predictor gives predicted probabilities for the test data
  vector[N2] y_new = inv_logit(alpha1 + x1h * beta1);
}
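
For the step from predicted probabilities to a misclassification error, here is a minimal sketch in plain Python, done outside Stan on the extracted draws. It assumes the draws of y_new are available as a list of lists (one inner list per posterior draw), that the test labels are known, and that a 0.5 cutoff on the posterior mean probability is an acceptable classification rule. The function name and the toy numbers are made up for illustration.

```python
def misclassification_error(prob_draws, y_test, threshold=0.5):
    """Share of test observations whose predicted class differs from the label.

    prob_draws: posterior draws of y_new, one list of N2 probabilities per draw.
    y_test:     observed 0/1 labels of the test data (length N2).
    threshold:  assumed cutoff applied to the posterior mean probability.
    """
    n_draws = len(prob_draws)
    n_test = len(y_test)
    # Posterior mean predicted probability for each test observation.
    mean_prob = [sum(draw[i] for draw in prob_draws) / n_draws
                 for i in range(n_test)]
    # Classify as 1 when the mean probability exceeds the cutoff.
    y_hat = [1 if p > threshold else 0 for p in mean_prob]
    n_wrong = sum(1 for pred, obs in zip(y_hat, y_test) if pred != obs)
    return n_wrong / n_test


# Toy example: 2 posterior draws, 4 test observations (made-up numbers).
draws = [[0.9, 0.2, 0.7, 0.4],
         [0.8, 0.3, 0.6, 0.3]]
y_test = [1, 0, 0, 0]
err = misclassification_error(draws, y_test)
print(err)
```

An alternative is to classify within each draw and average the per-draw error rates, which gives a posterior distribution for the error instead of a single number.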

My question is: can I make this code more efficient?

My ultimate aim is to extend this code to do K-fold cross validation.

The model is already quite efficient. The only thing I would suggest is to make the priors on alpha1 and beta1 much narrower, since inv_logit saturates quickly: at an input of about 16 the result is already within roughly 1e-7 of 1, and in double precision it rounds to exactly 1 for inputs above roughly 37.
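
The saturation of inv_logit can be checked numerically. The sketch below uses plain Python doubles, not Stan itself. The last part assumes a single predictor with value 1, so that the linear predictor under the normal(0, 100) priors is normal with standard deviation 100 * sqrt(2); that setup is only an illustration, not the poster's data.

```python
import math


def inv_logit(x):
    """Inverse logit in double precision."""
    return 1.0 / (1.0 + math.exp(-x))


# Distance from 1 at an input of 16: tiny, but not yet exactly zero.
gap_at_16 = 1.0 - inv_logit(16.0)

# At 40, exp(-40) is below double-precision resolution next to 1,
# so the result is exactly 1.0.
saturated_at_40 = inv_logit(40.0) == 1.0

# Assumed illustration: eta = alpha1 + beta1 * 1 with independent
# normal(0, 100) priors, so eta ~ normal(0, 100 * sqrt(2)).
sd_eta = 100.0 * math.sqrt(2.0)
# Prior probability that |eta| > 16, via the complementary error function.
prior_mass_beyond_16 = math.erfc(16.0 / (sd_eta * math.sqrt(2.0)))

print(gap_at_16, saturated_at_40, prior_mass_beyond_16)
```

Under that assumption, most of the prior mass sits where the predicted probability is essentially 0 or 1, which is the motivation for narrower priors.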

Since you're using the bernoulli_logit_glm function, you can also use the GPU functionality to speed up the model.

Additionally, you can also use the reduce_sum parallelisation to speed up the model (if the dataset is large enough to be worth it). An example of this is in the manual here:

For both reduce_sum and (best) GPU support, you’ll want to use the cmdstanR interface (or another cmdstan interface, whichever your preference is):

@andrjohns Thank you for your reply. I will go over the sources you suggested.

I have a related question to this post. As I mentioned, I want to extend this to K-fold cross-validation. Let's say K=2.
To do that, I first split the data into folds in R (outside the Stan environment), so that there are two pairs of data sets (a training set and its corresponding test set). To do the 2-fold cross-validation, I then fitted a separate Stan model to each training set and stored the results.
But if K is large, I need to fit K separate models. Is there a more efficient way of doing this?
(I am reading about the loo function, which approximates leave-one-out cross-validation, and I hope to try that method as well. Other than that, I am wondering whether I can do K-fold cross-validation for K=5 or 10 more efficiently than I am doing right now.)

Thank you.

No, as far as I’m aware there’s no alternative to fitting K models when doing K-fold cross-validation.
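
The bookkeeping for K separate fits can at least be automated. Below is a minimal sketch in plain Python of splitting the observation indices into folds and looping over them. The names kfold_indices, kfold_error and fit_and_score are hypothetical; fit_and_score stands in for whatever code fits the Stan model on the training indices and returns the misclassification error on the held-out indices.

```python
import random


def kfold_indices(n, k, seed=0):
    """Randomly split the indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Every k-th shuffled index goes to the same fold.
    return [idx[i::k] for i in range(k)]


def kfold_error(n, k, fit_and_score, seed=0):
    """Average the held-out error over k folds (unweighted mean).

    fit_and_score(train_idx, test_idx) is a hypothetical callback that
    fits the model on train_idx and returns the error on test_idx.
    """
    errors = []
    for test_idx in kfold_indices(n, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        errors.append(fit_and_score(train_idx, test_idx))
    return sum(errors) / k


# Demo with a dummy scorer that always reports an error of 0.5.
folds = kfold_indices(10, 5)
avg_err = kfold_error(10, 5, lambda train_idx, test_idx: 0.5)
print(folds, avg_err)
```

The K fits are independent of each other, so they can also be run in parallel if enough cores are available.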

However, you may want to look into using either rstanarm or brms for this model, as both have built-in functions for K-fold cross-validation that will automatically partition the data and run all of the models for you.



