# Model to calculate misclassification error based on test data

Hi Everyone,

In the following Stan model, I have fitted a logistic regression to the training data. I then want to calculate the misclassification error on the test data. As a first step, I obtain the predicted probabilities for the test data.

``````
data {
  int<lower=1> N1;              // number of training observations
  int<lower=1> N2;              // number of test observations
  int<lower=1> K1;              // number of predictors
  int<lower=0,upper=1> yt[N1];  // response of training data
  matrix[N1, K1] x1;            // training data matrix
  matrix[N2, K1] x1h;           // test data matrix
}

parameters {
  real alpha1;                  // intercept
  vector[K1] beta1;             // regression coefficients
}

model {
  beta1 ~ normal(0, 100);
  alpha1 ~ normal(0, 100);

  yt ~ bernoulli_logit_glm(x1, alpha1, beta1);
}

generated quantities {
  vector[N2] y_new;

  // inverse logit transformation: predicted probabilities for the test data
  y_new = inv_logit(alpha1 + x1h * beta1);
}
``````

My question is: can I make this code more efficient?

My ultimate aim is to extend this code to do K-fold cross validation.

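The model is already quite efficient. The only thing I would suggest is to make the priors on `alpha1` and `beta1` much narrower (a scale far smaller than 100), since the `inv_logit` function saturates for large inputs: its output is already within about 1e-7 of 1 once the input exceeds ~16, and in double precision it rounds to exactly 1 for inputs above roughly 37.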
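As a rough sketch of what that could look like, here is the same model with narrower priors, extended to compute the test misclassification rate in `generated quantities`. The prior scales (5 and 2.5), the test labels `yh`, the 0.5 classification threshold, and the names `p_new` and `misclass` are all assumptions for illustration, not part of the original model. The array declarations follow the syntax used in this thread; recent Stan releases expect the `array[N] int` form instead.

```stan
data {
  int<lower=1> N1;
  int<lower=1> N2;
  int<lower=1> K1;
  int<lower=0,upper=1> yt[N1];   // response of training data
  matrix[N1, K1] x1;             // training data matrix
  matrix[N2, K1] x1h;            // test data matrix
  int<lower=0,upper=1> yh[N2];   // response of test data (assumed extra input)
}
parameters {
  real alpha1;
  vector[K1] beta1;
}
model {
  // Assumed scales; pick values that match how the predictors are scaled.
  alpha1 ~ normal(0, 5);
  beta1 ~ normal(0, 2.5);
  yt ~ bernoulli_logit_glm(x1, alpha1, beta1);
}
generated quantities {
  // Predicted probabilities for the test data
  vector[N2] p_new = inv_logit(alpha1 + x1h * beta1);
  // Share of test observations misclassified in this posterior draw
  real misclass;
  {
    int n_wrong = 0;
    for (n in 1:N2) {
      int y_hat = p_new[n] > 0.5;  // classify at the assumed 0.5 threshold
      if (y_hat != yh[n]) {
        n_wrong += 1;
      }
    }
    misclass = n_wrong * 1.0 / N2;
  }
}
```

Averaging `misclass` over the posterior draws gives an expected misclassification rate. That is not the same as classifying once on the posterior mean of `p_new`, so it is worth deciding which of the two you actually want to report.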

Since you’re using the `bernoulli_logit_glm` function, you can also use Stan’s GPU (OpenCL) support to speed up the model.

Additionally, you can also use the `reduce_sum` parallelisation to speed up the model (if the dataset is large enough to be worth it). An example of this is in the manual here: https://mc-stan.org/docs/2_25/stan-users-guide/reduce-sum.html#example-logistic-regression
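A minimal sketch of how `reduce_sum` might be applied to this particular model, adapted from the pattern in the manual rather than copied from it. The function name `partial_sum` and the `grainsize` data input are assumptions, and the array syntax matches the Stan version used in this thread (recent releases expect `array[] int` instead of `int[]`).

```stan
functions {
  // Log-likelihood contribution of one slice of the training response.
  // reduce_sum passes the slice together with its start and end indices,
  // which are used here to select the matching rows of the design matrix.
  real partial_sum(int[] y_slice, int start, int end,
                   matrix x, real alpha, vector beta) {
    return bernoulli_logit_glm_lpmf(y_slice | x[start:end], alpha, beta);
  }
}
data {
  int<lower=1> N1;
  int<lower=1> K1;
  int<lower=0,upper=1> yt[N1];   // response of training data
  matrix[N1, K1] x1;             // training data matrix
  int<lower=1> grainsize;        // assumed tuning input: observations per slice
}
parameters {
  real alpha1;
  vector[K1] beta1;
}
model {
  beta1 ~ normal(0, 100);
  alpha1 ~ normal(0, 100);
  // Sum the sliced log-likelihoods, potentially in parallel across threads
  target += reduce_sum(partial_sum, yt, grainsize, x1, alpha1, beta1);
}
```

The model has to be compiled with threading enabled for the slices to actually run in parallel. A reasonable starting point is `grainsize = 1`, which leaves the slicing to the scheduler, before tuning it by hand.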

For `reduce_sum`, and for the best GPU support, you’ll want to use the `cmdstanr` interface (or another CmdStan interface, whichever you prefer): https://mc-stan.org/cmdstanr/


@andrjohns Thank you for your reply. I will go over the sources you suggested.

I have a related question. As I mentioned in the post, I want to extend this to K-fold cross-validation. Let's say K=2.
To do that, I first split the data into folds in R (outside the Stan environment), so that there are two pairs of training and corresponding test data sets. To do the 2-fold cross-validation, I then fitted a separate Stan model to each pair and stored the results.
But if K is large, I would need to fit K separate models. Is there a more efficient way of doing this?
(I am reading about the `loo` function, which approximates leave-one-out cross-validation, and I hope to try that method as well. But other than that, I am wondering whether I can do K-fold cross-validation for K=5 or 10 more efficiently than I am doing right now.)

Thank you.

No, as far as I’m aware there’s no alternative to fitting K models when doing K-fold cross-validation.

However, you may want to look into using either `rstanarm` or `brms` for this model, as both have built-in functions for K-fold cross-validation that will automatically partition the data and run all of the models for you.

`rstanarm`: https://mc-stan.org/rstanarm/reference/kfold.stanreg.html

`brms`: http://paul-buerkner.github.io/brms/reference/kfold.html
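If you would rather stay with a hand-written Stan program, one common pattern is to pass the full data set together with a per-fold holdout indicator, so that the same compiled program is reused for every fold and only the data change. This still means K fits, and it gives up the `bernoulli_logit_glm` speed-up in exchange for the convenience. The sketch below is illustrative only: the names `N`, `y`, `x`, `holdout`, and `p` are assumptions, and the array syntax matches the Stan version used in this thread.

```stan
data {
  int<lower=1> N;                    // all observations (training and test)
  int<lower=1> K1;
  int<lower=0,upper=1> y[N];         // full response
  matrix[N, K1] x;                   // full data matrix
  int<lower=0,upper=1> holdout[N];   // 1 = held out in this fold (assumed input)
}
parameters {
  real alpha1;
  vector[K1] beta1;
}
model {
  beta1 ~ normal(0, 100);
  alpha1 ~ normal(0, 100);
  // Only observations outside the current fold contribute to the likelihood
  for (n in 1:N) {
    if (holdout[n] == 0) {
      target += bernoulli_logit_lpmf(y[n] | alpha1 + x[n] * beta1);
    }
  }
}
generated quantities {
  // Predicted probabilities for every row; keep the held-out rows per fold
  vector[N] p = inv_logit(alpha1 + x * beta1);
}
```

From R you would then loop over the folds, set `holdout` to 1 for the rows in the current fold, refit, and keep the entries of `p` that belong to the held-out rows.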
