Hi everyone,
In the following Stan model, I have fitted a logistic regression to training data. I then want to calculate the misclassification error on test data, so I have first obtained the predicted probabilities for the test set.
data {
  int<lower=1> N1;               // number of training observations
  int<lower=1> N2;               // number of test observations
  int<lower=1> K1;               // number of predictors
  int<lower=0,upper=1> yt[N1];   // response of training data
  matrix[N1, K1] x1;             // training data matrix
  matrix[N2, K1] x1h;            // test data matrix
}
parameters {
  real alpha1;
  vector[K1] beta1;
}
model {
  beta1 ~ normal(0, 100);
  alpha1 ~ normal(0, 100);
  yt ~ bernoulli_logit_glm(x1, alpha1, beta1);
}
generated quantities {
  // inverse-logit transformation gives predicted probabilities for the test data
  vector[N2] y_new = inv_logit(alpha1 + x1h * beta1);
}
My question is: can I improve the efficiency of this code?
My ultimate aim is to extend this code to do K-fold cross-validation.
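Since the stated goal is a misclassification error, one option (a sketch, not part of the original model) is to also draw 0/1 posterior predictive outcomes in generated quantities and compare them to the held-out responses outside Stan; `y_rep` is an illustrative name:

```stan
generated quantities {
  vector[N2] y_new = inv_logit(alpha1 + x1h * beta1); // predicted probabilities
  int y_rep[N2];                                      // 0/1 posterior predictive draws
  for (n in 1:N2)
    y_rep[n] = bernoulli_logit_rng(alpha1 + x1h[n] * beta1);
}
```

Averaging `y_rep` over posterior draws and thresholding (or comparing draw-by-draw) then gives a misclassification rate that propagates parameter uncertainty.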
The model is already quite efficient. The only thing I would suggest is to make the prior scales on alpha1 and beta1 much smaller, since the inv_logit function will overflow to 1 when its input is larger than ~16.
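For example, a sketch with weakly informative priors (the scale 2.5 is a common default choice, not something prescribed in this thread):

```stan
model {
  // Tighter priors keep alpha1 + x1h * beta1 in a range where inv_logit
  // does not saturate at 0 or 1.
  alpha1 ~ normal(0, 2.5);
  beta1 ~ normal(0, 2.5);
  yt ~ bernoulli_logit_glm(x1, alpha1, beta1);
}
```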
Also, given that you're using the bernoulli_logit_glm function, you can use the GPU functionality to speed up the model.
Additionally, you can use reduce_sum parallelisation to speed up the model (if the dataset is large enough to be worth it). An example of this is in the manual here: https://mc-stan.org/docs/2_25/stan-users-guide/reduce-sum.html#example-logistic-regression
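Adapted to this model, a reduce_sum version might look like the sketch below (following the pattern in the linked manual chapter; the function name partial_sum and the grainsize of 1 are illustrative choices):

```stan
functions {
  // reduce_sum slices yt and supplies start/end indices for each slice.
  real partial_sum(int[] y_slice, int start, int end,
                   matrix x, real alpha, vector beta) {
    return bernoulli_logit_glm_lupmf(y_slice | x[start:end], alpha, beta);
  }
}
model {
  int grainsize = 1;  // 1 lets the scheduler choose the slice sizes
  beta1 ~ normal(0, 100);
  alpha1 ~ normal(0, 100);
  target += reduce_sum(partial_sum, yt, grainsize, x1, alpha1, beta1);
}
```

You would also need to compile with threading enabled (e.g. `STAN_THREADS=true` in CmdStan) for the parallelism to take effect.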
For both reduce_sum and (best) GPU support, you'll want to use the cmdstanr interface (or another CmdStan interface, whichever you prefer): https://mc-stan.org/cmdstanr/
@andrjohns Thank you for your reply. I will go over the sources you suggested.
I have a related question to this post. As I mentioned, I want to extend this to a K-fold cross-validation model. Let's say K = 2.
To do that, I first separated the data into folds in R (outside the Stan environment), so that there are two data sets (training and corresponding test sets). In order to do the 2-fold cross-validation, I fitted a separate Stan model for each set of data and stored the results.
But if the value of K is large, I may need to fit K separate models. Is there a more efficient way of doing this?
(I am reading about the loo function, which approximates leave-one-out cross-validation, and I hope to try that method as well. But other than that, I am wondering whether I can do K-fold cross-validation for K = 5 or 10 using a more efficient method than the one I am using right now.)
Thank you.
No, as far as I’m aware there’s no alternative to fitting K models when doing K-fold cross-validation.
However, you may want to look into using either rstanarm or brms for this model, as both have built-in functions for K-fold cross-validation that will automatically partition the data and run all of the models for you.
rstanarm: https://mc-stan.org/rstanarm/reference/kfold.stanreg.html
brms: http://paul-buerkner.github.io/brms/reference/kfold.html
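If you want to stay with plain Stan, a common pattern (a sketch, not from this thread) is to write one program that takes any fold's train/test split via the existing data block and also computes held-out log-likelihoods in generated quantities, then loop over the K folds from R, recompiling nothing. This assumes you additionally pass the held-out responses, here called yth, in the data block:

```stan
// Added to the data block (illustrative name):
//   int<lower=0,upper=1> yth[N2];  // held-out responses for this fold
generated quantities {
  vector[N2] log_lik_heldout;  // pointwise held-out log-likelihood
  for (n in 1:N2)
    log_lik_heldout[n] = bernoulli_logit_lpmf(yth[n] | alpha1 + x1h[n] * beta1);
}
```

Collecting log_lik_heldout across the K fits gives the pointwise quantities needed for an elpd-style K-fold estimate, alongside any misclassification summary.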