Is your idea of a model just a subset of the 35 predictors? That’s a space of 2^35 possible models, so it’s a bit too large to use model comparison techniques on. You can use something like the horseshoe to estimate a single model with shrinkage and then truncate the small values. Or you can just leave them in, since they won’t do much, and 35 variables isn’t too many to compute with (unless it’s a very high-performance setting, in which case you’re not using Stan anyway).
But that requires a different prior than the one you specify here. I don’t know if the horseshoe priors are an option for the stan_glm function, but then I’m not even sure which package that comes from (rstanarm?).
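For what it’s worth, I believe stan_glm() is from rstanarm and does accept hs() as a prior; a minimal sketch of what that could look like, assuming a Gaussian outcome and a data frame df holding y plus the 35 predictors (both placeholders):

```r
library(rstanarm)

# Sketch only: df, y, and the model family are placeholders for your setup.
fit_hs <- stan_glm(
  y ~ .,                 # all 35 predictors
  data   = df,
  family = gaussian(),   # swap for binomial() etc. as appropriate
  prior  = hs(),         # regularized horseshoe prior on the coefficients
  chains = 4, iter = 2000
)

# Coefficients shrunk toward zero can then be inspected, e.g. via
posterior_interval(fit_hs, prob = 0.9)
```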
I’m also unclear on why you say y is a dummy variable. Isn’t that the observed data?
Thank you very much for your reply! I found a mistake in my data: some variables were still stored as character.
Sorry for saying "dummy variable"; y is definitely the observed data.
The predictors are strongly correlated, so the model has strong collinearity.
I also used projection predictive variable selection, and the result is:
The suggested size is 3!
So I think the result is not right, but I can’t find the problem.
I haven’t tried the horseshoe prior yet; if I get results I will tell you!
Thanks very much~
Just some food for thought based on my experience with the projpred package. It’s a fantastic feature selection tool, although you may want to play around with the HS prior, as it can pretty aggressively regularize your estimates. This is generally a feature, not a bug, but it may be why it doesn’t match your intuitions about the expected outcome (although I would default to trusting it, as sometimes the regular features of our data aren’t what we expected, and we are fooled by sparsity and irregularities). Typically, the number of rows/features won’t be a problem in a properly posed model run in Stan, other than possibly producing estimates with more uncertainty than you expect to see. Do you have any more info on why you believe the answer you’re getting is ‘wrong’ or why the results aren’t true?
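If the shrinkage does turn out to be too aggressive, one knob worth trying is the global scale of hs(), set from a guess p0 of how many coefficients you expect to matter (the Piironen & Vehtari heuristic). A rough sketch, where df, D, and p0 are placeholders for your data and your guess:

```r
# Sketch: set the horseshoe global scale from a prior guess p0 of the
# number of relevant predictors (Piironen & Vehtari style heuristic).
n  <- nrow(df)                    # number of observations
D  <- 35                          # number of predictors
p0 <- 5                           # expected number of non-zero coefficients (placeholder)
tau0 <- p0 / (D - p0) / sqrt(n)   # suggested global scale

fit_hs <- stan_glm(
  y ~ ., data = df, family = gaussian(),
  prior = hs(global_scale = tau0)
)
```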
Thanks for your reply!
I agree with your suggestions. I will make some comparisons with other models!
When I use the cv_varsel function with validate_search=TRUE, it takes a long time to produce a result.
I want to know: if I set validate_search=FALSE, can I trust the results?
No problem! My experience with Stan, and Bayesian modelling in general, is that if you have the time, it’s always worth doing the ‘full’ option over any approximation, whether that’s MCMC vs. a variational approximation or, in this case, the full cross-validation. On the package homepage, they provide the following.
There are two functions for performing the variable selection: varsel() and cv_varsel(). In contrast to varsel(), cv_varsel() performs a cross-validation (CV) by running the search part with the training data of each CV fold separately (an exception is validate_search = FALSE, see ?cv_varsel and below) and running the evaluation part on the corresponding test set of each CV fold. Because of this CV, cv_varsel() is recommended over varsel(). Thus, we use cv_varsel() here. Nonetheless, running varsel() first can offer a rough idea of the performance of the submodels (after projecting the reference model onto them). A more principled projpred workflow is work under progress.
I think this is pretty solid advice to follow. If you’re just testing out several different model formulations, the varsel() option is fine, but for any downstream or ‘final’ analysis, cv_varsel() is worth the wait to get the proper cross-validation.
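To make that concrete, here is a rough sketch of the two calls, using fit_hs as a placeholder for your reference model (check ?cv_varsel for the exact arguments in your projpred version):

```r
library(projpred)

# Quick, approximate search: no cross-validation of the search path.
vs <- varsel(fit_hs)

# Full version: cross-validates the search itself (slow but recommended
# for any final analysis).
cvvs <- cv_varsel(fit_hs, validate_search = TRUE)

# Compare submodel predictive performance and get a size suggestion.
plot(cvvs, stats = "elpd", deltas = TRUE)
suggest_size(cvvs)
```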
Warning message:
In suggest_size.vsel(cvvs2) :
Could not suggest submodel size. Investigate plot.vsel() to identify if the search was terminated too early. If this is the case, run variable selection with larger value for `nterms_max`.
I think you need to look at your model a bit more carefully - this suggests that there is no optimal number of features, that nothing actually beats a model with no variables; there is no out-of-sample predictive power whatsoever. I would start with a smaller model, possibly on simulated rather than your real data - simulations can really help uncover modelling problems.
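As a rough sketch of how you might follow up on that warning before rebuilding from a smaller model (the nterms_max value here is arbitrary, and validate_search = FALSE is only to keep the diagnostic run fast):

```r
# Sketch: rerun the selection allowing a longer search path, then look at the
# performance curve before trusting any suggested size.
cvvs2 <- cv_varsel(fit_hs, validate_search = FALSE, nterms_max = 20)

# With deltas = TRUE the plot shows elpd relative to the reference model:
# a curve that reaches zero early means small submodels already match the
# reference model, while one that never approaches zero points to a deeper
# modelling problem.
plot(cvvs2, stats = "elpd", deltas = TRUE)
suggest_size(cvvs2)
```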