I’m confused about different approaches people use for Bayesian model selection.
I understand the frequentist approach is generally to fit the most complex model first, with all predictors and their hypothesized interactions, then run subsequent models removing one term at a time, using something like a likelihood ratio test to compare models and selecting the simplest model that does not significantly reduce fit.
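For concreteness, here is a minimal sketch of what I mean, assuming a hypothetical data frame `df` with response `y` and predictors `x1` and `x2`:

```r
# Hypothetical example: start from the most complex model and drop terms,
# comparing nested models with a likelihood ratio test.
m_full    <- glm(y ~ x1 * x2, data = df, family = gaussian())
m_reduced <- glm(y ~ x1 + x2, data = df, family = gaussian())  # interaction dropped

# anova() on nested glms with test = "LRT" performs the likelihood ratio test;
# a non-significant result suggests the simpler model fits about as well.
anova(m_reduced, m_full, test = "LRT")
```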
I’ve seen numerous papers using Bayesian models do this the opposite way: each predictor is fitted first on its own, and only “significant” predictors (those whose credible intervals don’t include 0) are carried into the next round.
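As I understand it, the screening step in those papers looks roughly like this (a sketch using brms; `df`, `y`, `x1`, `x2` are hypothetical):

```r
library(brms)

# Hypothetical screening step: fit each predictor on its own.
m_x1 <- brm(y ~ x1, data = df)
m_x2 <- brm(y ~ x2, data = df)

# fixef() returns posterior means with 95% credible intervals (Q2.5, Q97.5);
# the screening rule keeps a predictor only if its interval excludes 0.
fixef(m_x1)
fixef(m_x2)
```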
I’m struggling to find information about what the benefit of starting simple with individual predictors is, other than it being easier to get models to converge. Wouldn’t you risk throwing away important predictors that are non-significant on their own but matter in an interaction with another variable? Am I missing something, or, if I’m able to get my most complex model to converge, would it be preferable to start complex and drop terms?
Without knowing the field of application and the specifics of the problem, a general procedure would be:
- Model fitting would start from the simplest model (to make sure your model is OK, you understand the data, etc.) and then build up to the complete model with all predictors. There is no feature selection at this stage unless it is motivated by model issues (e.g. extreme correlations between features), not by “significance”.
- Once you are satisfied with step 1, you can move on to hypothesis testing (which features are relevant, given all the others), which happens on the full model. You can also look into projpred by @avehtari for identifying models with a smaller set of features, always starting from the full model (see the sketch below).
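A minimal sketch of the projpred route, assuming an rstanarm reference model and hypothetical variables `y`, `x1`, `x2`, `x3` in a data frame `df` (argument names may differ across projpred versions):

```r
library(rstanarm)
library(projpred)

# Step 1: build up to the full reference model (no significance filtering).
fit_ref <- stan_glm(y ~ x1 * x2 + x3, data = df, family = gaussian())

# Step 2: projection predictive variable selection, starting from the full model.
vs <- cv_varsel(fit_ref)            # cross-validated search over submodels
plot(vs, stats = "elpd")            # predictive performance vs. submodel size
nsel <- suggest_size(vs)            # suggested number of terms to keep
proj <- project(vs, nterms = nsel)  # project the full posterior onto the submodel
```

The point of the projection approach is that the submodel is chosen by predictive performance relative to the full model, not by marginal “significance” of individual terms, so the concern about discarding predictors that only matter in interactions does not arise in the same way.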
The Bayesian workflow paper and my talk related to the paper discuss the benefits of starting from simple. The covariate (and some other model-structure) selection cases are special, as there is a combinatorial explosion in the number of models, but it’s easy to include covariates, and a sensible prior plus appropriate use of decision theory helps in going from the biggest model to a submodel with similar performance, as discussed in a video and in paper 1, paper 2, and paper 3.
So it depends on the case which direction is easier.