Does it make sense to apply regularization simultaneously with model averaging?

I have been tasked with making a predictive model for a set of ~80 observations, each of which has ~250 associated measurements.

I have skimmed over Using stacking to average Bayesian predictive distributions. If I understand correctly, this technique takes a group of already-estimated models and finds the optimal weighting of their predictions.

On the other hand, On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior provides another path to optimal prediction by restricting the number of variables that are allowed to have an influence.

Now, I could in principle fit several thousand models on different subsets and transformations of my variables and then let model averaging aggregate them however it sees fit. Alternatively, I could run a single model with the full set of variables while applying shrinkage. Or I could apply shrinkage to each submodel being estimated before averaging.

I believe model averaging might still be relevant if I wish to consider different error structures and other such variations not readily accommodated by a shrinkage paradigm. But sticking to this more “variable selection”-y situation, I would like to know which is the more reasonable approach (perhaps there are others I have not considered?).

In general, even if you only run a single model in Stan, you’ll be doing model averaging in the predictive phase by virtue of averaging over the model parameter posterior.
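To spell that point out: the posterior predictive distribution is already an average of the sampling distribution over the posterior, which Stan approximates by averaging over the posterior draws,

$$
p(\tilde y \mid y) = \int p(\tilde y \mid \theta)\, p(\theta \mid y)\, d\theta
\;\approx\; \frac{1}{S} \sum_{s=1}^{S} p\bigl(\tilde y \mid \theta^{(s)}\bigr),
\qquad \theta^{(s)} \sim p(\theta \mid y).
$$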

I believe the gold standard approach is to build a model that spans all of the models you believe might have generated the data. So if you think your data is generated by a linear model over some complete set of covariates, you should let that be your model. If you then believe that some unknown number of these covariates have zero influence, you should code that into your prior using e.g. the regularized horseshoe.
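Something like the following (untested) sketch is what I have in mind for the regression case: a Gaussian linear model with the regularized horseshoe on the coefficients, roughly following the Piironen & Vehtari parameterization. The variable names and the hyperparameters passed in as data are placeholder choices, not anything prescribed above; in particular, `scale_global` would be set from a prior guess `p0` of how many predictors are actually relevant.

```stan
// Sketch only: linear regression with a regularized horseshoe prior on beta.
data {
  int<lower=0> N;                  // observations (~80 here)
  int<lower=0> D;                  // predictors (~250 here)
  matrix[N, D] X;
  vector[N] y;
  real<lower=0> scale_global;      // tau0, e.g. p0 / (D - p0) / sqrt(N)
  real<lower=0> slab_scale;        // scale of the slab
  real<lower=0> slab_df;           // degrees of freedom of the slab
}
parameters {
  real alpha;
  real<lower=0> sigma;
  vector[D] z;                     // non-centered coefficients
  vector<lower=0>[D] lambda;       // local shrinkage
  real<lower=0> tau;               // global shrinkage
  real<lower=0> caux;              // auxiliary variable for the slab
}
transformed parameters {
  real<lower=0> c = slab_scale * sqrt(caux);
  vector<lower=0>[D] lambda_tilde
    = sqrt(c^2 * square(lambda) ./ (c^2 + tau^2 * square(lambda)));
  vector[D] beta = z .* lambda_tilde * tau;
}
model {
  z ~ std_normal();
  lambda ~ student_t(1, 0, 1);                  // half-Cauchy via lower bound
  tau ~ student_t(1, 0, scale_global * sigma);  // half-Cauchy scaled by tau0
  caux ~ inv_gamma(0.5 * slab_df, 0.5 * slab_df);
  alpha ~ normal(0, 5);
  sigma ~ normal(0, 2);                         // half-normal via lower bound
  y ~ normal(alpha + X * beta, sigma);
}
```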

Stacking is (IIRC) appropriate if you want to average over models that cannot be fit into a nice continuous model expansion in which parameters interpolate between the different model choices. Maybe you think there might be 3 different particular sparsity patterns at play; then you could do inference for each and use the stacking approach.
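For reference, the stacking weights in that paper are chosen to maximize the leave-one-out estimate of the log score of the weighted mixture of the K models' predictive densities, with the weights constrained to the simplex,

$$
\max_{w} \; \sum_{i=1}^{n} \log \sum_{k=1}^{K} w_k \, p(y_i \mid y_{-i}, M_k),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1,
$$

where the leave-one-out densities are in practice approximated with PSIS-LOO (this is, if I remember the interface correctly, what loo::loo_model_weights with method = "stacking" does in R).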

One caveat with the above recommendation: even if you can build a continuous super-model as described above, it might have terrible geometry, making it difficult to sample from. So while it would be ideal to average over the super-model’s parameter space, it might not be the most efficient option in practice.


This isn’t what’s normally called “Bayesian model averaging”.

The two papers cited in the original post are trying to do different things, though you could use both to decide whether to include a predictor in a regression model, for example.