Shrinkage and feature selection in designed experiments using brms - which method to choose

Please also provide the following information in addition to your question:

  • Operating System: Linux Ubuntu
  • brms Version: 2.10.0

When developing a new product blend, I run a response surface designed experiment and analyze the results using linear regression. Often I vary, e.g., four factors at a time, and the design then supports main, interaction, and squared effects: intercept + 4 main + 4 squared + 6 interaction terms.
Features are centered and scaled.
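
In R formula notation, the full model looks something like this (the response `y` and factor names `x1`..`x4` are placeholders):

```r
# Full quadratic (response surface) model in four centered and scaled factors:
# intercept + 4 main effects + 6 two-way interactions + 4 squared terms
full_formula <- y ~ (x1 + x2 + x3 + x4)^2 +
  I(x1^2) + I(x2^2) + I(x3^2) + I(x4^2)
```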

In a frequentist setting, I would use leave-one-out cross-validation and remove terms whenever doing so improves predictive performance (Q2 = 1 - (prediction residual sum of squares / total sum of squares)).
Terms are removed while observing a hierarchy, meaning I cannot leave out a main effect but still keep an interaction term or a squared effect involving it.

I would like to build an automatic pipeline to analyze these kinds of experiments, and I would like to use a Bayesian approach, preferably using brms. Now I’m wondering how to select the best subset in a Bayesian setting using brms.

  1. shrinkage: should I opt for shrinkage using either a horseshoe prior on the effects or a multilevel model, assuming that effects are normally distributed? Going this route, I would not need to worry about feature hierarchy! However, will there be enough predictors (e.g. 8+) in my setting for this to make sense?
  2. fit several models, representing all valid formulas, and find the best-performing model according to LOO?
  3. something entirely different…

If 2) is the best option, do you have any ideas on how to automatically generate all valid formulas according to the hierarchy criteria outlined above?

Any insights, ideas, and thoughts are most welcome.

Thank you for being a great community.

/Jannik

4 Likes

Have you considered the rstanarm::stan_lm approach? It seems to be exactly what you are looking for. The vignette on estimating regularized linear regressions uses an example from an experiment with a number of factors and some interactions. The idea of the approach is that you specify a prior on the R^2, i.e., on the overall model fit, which provides shrinkage for all parameters in a principled way. In the example, this prior gives better out-of-sample predictions according to the loo estimate and uses fewer effective parameters.
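
A minimal sketch of what that looks like (the data frame `dat`, response `y`, and factors `x1`..`x4` are placeholders, and the R2 location is only an illustrative prior guess):

```r
library(rstanarm)

# Full quadratic model with a prior on the overall R^2;
# R2(location = 0.5) encodes a prior guess that the model explains
# about half of the variance (an assumption for illustration)
fit <- stan_lm(
  y ~ (x1 + x2 + x3 + x4)^2 + I(x1^2) + I(x2^2) + I(x3^2) + I(x4^2),
  data = dat,
  prior = R2(location = 0.5)
)
loo(fit)  # PSIS-LOO estimate of out-of-sample predictive performance
```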

2 Likes

Thank you stijn.

No, I was not aware of this resource. I will take a closer look. Thank you for pointing me in this direction.

/Jannik

Hi Jannik,

The prior on R^2 for linear models is an interesting approach.
My first question is whether it is an accepted, peer-reviewed practice yet?

There is a recent peer-reviewed paper about a Bayesian R^2 for linear regression:

@article{Gelman2019RSquaredForBayesianRegressionModels,
title={R-squared for Bayesian regression models},
author={Gelman, Andrew and Goodrich, Ben and Gabry, Jonah and Vehtari, Aki},
journal={The American Statistician},
volume={73},
number={3},
pages={307–309},
year={2019},
publisher={Taylor & Francis}
}

A link to the PDF can be found here.

Best regards,

Blaine

Removing terms is unlikely to improve true performance; any apparent improvement in LOO is by chance, and the approach can lead to severe overfitting in the selection process. See Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711-735.

Use the projpred package with brms. Start by checking the bodyfat, diabetes, and projpred case studies at Model selection tutorials and talks. For more information, see the video “Model assessment, comparison and selection at Master class in Bayesian statistics, CIRM, Marseille” linked on that same page. For additional reading, see Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711-735, and Juho Piironen, Markus Paasiniemi, and Aki Vehtari (2018). Projective inference in high-dimensional problems: prediction and feature selection.

If you have many more observations than predictors, then the (regularized) horseshoe is not necessarily needed, but in my experiments it has never made the results worse. See the above-mentioned examples.

  1. is fine if you have a lot of data compared to the total number of models and you don’t suffer much from overfitting due to the selection process.

  2. Projection predictive variable selection (projpred) is much more resistant to overfitting, as you can see from Piironen and Vehtari (2017) mentioned above.

For the current version of projpred you need to define the interaction terms explicitly, so that they look just like any other covariate to projpred; projpred then does the search through the model space for you. The next release will support interactions defined in the formula, making it even easier.
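
Roughly, the workflow looks like this (a sketch, assuming versions of brms and projpred that work together; `dat`, `y`, and `x1`..`x4` are placeholders):

```r
library(brms)
library(projpred)

# Add interaction and squared terms as explicit columns so that projpred
# treats them like any other covariate (only two shown for brevity)
dat$x1x2 <- dat$x1 * dat$x2  # ...and likewise for the other pairs
dat$x1sq <- dat$x1^2         # ...and likewise for x2..x4

# Reference model: the maximal model with a horseshoe prior on the weights
ref_fit <- brm(
  y ~ x1 + x2 + x3 + x4 + x1x2 + x1sq,  # ...plus the remaining terms
  data = dat,
  prior = set_prior(horseshoe(1), class = "b")
)

vs <- cv_varsel(ref_fit)           # cross-validated search over submodels
plot(vs, stats = "elpd")           # performance as a function of model size
nsel <- suggest_size(vs)           # smallest size with near-reference performance
prj <- project(vs, nterms = nsel)  # project the posterior onto the selected terms
```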

2 Likes

Yes, see the stan_lm vignette for the reference.

1 Like

And see also Bayesian Variable Selection in Searching for Additive and Dominant Effects in Genome-Wide Data for a practical example of eliciting a prior for R^2.

Found it! Thank you Aki!

Thanks Aki,

So if I understand the material correctly, I could fit the maximal model supported by the design (perhaps with a suitable prior on R^2) as the baseline model, and then use projpred and cross-validation to find a minimal model with similar predictive performance on the seen data. This approach would further minimize the risk of overfitting, thereby giving more accurate predictions on unseen data within the design space.

Now I’m wondering whether it has been investigated if this approach would also allow accurate extrapolation further away from the design space than the maximal model? That is, whether one could say that it is safer to extrapolate using the minimal model? (I sometimes need to move outside the design space for optimization purposes.)

Furthermore, is it correct to assume that, with the current approach where interactions are added as extra features, the minimal model could in principle include an interaction term without its main effect, breaking the feature hierarchy explained in my post above?

Finally (and perhaps a question belonging in a different thread), as hinted above, I would like to use the minimal model to search the design space (and slightly beyond) to find the settings of the features that optimize one or several design goals. Are you aware of, or can you recommend, a framework in R to do this efficiently with these types of models?

(Currently, I’m exploring the desirability package by Max Kuhn and base R’s optim function to do this.)
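
To make it concrete, this is roughly what I have in mind with optim (a sketch; `fit_min` is a placeholder for the fitted minimal brms model, assuming its formula contains the interaction and squared terms directly and the factors are coded to [-1, 1]):

```r
# Objective: negative posterior-mean prediction, since optim() minimizes
objective <- function(x) {
  newdat <- data.frame(x1 = x[1], x2 = x[2], x3 = x[3], x4 = x[4])
  -fitted(fit_min, newdata = newdat)[, "Estimate"]
}

# Search the coded design region [-1, 1]^4, starting from the center point
opt <- optim(par = rep(0, 4), fn = objective,
             method = "L-BFGS-B", lower = -1, upper = 1)
opt$par  # candidate optimal factor settings (coded units)
```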

Thanks once again, Jannik

  • if the number of observations is much larger than the number of parameters, then the prior choice doesn’t matter much
  • if the number of observations is not much larger than, or is smaller than, the number of parameters and you assume all covariates are relevant, then a prior on R^2 is a good choice
  • if the number of observations is not much larger than, or is smaller than, the number of parameters and you assume only some covariates are relevant (but you don’t know which), then a regularized horseshoe prior on the weights is a good choice
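
For the last case, a brms sketch of the regularized horseshoe (the par_ratio value is only an illustrative prior guess that roughly 30% of the terms are relevant):

```r
library(brms)

# Regularized horseshoe on the regression weights; par_ratio encodes the
# assumed ratio of relevant to irrelevant terms (an assumption here)
f <- y ~ (x1 + x2 + x3 + x4)^2 + I(x1^2) + I(x2^2) + I(x3^2) + I(x4^2)
fit_hs <- brm(f, data = dat,
              prior = set_prior(horseshoe(df = 1, par_ratio = 0.3), class = "b"))
```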

Yes.

Extrapolation requires additional assumptions on the independence of the covariates, the measurement model for the covariate values, and possible non-linearity. If there are correlated covariates, it may be safer to include many of them so that the model is less sensitive to errors in measuring the value of a single covariate. Thus there is no simple answer to this.

You can do the projection to any model you like, so if the current projpred gives you a model which drops a main effect, you can put it back and project again. The next release understands interactions and keeps the main effects.
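
For example, reusing the `ref_fit` and explicit interaction columns from the sketch above (the `solution_terms` argument name may differ across projpred versions):

```r
library(projpred)

# Re-project onto a hand-picked term set that respects the hierarchy:
# the main effects x1 and x2 are forced back in alongside x1x2
prj_hier <- project(ref_fit, solution_terms = c("x1", "x2", "x1x2"))
```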

No, but if you find something I would like to learn if there are any good packages for this.

3 Likes