Using Horseshoe prior in hierarchical model for variable selection

Hello, I read the wonderful tutorial by @betanalpha on the horseshoe prior. Is there any work showcasing the horseshoe prior for variable selection in a hierarchical model? Or can anyone share their experience? Thank you.

There are a few different possibilities for what you are trying to do:

  • Model sparsity in fixed effects terms without touching the random effects. For example if you have a brms style model formula like y ~ a_1 + ... + a_N + (1 | group), then you can model sparsity in the coefficients for the a_i covariates using the exact same approach as in the non-hierarchical case.
  • Model sparsity in the random effects terms; i.e. shrink most random effect standard deviations to near zero.
  • Model sparsity in pairs of fixed effect terms and their associated random effects. For example, if you have a model formula like y ~ a_1 + ... + a_N + (1 + a_1 + ... + a_N | group), then you might want to ensure that whenever the coefficient for a_i shrinks to near zero, the standard deviation for the corresponding group-specific coefficients also shrinks to near zero (thus removing the influence of the covariate a_i from the model entirely).
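To make the third option concrete, here is a minimal prior-simulation sketch in plain Python. It assumes one particular construction that is not taken from this thread: each covariate gets a single half-Cauchy local scale that multiplies both its population-level coefficient and the standard deviation of its group-specific offsets, so the pair shrinks together. The function names and the `tau0` default are hypothetical, and this only draws from a prior; it is not a fitted model.

```python
import math
import random


def half_cauchy(rng, scale=1.0):
    """Draw from a half-Cauchy(0, scale) distribution via its inverse CDF."""
    return scale * math.tan(0.5 * math.pi * rng.random())


def draw_paired_horseshoe(n_covariates, n_groups, tau0=0.1, seed=0):
    """One prior draw in which each coefficient and its group-level sd share a local scale.

    ASSUMPTION: sharing the local scale is just one way to tie the pair together.
    """
    rng = random.Random(seed)
    tau = half_cauchy(rng, tau0)  # global scale shared by all covariates
    betas, sigmas, offsets = [], [], []
    for _ in range(n_covariates):
        lam = half_cauchy(rng)  # local scale shared by the (beta_i, sigma_i) pair
        beta = rng.gauss(0.0, 1.0) * tau * lam  # population-level coefficient
        sigma = abs(rng.gauss(0.0, 1.0)) * tau * lam  # sd of group-specific offsets
        betas.append(beta)
        sigmas.append(sigma)
        offsets.append([rng.gauss(0.0, sigma) for _ in range(n_groups)])
    return {"tau": tau, "beta": betas, "sigma": sigmas, "offsets": offsets}


draw = draw_paired_horseshoe(n_covariates=5, n_groups=3, seed=1)
```

Because `lam` enters both `beta` and `sigma`, a small local scale pulls the coefficient and the spread of its group-specific offsets toward zero at the same time. The first option corresponds to dropping `sigma` and `offsets` from the sketch entirely.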

Does one of these adequately capture the use that you are asking about?

Let me just note that I don’t actually recommend variable selection. In Sparsity Blues I motivate the horseshoe prior as one of multiple ways to pull marginal posterior distributions for individual parameters below some relevance threshold so that they yield a negligible influence on inferences and predictions. Although it may sound superficially similar, that isn’t the same as removing those individual parameters entirely.
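As a rough illustration of what "below some relevance threshold" can mean in practice, the sketch below estimates how much of a marginal posterior sits inside the threshold from a list of posterior draws. The function name, the draws, and the threshold value are all made up for illustration; choosing the threshold is a modeling decision that this code does not make for you.

```python
def prob_below_threshold(draws, threshold):
    """Fraction of posterior draws whose magnitude is below the relevance threshold."""
    if not draws:
        raise ValueError("need at least one posterior draw")
    return sum(1 for d in draws if abs(d) < threshold) / len(draws)


# Hypothetical posterior draws for a single coefficient.
example_draws = [0.01, -0.02, 0.5, 0.03]
p_small = prob_below_threshold(example_draws, threshold=0.1)  # 0.75
```

A value near 1 says the parameter has a negligible influence relative to that threshold. The parameter is still in the model, which is the distinction being drawn here.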

Removing variables is a decision problem, which requires setting up a utility function (what is gained and lost by keeping a variable vs removing it) and then constructing posterior inferences for each of the possible outcomes (every possible pattern of eliminated variables). Most methods have implicit choices built in, which can not only lead to poor performance when those choices are inappropriate but also make it hard to understand what those choices are in the first place. The “projection predictive” method (see Projection predictive variable selection – A review and recommendations for the practicing statistician) is one of the few that makes its choices a bit more explicit, which is great, although I don’t personally believe that those choices apply in many real problems.
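To show the shape of that decision problem, here is a toy sketch that enumerates every keep/remove pattern and scores each one by its posterior expected utility. The utility is entirely hypothetical and chosen only for illustration: it penalizes the squared change in a linear prediction at one covariate vector `x` caused by the removed variables, plus a fixed cost for every variable kept. It is not the projection predictive method, and a real problem needs its own utility.

```python
from itertools import product


def expected_utility(pattern, draws, x, cost_per_variable):
    """Posterior expected utility of a keep/remove pattern (True means keep).

    HYPOTHETICAL utility: minus the squared prediction change from the removed
    variables, minus a fixed cost for each variable that is kept.
    """
    total = 0.0
    for beta in draws:
        removed = sum(xi * bi for xi, bi, keep in zip(x, beta, pattern) if not keep)
        total -= removed ** 2
    return total / len(draws) - cost_per_variable * sum(pattern)


def best_pattern(draws, x, cost_per_variable):
    """Exhaustive search over all 2**N patterns; only feasible for small N."""
    patterns = product([True, False], repeat=len(x))
    return max(patterns, key=lambda p: expected_utility(p, draws, x, cost_per_variable))


# Hypothetical posterior draws for two coefficients.
draws = [[1.0, 0.01], [1.2, -0.02], [0.9, 0.0]]
choice = best_pattern(draws, x=[1.0, 1.0], cost_per_variable=0.1)
```

With these made-up numbers the second coefficient is tiny in every draw, so removing it costs almost nothing in prediction and saves the per-variable cost. Changing `cost_per_variable` or the utility changes the answer, which is exactly why the choices need to be explicit.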

Prior models like the horseshoe are introduced to facilitate the implementation of a variable selection method. In particular the sparsity induced by these prior models provides an initial guess for which variables to consider removing (i.e. those whose marginal posterior distributions concentrate below the relevance threshold) which can help seed greedy approximations (i.e. try variable removal patterns around an initial pattern suggested by the posterior behavior instead of trying to search over all possible sparsity patterns).
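The seeding idea can be sketched as follows, again in plain Python with hypothetical names. `initial_pattern` marks a variable for removal when most of its marginal posterior draws fall below the relevance threshold (the `min_prob` cutoff is an assumption), and `greedy_search` then tries single-variable flips around that starting pattern instead of visiting all patterns. The `utility` argument is whatever expected-utility function your decision problem defines; the one used below is a toy stand-in.

```python
def initial_pattern(draws, threshold, min_prob=0.9):
    """Keep variable i unless at least min_prob of its draws fall below the threshold."""
    n_draws = len(draws)
    pattern = []
    for i in range(len(draws[0])):
        p_small = sum(1 for beta in draws if abs(beta[i]) < threshold) / n_draws
        pattern.append(p_small < min_prob)
    return tuple(pattern)


def greedy_search(start, utility):
    """Hill-climb over single-variable flips, accepting only strict improvements."""
    current, current_u = start, utility(start)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            candidate = current[:i] + (not current[i],) + current[i + 1:]
            candidate_u = utility(candidate)
            if candidate_u > current_u:
                current, current_u = candidate, candidate_u
                improved = True
    return current, current_u


# Hypothetical posterior draws for three coefficients.
draws = [[1.0, 0.01, 0.5], [1.1, -0.02, 0.6]]
seed_pattern = initial_pattern(draws, threshold=0.1)

# Toy utility for illustration only: count agreements with a fixed reference pattern.
reference = (True, False, True)
toy_utility = lambda p: -sum(a != b for a, b in zip(p, reference))
found, found_u = greedy_search((True, True, True), toy_utility)
```

Like any greedy scheme, this can stop at a local optimum, so the quality of the result depends on both the seed and the utility.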

Regardless of which methods you end up considering, I encourage you to identify what assumptions they’re making so that you at least have a qualitative understanding of how appropriate they might be to your particular problem. Good luck!