Using Horseshoe prior in hierarchical model for variable selection

Hello, I read the wonderful tutorial by @betanalpha on the horseshoe prior. Is there any work showcasing the horseshoe prior for variable selection in a hierarchical model? Or can anyone share their experience? Thank you.

There are a few different possibilities for what you are trying to do:

  • Model sparsity in the fixed-effect terms without touching the random effects. For example, if you have a brms-style model formula like y ~ a_1 + ... + a_N + (1 | group), then you can model sparsity in the coefficients for the a_i covariates using the exact same approach as in the non-hierarchical case (see the sketch after this list).
  • Model sparsity in the random effects terms; i.e. shrink most random effect standard deviations to near zero.
  • Model sparsity in pairs of fixed effect terms and their associated random effects. For example, if you have a model formula like y ~ a_1 + ... + a_N + (1 + a_1 + ... + a_N | group), then you might want to ensure that whenever the coefficient for a_i shrinks to near-zero, the standard deviation for the group-specific coefficients also shrinks to near-zero (thus removing the influence of the covariate a_i from the model entirely).
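
For concreteness, here is a minimal, untested brms sketch of the first option; dat, the covariate names, and the horseshoe degrees of freedom are placeholders rather than recommendations:

    # Horseshoe on the population-level coefficients only; the group-level
    # intercepts keep their default hierarchical prior.
    library(brms)

    fit <- brm(
      y ~ a_1 + a_2 + a_3 + (1 | group),
      data   = dat,
      prior  = set_prior(horseshoe(df = 1), class = "b"),
      chains = 4, cores = 4
    )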

Does one of these adequately capture the use that you are asking about?

Let me just note that I don’t actually recommend variable selection. In Sparsity Blues I motivate the horseshoe prior as one of multiple ways to pull marginal posterior distributions for individual parameters below some relevance threshold so that they yield a negligible influence on inferences and predictions. Although it may sound superficially similar, that isn’t the same as removing those individual parameters entirely.

Removing variables is a decision problem, which requires setting up a utility function (what is gained and lost by keeping a variable vs removing it) and then constructing posterior inferences for each of the possible outcomes (every possible pattern of eliminated variables). Most methods have implicit choices built in, which can not only lead to poor performance when those choices are inappropriate but also make it hard to understand what those choices are in the first place. The “projective predictive” method (see Projection predictive variable selection – A review and recommendations for the practicing statistician) is one of the few that makes its choices a bit more explicit, which is great, although I don’t personally believe that those choices apply in many real problems.

Prior models like the horseshoe are introduced to facilitate the implementation of a variable selection method. In particular the sparsity induced by these prior models provides an initial guess for which variables to consider removing (i.e. those whose marginal posterior distributions concentrate below the relevance threshold) which can help seed greedy approximations (i.e. try variable removal patterns around an initial pattern suggested by the posterior behavior instead of trying to search over all possible sparsity patterns).
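
As a rough illustration of that seeding idea (not a recommendation), here is an untested R sketch that flags coefficients whose marginal posteriors concentrate below a chosen relevance threshold, assuming a fitted brms model fit; the threshold value is a placeholder:

    # Flag coefficients whose posterior mass sits mostly inside
    # (-threshold, threshold); these are candidates for removal patterns.
    threshold <- 0.1                      # problem-specific relevance scale
    draws <- as.data.frame(fit)           # one column of posterior draws per parameter
    b_cols <- grep("^b_", names(draws), value = TRUE)

    prop_below <- sapply(draws[b_cols], function(d) mean(abs(d) < threshold))
    sort(prop_below, decreasing = TRUE)   # values near 1 concentrate below the threshold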

Regardless of which methods you end up considering, I encourage you to identify what assumptions they’re making so that you at least have a qualitative understanding of how appropriate they might be to your particular problem. Good luck!


Model sparsity in pairs of fixed effect terms and their associated random effects. For example, if you have a model formula like y ~ a_1 + ... + a_N + (1 + a_1 + ... + a_N | group), then you might want to ensure that whenever the coefficient for a_i shrinks to near-zero, the standard deviation for the group-specific coefficients also shrinks to near-zero (thus removing the influence of the covariate a_i from the model entirely).

Sorry to resurrect an old thread, but I have this exact question, and haven’t been able to figure out how to get it working. I did some searching, and this was the only place I found this discussed. Any insight would be greatly appreciated!

How many covariates do you have over which you wish to induce this form of sparsity?

This is my model:

    LRR | se(LRR_SE) ~
      (scale(HeightSum) + BarrierType + RemediatedProp + scale(FragmentLengthDifference) +
         scale(CommonLength) + Longevity + Fecundity + SpawningMethod + EggMorphology + LarvalDrift) * DaysSinceFlood +
      (0 + scale(CommonLength) + Longevity + Fecundity + SpawningMethod + EggMorphology + LarvalDrift | Barrier) +
      (0 + scale(HeightSum) + BarrierType + RemediatedProp + scale(FragmentLengthDifference) | SpeciesNames)

My data are effect sizes of fish ‘fragmented’ by artificial barriers. I have ~60 artificial barriers, each with ~15 fish species. Each species at each barrier has 11 effect sizes, calculated at 11 equal intervals of ‘DaysSinceFlood’ from 0 days to 10 years.

In the model I have 5 covariates related to the features of artificial barriers to fish movement, all of which have a random slope by fish species. Similarly, I have 6 fish traits, all of which have a random slope by barrier ID. All of my covariates have an interaction term with days since flood.

It’s a complex model, but the fit is substantially improved with the random effect structure I have included, even just compared to a simpler random intercept structure. The problem is that even with shrinkage of uninformative fixed effects via horseshoe priors, I still get overparameterisation due to the group-level effects. Is it appropriate to just put the horseshoe prior on my group-level sd terms as well? But then I get concerned that they are not paired: I could end up with the random effect remaining for a shrunk fixed effect, or vice versa.

Probably @avehtari would have something very smart to say about this modeling situation, but I’ll do my best.

Assuming that you intentionally omitted variation in the interactions with DaysSinceFlood from your random effects structure, you have 10 of these covariates for which you need to simultaneously shrink a fixed-effect coefficient and a random-effect sd. As such, you could consider a true spike-and-slab prior placed on the existence of the effects, with brute-force marginalization over the 1024 possible combinations. And then you could apply the horseshoe to all of the fixed-effect interactions that are absent from the random effects. But this doesn’t work if you mean to include the interactions in the random effects structure, because brute-force marginalization over 2^20 combinations is impractical. And in any case, this is not something you can implement via brms without a lot of hacking around with custom Stan code in stanvars.

Another option (haven’t tried it; don’t know if it would work) is maybe to set informative priors on both the main effects and the random effect variances, so that neither one can be estimated to be very large, and then introduce for each covariate C_i a scaling parameter \alpha_i bounded on the unit interval such that the contribution to the linear predictor is (\mu_i + \epsilon_{ij}) * \alpha_i * C_i where i indexes the covariate and j indexes the group membership, \epsilon_{ij} being the random effect. Then put the horseshoe on \alpha. The informative priors on \mu and \epsilon (the latter via an informative prior on the random effect sd) will ensure that \alpha cannot shrink to irrelevance unless it’s fine to shrink both \mu AND \epsilon to irrelevance. Also not something you can easily achieve within brms though.
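
In case it helps, here is one very rough, untested sketch of what that second idea might look like as custom Stan code (written as a model string you could compile with cmdstanr or rstan). Every name, prior scale, and the exact form of the horseshoe-style prior on alpha is an assumption for illustration, not a recipe:

    # Custom-Stan sketch of the per-covariate scaling idea: alpha[k] in [0, 1]
    # multiplies both the fixed effect and the group-level deviation, with a
    # horseshoe-style prior on alpha and informative priors on mu and sd_eps.
    scaled_inclusion_model <- "
    data {
      int<lower=1> N;                        // observations
      int<lower=1> K;                        // covariates C_1, ..., C_K
      int<lower=1> J;                        // groups
      matrix[N, K] C;
      array[N] int<lower=1, upper=J> group;
      vector[N] y;
    }
    parameters {
      vector[K] mu;                          // fixed-effect coefficients
      matrix[J, K] z;                        // standardized group deviations
      vector<lower=0>[K] sd_eps;             // random-effect sds
      vector<lower=0, upper=1>[K] alpha;     // per-covariate scaling parameters
      vector<lower=0>[K] lambda;             // horseshoe local scales
      real<lower=0> tau;                     // horseshoe global scale
      real intercept;
      real<lower=0> sigma;
    }
    model {
      // informative priors so neither mu nor the random effects can grow
      // to compensate for a small alpha
      mu ~ normal(0, 1);
      sd_eps ~ normal(0, 1);
      to_vector(z) ~ std_normal();
      // horseshoe-style prior on alpha, truncated to the unit interval
      lambda ~ cauchy(0, 1);
      tau ~ cauchy(0, 1);
      for (k in 1:K)
        alpha[k] ~ normal(0, tau * lambda[k]) T[0, 1];
      intercept ~ normal(0, 1);
      sigma ~ normal(0, 1);
      {
        vector[N] eta = rep_vector(intercept, N);
        for (n in 1:N)
          for (k in 1:K)
            eta[n] += (mu[k] + sd_eps[k] * z[group[n], k]) * alpha[k] * C[n, k];
        y ~ normal(eta, sigma);
      }
    }
    "
    # e.g. mod <- cmdstanr::cmdstan_model(cmdstanr::write_stan_file(scaled_inclusion_model))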

Interesting, thanks for your help. I did choose to exclude the interaction from the random effects, mostly because I knew having it in there would be a nightmare to deal with. These both seem like they should work; I’ll see how I go implementing them. I’m still a newbie to brms and haven’t delved into Stan just yet, but I’m up for a challenge. Will give it a red hot crack!

You could use an approach similar to the one Griffin and Brown describe in their paper Hierarchical Shrinkage Priors for Regression Models. I think @jsocolar’s proposal could work, too.


Thank you, that’s a really interesting paper. I’m working through it to see if I can get it working in brms, and will report back here if I can.

I am trying to work through similar problems and I find your discussion insightful.

One follow-up question, if I may: why would one need to pair the shrinkage of the random effect with its fixed-effect coefficient? By virtue of horseshoe priors regularising towards zero in the absence of information otherwise, would one not expect the random effect to shrink to zero if the associated fixed effect shrinks to zero?

You don’t need to pair the effects in this way, but for some prior models it’s appropriate. Let’s denote the sum of the fixed effect and the zero-centered random effect as V. Shrinking the fixed effect to zero without any special treatment of the random effect will tend to decrease the average magnitude of V, but the model still retains the idea that V varies from group to group, and that these group-specific responses are in general appreciable.
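
As a toy illustration of that point (all numbers arbitrary): if the fixed effect is shrunk to near zero but the random-effect sd is left alone, the group-specific effects V can remain appreciable:

    # mu shrunk to ~0, group-level sd untouched: V_j = mu + eps_j still varies.
    set.seed(1)
    mu   <- 0.01                  # fixed effect shrunk to near zero
    sd_g <- 0.5                   # random-effect sd not shrunk
    V    <- mu + rnorm(20, 0, sd_g)
    summary(abs(V))               # many groups still show sizeable effects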

The prior model expressed in this thread is that some covariates matter, and their effect potentially differs by group, and other covariates do not matter at all. Under this model, we want to ensure that either both the fixed and the random effect are “included” in the model (in the sense that they are not penalized via the prior to a region of tiny effect sizes) or that they are both “excluded” (in the opposite sense).

This is a little bit complicated to think (and write!) about, because it’s also true (and this might be your point) that when the data don’t contain evidence of among-group variability, the random effect hierarchical prior itself already tends to penalize down towards zero.


Thank you for your further insight on this. The point about incomplete sparsity reminded me of some sentences from Piironen & Vehtari 2017:

A notable difference is that Lasso produces a truly sparse solution with exact zeros for some coefficients, whereas horseshoe does not. After fitting the full model, a truly sparse solution without losing predictive accuracy could be obtained using the projective variable selection (Piironen and Vehtari, 2017b)

I have not delved into the projective variable selection approach, but I wonder if it might be interesting in a hierarchical set-up.

Yes, see this paper
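
For reference, that workflow is implemented in the projpred R package. A rough, untested sketch, where fit stands in for an already-fitted brms or rstanarm reference model and the arguments are only illustrative:

    # Projection predictive variable selection with projpred.
    library(projpred)

    vs <- cv_varsel(fit)                   # cross-validated search over submodels
    plot(vs, stats = c("elpd", "rmse"))    # predictive performance vs. submodel size
    size <- suggest_size(vs)               # heuristic choice of submodel size
    proj <- project(vs, nterms = size)     # project the posterior onto that submodel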
