Hello, I read the wonderful tutorial by @betanalpha on the horseshoe prior. Is there any work showcasing the horseshoe prior for variable selection in a hierarchical model? Or can anyone share their experience? Thank you.
There are a few different possibilities for what you are trying to do:
- Model sparsity in the fixed effects terms without touching the random effects. For example, if you have a brms-style model formula like y ~ a_1 + ... + a_N + (1 | group), then you can model sparsity in the coefficients for the a_i covariates using the exact same approach as in the non-hierarchical case (see the sketch after this list).
- Model sparsity in the random effects terms; i.e. shrink most random effect standard deviations to near zero.
- Model sparsity in pairs of fixed effect terms and their associated random effects. For example, if you have a model formula like y ~ a_1 + ... + a_N + (1 + a_1 + ... + a_N | group), then you might want to ensure that whenever the coefficient for a_i shrinks to near-zero, the standard deviation for the group-specific coefficients also shrinks to near-zero (thus removing the influence of the covariate a_i from the model entirely).
Does one of these adequately capture the use that you are asking about?
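To make the first option concrete, here is a minimal, untested Stan sketch of that case: the original horseshoe (half-Cauchy local and global scales) on the coefficients of the a_i covariates, together with an ordinary hierarchical prior on the random intercepts. All data names and prior scales below are illustrative, and in practice you may prefer the regularized horseshoe of Piironen and Vehtari; in brms the analogous specification is roughly set_prior("horseshoe(1)") on the population-level coefficients.

```stan
data {
  int<lower=1> N_obs;
  int<lower=1> K;                        // number of a_i covariates
  int<lower=1> J;                        // number of groups
  matrix[N_obs, K] A;                    // columns hold a_1, ..., a_N
  array[N_obs] int<lower=1, upper=J> group;
  vector[N_obs] y;
}
parameters {
  real intercept;
  vector[K] beta;                        // coefficients for a_1, ..., a_N
  vector<lower=0>[K] lambda;             // horseshoe local scales
  real<lower=0> tau;                     // horseshoe global scale
  vector[J] z_group;                     // standardized random intercepts
  real<lower=0> sd_group;
  real<lower=0> sigma;
}
model {
  // horseshoe prior on the fixed effects, exactly as in the
  // non-hierarchical case
  lambda ~ cauchy(0, 1);
  tau ~ cauchy(0, 1);
  beta ~ normal(0, lambda * tau);

  // ordinary hierarchical prior on the random intercepts (non-centered)
  z_group ~ std_normal();
  sd_group ~ normal(0, 1);

  intercept ~ normal(0, 5);
  sigma ~ normal(0, 1);
  y ~ normal(intercept + A * beta + sd_group * z_group[group], sigma);
}
```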
Let me just note that I don't actually recommend variable selection. In Sparsity Blues I motivate the horseshoe prior as one of multiple ways to pull marginal posterior distributions for individual parameters below some relevance threshold so that they yield a negligible influence on inferences and predictions. Although it may sound superficially similar, that isn't the same as removing those individual parameters entirely.
Removing variables is a decision problem, which requires setting up a utility function (what is gained and lost by keeping a variable vs removing it) and then constructing posterior inferences for each of the possible outcomes (every possible pattern of eliminated variables). Most methods have implicit choices built in, which can not only lead to poor performance when those choices are inappropriate but also make it hard to understand what those choices are in the first place. The "projective predictive" method, Projection predictive variable selection – A review and recommendations for the practicing statistician, is one of the few that makes its choices a bit more explicit, which is great, although I don't personally believe that they apply in many real problems.
Prior models like the horseshoe are introduced to facilitate the implementation of a variable selection method. In particular, the sparsity induced by these prior models provides an initial guess for which variables to consider removing (i.e. those whose marginal posterior distributions concentrate below the relevance threshold), which can help seed greedy approximations (i.e. try variable removal patterns around an initial pattern suggested by the posterior behavior instead of trying to search over all possible sparsity patterns).
Regardless of which methods you end up considering, I encourage you to identify what assumptions they're making so that you at least have a qualitative understanding of how appropriate they might be to your particular problem. Good luck!
> Model sparsity in pairs of fixed effect terms and their associated random effects. For example, if you have a model formula like y ~ a_1 + ... + a_N + (1 + a_1 + ... + a_N | group), then you might want to ensure that whenever the coefficient for a_i shrinks to near-zero, the standard deviation for the group-specific coefficients also shrinks to near-zero (thus removing the influence of the covariate a_i from the model entirely).
Sorry to resurrect an old thread, but I have this exact question and haven't been able to figure out how to get it working. I did some searching, and this was the only place I found it discussed. Any insight would be greatly appreciated!
How many covariates do you have over which you wish to induce this form of sparsity?
This is my model:
LRR | se(LRR_SE) ~ (scale(HeightSum) + BarrierType + RemediatedProp + scale(FragmentLengthDifference) +
    scale(CommonLength) + Longevity + Fecundity + SpawningMethod + EggMorphology + LarvalDrift) * DaysSinceFlood +
  (0 + scale(CommonLength) + Longevity + Fecundity + SpawningMethod + EggMorphology + LarvalDrift | Barrier) +
  (0 + scale(HeightSum) + BarrierType + RemediatedProp + scale(FragmentLengthDifference) | SpeciesNames)
My data are effect sizes of fish "fragmented" by artificial barriers. I have ~60 artificial barriers, each with ~15 fish species. Each fish at each barrier has 11 effect sizes, calculated at 11 equal intervals of "DaysSinceFlood" from 0 days to 10 years.
In the model I have 5 covariates related to the features of artificial barriers to fish movement, all of which have a random slope by fish species. Similarly, I have 6 fish traits, all of which have a random slope by barrier ID. All of my covariates have an interaction term with days since flood.
It's a complex model, but the fit is substantially improved with the random effect structure I have included, even compared to a simpler random intercept structure. The problem is that even with shrinkage of uninformative fixed effects via horseshoe priors, I still get overparameterisation due to the group-level effects. Is it appropriate to just put the horseshoe prior on my group-level sd terms as well? My concern there is that they are not paired, so I could have the random effect remain for a shrunk fixed effect, or vice versa.
Probably @avehtari would have something very smart to say about this modeling situation, but I'll do my best.
Assuming that you intentionally omitted variation in the interactions with DaysSinceFlood from your random effects structure, you have 10 of these covariates for which you need to simultaneously shrink a random effect mean and sd. As such, you could consider a true spike-and-slab prior placed on the existence of the effects, with brute-force marginalization over the 1024 possible combinations. And then you could apply the horseshoe to all of the fixed-effect interactions that are absent from the random effects. But this doesn't work if you mean to include the interactions in the random effects structure, because 2^{20} terms is impractical. And in any case, this is not something you can implement via brms without a lot of hacking around with custom Stan code in stanvars.
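To sketch what that brute-force marginalization might look like (untested, with placeholder data names X, y, group, illustrative prior scales, and a single grouping factor for simplicity), the idea is to enumerate all 2^K inclusion patterns, switch each covariate's fixed effect and random-slope sd on or off together within each pattern, and combine the pattern-specific log densities with log_sum_exp. The cost grows as 2^K, which is why this is only feasible for a small number of covariates.

```stan
data {
  int<lower=1> N;                        // observations
  int<lower=1> K;                        // covariates selected in pairs (keep small!)
  int<lower=1> J;                        // groups
  matrix[N, K] X;
  array[N] int<lower=1, upper=J> group;
  vector[N] y;
}
transformed data {
  int M = 1;                             // becomes 2^K, the number of patterns
  array[K] int pow2;                     // pow2[k] = 2^(k - 1), for decoding patterns
  for (k in 1:K) {
    pow2[k] = M;
    M *= 2;
  }
}
parameters {
  real intercept;
  vector[K] beta;                        // fixed effects ("slab" part)
  vector<lower=0>[K] tau;                // random-slope sds, one per covariate
  matrix[J, K] u;                        // standardized group-level slopes
  real<lower=0> sigma;
}
model {
  vector[M] lp;                          // one log-density term per inclusion pattern

  // "slab" priors; scales here are purely illustrative
  intercept ~ normal(0, 5);
  beta ~ normal(0, 1);
  tau ~ normal(0, 1);                    // half-normal via the lower bound
  to_vector(u) ~ std_normal();
  sigma ~ normal(0, 1);

  for (m in 1:M) {
    vector[K] z;                         // inclusion pattern: bits of m - 1
    vector[N] mu;
    for (k in 1:K) {
      z[k] = ((m - 1) %/% pow2[k]) % 2;
    }
    for (n in 1:N) {
      // a covariate's fixed effect and its random slope are switched on/off together
      mu[n] = intercept + X[n] * (z .* (beta + u[group[n]]' .* tau));
    }
    // equal prior weight on every pattern, i.e. independent Bernoulli(0.5)
    // inclusion; replace with whatever pattern prior is appropriate
    lp[m] = -log(M) + normal_lpdf(y | mu, sigma);
  }
  target += log_sum_exp(lp);
}
```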
Another option (I haven't tried it; I don't know if it would work) might be to set informative priors on both the main effects and the random effect variances, so that neither one can be estimated to be very large, and then introduce for each covariate C_i a scaling parameter \alpha_i bounded on the unit interval such that the contribution to the linear predictor is (\mu_i + \epsilon_{ij}) * \alpha_i * C_i, where i indexes the covariate and j indexes the group membership, \epsilon_{ij} being the random effect. Then put the horseshoe on \alpha. The informative priors on \mu and \epsilon (the latter via an informative prior on the random effect sd) will ensure that \alpha cannot shrink to irrelevance unless it's fine to shrink both \mu AND \epsilon to irrelevance. Also not something you can easily achieve within brms, though.
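Here is a rough, untested Stan sketch of that second idea, again with placeholder data names and illustrative prior scales; the horseshoe on \alpha is implemented as a simple half-Cauchy local/global scale mixture truncated to the unit interval, so treat it as a starting point rather than a finished implementation. Whenever \alpha_i is pulled toward zero, the fixed effect \mu_i and the group-level deviations \epsilon_{ij} are switched off together, which is the pairing discussed above.

```stan
data {
  int<lower=1> N;                        // observations
  int<lower=1> K;                        // covariates to pair-shrink
  int<lower=1> J;                        // groups
  matrix[N, K] X;                        // the C_i covariates
  array[N] int<lower=1, upper=J> group;
  vector[N] y;
}
parameters {
  real intercept;
  vector[K] mu;                          // fixed effects (the mu_i)
  vector<lower=0>[K] tau;                // random-slope sds
  matrix[J, K] eps_raw;                  // standardized random slopes
  vector<lower=0, upper=1>[K] alpha;     // per-covariate scaling parameters
  vector<lower=0>[K] lambda;             // horseshoe local scales
  real<lower=0> tau_hs;                  // horseshoe global scale
  real<lower=0> sigma;
}
model {
  // informative priors so that neither the fixed effects nor the
  // random-effect sds can be estimated to be very large on their own
  intercept ~ normal(0, 5);
  mu ~ normal(0, 1);
  tau ~ normal(0, 0.5);                  // half-normal via the lower bound
  to_vector(eps_raw) ~ std_normal();
  sigma ~ normal(0, 1);

  // horseshoe on alpha, truncated to [0, 1] to match its declared bounds
  lambda ~ cauchy(0, 1);
  tau_hs ~ cauchy(0, 1);
  for (k in 1:K) {
    alpha[k] ~ normal(0, lambda[k] * tau_hs) T[0, 1];
  }

  {
    vector[N] eta;
    for (n in 1:N) {
      // contribution of covariate i in group j is (mu_i + eps_ij) * alpha_i * C_i
      vector[K] coef = alpha .* (mu + eps_raw[group[n]]' .* tau);
      eta[n] = intercept + X[n] * coef;
    }
    y ~ normal(eta, sigma);
  }
}
```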
Interesting, thanks for your help. I did choose to exclude the interaction from the random effects, mostly because I knew having it in there would be a nightmare to deal with. These both seem like they should work; I'll see how I go implementing them. I'm still a newbie to brms and haven't delved into Stan just yet, but I'm up for a challenge. Will give it a red hot crack!
You could use a similar approach to what Griffin and Brown describe in their paper Hierarchical Shrinkage Priors for Regression Models. I think @jsocolar's proposal could work, too.
Thank you, that's a really nice and interesting paper. I'm working through it to see if I can get it working in brms, and will report back here if I can.
I am trying to work through similar problems and I find your discussion insightful.
One follow-up question, if I may: why would one need to pair the shrinkage of the random effect with its fixed-effect coefficient? Since horseshoe priors regularise towards zero in the absence of information otherwise, would one not expect the random effect to shrink to zero if the associated fixed effect shrinks to zero?
You don't need to pair the effects in this way, but for some prior models it's appropriate. Let's denote the sum of the fixed effect and the zero-centered random effect as V. Shrinking the fixed effect to zero without any special treatment of the random effect will tend to cause the average magnitude of V to decrease, but the model still retains the idea that V varies across groups, and that these group-specific responses are in general appreciable.
The prior model expressed in this thread is that some covariates matter, and their effect potentially differs by group, and other covariates do not matter at all. Under this model, we want to ensure that either both the fixed and the random effect are "included" in the model (in the sense that they are not penalized via the prior to a region of tiny effect sizes) or that they are both "excluded" (in the opposite sense).
This is a little bit complicated to think (and write!) about, because it's also true (and this might be your point) that when the data don't contain evidence of among-group variability, the hierarchical prior on the random effect itself already tends to shrink it down towards zero.
Thank you for your further insight on this. The point about incomplete sparsity reminded me of some sentences from Piironen & Vehtari 2017:
> A notable difference is that Lasso produces a truly sparse solution with exact zeros for some coefficients, whereas horseshoe does not. After fitting the full model, a truly sparse solution without losing predictive accuracy could be obtained using the projective variable selection (Piironen and Vehtari, 2017b).
I have not delved into the projective variable selection approach, but I wonder if it might be interesting in a hierarchical set-up.
Yes, see this paper.