Sorry for the long post - please let me know if this kind of question is welcome here or whether it strays too far off topic.
My fundamental question is: is there any literature on, or practical experience with, modeling structured sparsity as described here?
I’m currently trying to model longitudinal data using a random intercept + slope mixed linear model with many candidate predictors. These predictors are structured in a tree-like way - think of different measurement platforms (layer 1), each measuring different markers (the individual predictors, layer 3). Most of these markers belong to non-overlapping groups (layer 2), defined by domain-specific knowledge and manifesting as strong within-group correlations. That is, we have a coarse grouping (layer 1), a finer grouping (layer 2, nested within layer 1), and the individual predictors (layer 3, nested within layer 2).
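For concreteness, the backbone is roughly y_{it} = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i}) t_{it} + x_{it}^T \beta + \epsilon_{it}, with subject-level random intercepts and slopes (u_{0i}, u_{1i}), and the sparsity priors discussed below placed on the fixed effects \beta.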
The goal is to find important predictors.
Due to the complexity of this task, I decided to give Bayesian modeling a try. Several possible approaches come to mind; for now I am focusing on the modeling of layers 2 and 3:
- Model 1 (the base model): ignore the structure and just model layer 3. I apply the regularized horseshoe for this.
- Model 2 (grouped horseshoe 1): implement each coefficient as \beta_j \sim N(0, \tau \lambda_{G_j} \lambda_j) (normal parameterized by its scale), where \tau and \lambda_j are the usual global and local scales of the regularized horseshoe and \lambda_{G_j} serves as an intermediate scale for the group of predictor j; see the sketch after this list.
- Model 3 (grouped horseshoe 2): as suggested by avehtari here, the local scales could have a group-specific parameter.
- Model 4 (multivariate horseshoe): suggested here. This would allow coefficients from the same group to be correlated.
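To make Model 2 concrete, here is a minimal sketch of the prior structure in PyMC, with simulated toy data. Names like `group_idx` are placeholders, and it leaves out both the random intercept/slope part and the slab regularization of the regularized horseshoe:

```python
import numpy as np
import pymc as pm

# Toy data just to make the sketch runnable; in reality X, y and the
# group assignment come from the measurement platforms.
rng = np.random.default_rng(1)
n_obs, n_pred, n_groups = 100, 12, 3
group_idx = rng.integers(0, n_groups, size=n_pred)  # layer-2 group of predictor j
X = rng.normal(size=(n_obs, n_pred))
y = 2.0 * X[:, 0] + rng.normal(size=n_obs)

with pm.Model() as grouped_hs:
    tau = pm.HalfCauchy("tau", beta=1.0)                      # global scale
    lam_g = pm.HalfCauchy("lam_g", beta=1.0, shape=n_groups)  # intermediate group scales
    lam = pm.HalfCauchy("lam", beta=1.0, shape=n_pred)        # local scales
    # beta_j ~ N(0, tau * lambda_{G_j} * lambda_j), normal parameterized by its scale
    beta = pm.Normal("beta", mu=0.0, sigma=tau * lam_g[group_idx] * lam, shape=n_pred)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("y_obs", mu=pm.math.dot(X, beta), sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, target_accept=0.95)
```

A non-centered parameterization of beta would likely sample better in practice, given the funnel geometry of horseshoe-type priors.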
My intuitions are the following:
- Model 2 allows some groups to be shrunk less than others, making it easier for predictors from such a group to contribute to the predictions.
- Models 2 and 3 should be mostly equivalent, with Model 2 being conceptually simpler to generalize to an additional layer.
- Model 4 seems too complicated for a large number of covariates: the within-group correlation matrices alone add many parameters.
I have already compared Models 1 and 2, and the preliminary results agree with my prior expectations:
Model 2 seems to favor a certain group (it gets the smallest group shrinkage), so more predictors from this group end up in the top-X list of predictors than under Model 1.
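As an illustration of how such a top-X list can be produced (reusing the `idata` object from the sketch above; ranking by posterior mean absolute coefficient is one simple criterion, shrinkage factors would be another):

```python
import numpy as np

# Posterior mean of each coefficient, averaged over chains and draws,
# then ranked by absolute magnitude to get a "top X" list (here X = 5).
post_mean = idata.posterior["beta"].mean(dim=("chain", "draw")).values
top_x = np.argsort(-np.abs(post_mean))[:5]
print("top predictors:", top_x)
```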
My remaining questions:
- Do you see any fundamental issues with Model 2?
- Is there an “easy” way to “push” the group effect onto a single representative from each group? This could probably be done using PCA per group or similar (a rough sketch of what I mean is below), but the real goal would be to end up with only a single (or very few) candidates per group rather than having to measure all of them.
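To sketch the PCA idea (a hypothetical helper, reusing `X` and `group_idx` from the sketch above): fit a one-component PCA per group and keep, as the group's representative, the single marker most correlated with that component.

```python
import numpy as np
from sklearn.decomposition import PCA

def group_representatives(X, group_idx):
    """For each layer-2 group, return the index of the single marker whose
    values are most correlated with the group's first principal component."""
    reps = []
    for g in np.unique(group_idx):
        members = np.flatnonzero(group_idx == g)
        pc1 = PCA(n_components=1).fit_transform(X[:, members]).ravel()
        corr = [abs(np.corrcoef(X[:, j], pc1)[0, 1]) for j in members]
        reps.append(members[int(np.argmax(corr))])
    return np.array(reps)

representatives = group_representatives(X, group_idx)  # one column index per group
```

This is of course a two-stage shortcut rather than a model-based answer, hence the question.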
Thanks for looking!