Hi there,

I have a complex hierarchical model and a latent group-specific parameter \theta_j,~j=1,...,J. For this parameter, I want to test multiple associations with group-specific variables Y_{jk},~k=1,...,K. By testing, I mean a simple linear association of the form

\theta_j = \sum_k \beta_k * Y_{jk} + \epsilon

If I throw in all Y at once (multivariate model), the \beta are stealing from each other as many of the Y are strongly correlated. In the univariate case (estimating the model separately for each Y_k) I feel like I would have to adjust for the number of associations being tested, i.e. the more associations I test the more likely I will find significant’’ ones just by coincidence.

I am aware of frequentist adjustments for multiple hypothesis testing. In the Bayesian context, it was argued that usually no adjustments are necessary (e.g. Gelman et al.). The examples revolve around constructing hierarchical models instead of subgroup comparisons (e.g. of treatments). In these settings, one can expect to have a population-level effect that is shared by all groups. However, I feel this is not applicable in my setting because one would not expect the associations to go in the same direction for all variables. Some variables may be positively associated, some negatively. Strongly correlated variables are expected to have similar associations, which I also find when estimating separate models for each variable Y_k.

Am I correct that hierarchical modeling of the associations is not the best way to go in my setting? If necessary, what adjustments could/should I make or how can I model all aossociations?

Note that a separate thing I am trying in this setting is a latent factor model for Y.

Thanks a lot for any suggestions!

Cheers,
Nic

Well, normally the QR transform is recommended for correlated predictors, and the Finnish Horseshoe prior is recommended for the scenario of many predictors but where only some are expected to be non-zero. @avehtari does QR+horseshoe sound like the appropriate structure for this scenario?

I don’t necessarily expect only some predictors to be non-zero. So it is not a feature selection problem. Basically, all variables could have a significant association. I am just concerned that by testing many associations I will capture random ones. Thus, I thought it would be necessary to adjust. Also, it might be nice to borrow information from other variables. That is, if Y1 and Y2 are highly correlated, then after estimating the association for Y1, I already have a prior belief for Y2.

So if I’m correct, you want to control your type 1 error rate? You really can’t do that in a bayesian setting.
But you can follow the suggestion by @mike-lawrence and achieve something similar.

Yes, I am worried about type 1 errors. Frequentists would adjust the significance level for the number of hypotheses they test. I feel I would have to do something similar.

It seems to me that QR decomposition rather addresses issues regarding computational efficiency and convergence of the model parameters, and the horseshoe priors is considered in the context of feature selection. I concerned with neither. Instead, I have a set of independent variables and I believe each of them could be strongly associated with the dependent variable. Yet, the more associations I consider, the more likely I will find “significant” ones just by change. Wouldn’t I need to adjust for this?

If the number of predictors is big, and you expect many to have non-zero coefficient then they can’t be all big coefficients unless they are highly correlating (see, e.g., [2105.13445] The piranha problem: Large effects swimming in a small pond).

You can separate the sign and magnitude and assume that the magnitudes would be similar.

Hierarchical modeling of the associations is the best way.

You can reduce the randomness by making a good hierarchical joint model and then use the reference predictive approach [2004.13118] Using reference models in variable selection

2 Likes

Thank you so much for this elaborate and helpful advice! Upon reading the suggested work, I also came across the Iterative Supervised Principal Component (ISPC) approach, which could also resolve another issue I am having with PCA (see forum topic here).

In the following, one comment/clarification and quick follow-up question if I understood your suggestion with the hierarchical model correctly.

They are highly correlated. But I am still hesitant with approaches to feature selection. Say I have two strongly correlated features having similar associations with Y. Without knowing the causal link, both associations seem valid. For instance, I have two variables that are both associated with the target: GDP and a government effectiveness indicator. Both variables are also highly correlated (high income countries usually have more effective governments). A minimum feature set would probably only consider GDP, yet my focus is not on prediction but to present both plausible associations of the target (with income and/or with government effectiveness). I feel a ISPC approach might be suitable for this, although it could be a good additional analysis to present results from a sparse model to show the most relevant predictors.

If I understand correctly, that is something I thought of as well. So basically the idea would be to form e.g. two groups: 1) variables I expect positive associations, 2) variables I expect negative associations. Then estimate a hierarchical model partially pooling the estimates of individual variables within groups 1) and 2). Is that what you meant?

Thanks once again for your time and help!

1 Like

It seems you have too constrained view of feature selection. Feature selection in general can be used either in minimum feature set selection of in finding all detectable associations. We discuss also the complete set feature selection in [2004.13118] Using reference models in variable selection.

For each coefficient j have its own normal prior with scale \sigma_j and make a common hierarchical model for all \sigma_j. Then your hierarchical model is just for the magnitudes and not for the signs. You don’t need to know the signs and don’t need to do any pre-grouping. If you think there are groups in the features, you can add that as a prior information or let the model infer the groups structure.

1 Like

Thanks again, I think I am making progress.

First, I probably should have said that I have J=40 observations (though note that the target is a latent variable, i.e. parameter) and K=11 predictors. I guess with just 11 predictors, feature selection e.g. via horseshoe prior is not best.

The hierarchical modeling approach seems promising. My problem is that the effects are additive and thus (as expected) the magnitude of the individual associations are reduced as compared to when estimating the associations one at a time. I understand that not all coefficients can be big, but my point is that two variables that are correlated (because they share common information) should have associations of similar magnitude.

A way around this could be a latent factor model, which I already tried, but the issue is that the factor loadings are not invariant to the ordering of the variables (see my issue here). Besides, I would want to retain those latent factors that are most strongly associated with the target (which is why I found the ISPC approach interesting). Alternatively, one could consider all latent factors (analogously all principal components) and then estimate their association with the target, probably with a sparsity prior.

Edit: messed up with the code, the following summary is rubbish it seems, thus hidden

Summary

What I now did was to combine the suggestion from @mike-lawrence for the QR decomposition (which similar to the PCA gives me an orthogonal matrix) and your suggestion of the hierarchical model. That is, I compute the orthogonal matrix Q, where Y=QR and put a hierarchical prior on \tilde{\beta}.

Here is example code where for simplicity I used as target the posterior mean of \theta_j:

data {
int<lower=1> J;
int<lower=1> K; // number of moderators
matrix[J,K] Y; // moderators
vector[J] theta_j;
}

transformed data {
// Compute, thin, and then scale QR decomposition
matrix[J,K] Q = qr_Q(Y)[, 1:K] * J;
matrix[K,K] R = qr_R(Y)[1:K, ] / J;
matrix[K,K] R_inv = inverse(R);
}

parameters {
real<lower=0> sigma;
vector[K] beta_tilde;
real<lower=0> tau;
}

transformed parameters {
vector[K] beta = R_inv * beta_tilde;
}

model {
// priors
beta_tilde ~ student_t(4., 0., tau);
tau ~ student_t(4., 0., .125);
sigma ~ student_t(4., 0., 0.1);

// likelihood
theta_j ~ normal(Q * beta_tilde, sigma);
}



The question is: does this make sense? Would it be better to use the hierarchical prior on the principal components?

Do you happen to have a reference/example of a model like this being used? I would use hierarchical models like this all the time but I’ve never been confident enough in exactly how to specify them.

I don’t remember a good direct example, but…

Without groups or clustering horseshoe and R2D2 priors have this kind of structure Sparsity information and regularization in the horseshoe and other shrinkage priors
https://www.tandfonline.com/doi/full/10.1080/01621459.2020.1825449

With pre-defined groups, just make separate prior for each group

A two-step approach could estimate groups based on correlation matrix

A one-step clustering would include probabilities that each coefficient belongs to some group. You could combine clustering code with the above mentioned priors, but this is quite complex and has a challenging posterior geometry, so I don’t recommend starting from this.

2 Likes