Thanks again, I think I am making progress.
First, I probably should have said that I have J=40 observations (though note that the target is a latent variable, i.e. a parameter) and K=11 predictors. I guess that with just 11 predictors, feature selection, e.g. via a horseshoe prior, is probably not the best choice.
The hierarchical modeling approach seems promising. My problem is that the effects are additive, and thus (as expected) the magnitudes of the individual associations are reduced compared to when the associations are estimated one at a time. I understand that not all coefficients can be large, but my point is that two variables that are correlated (because they share common information) should have associations of similar magnitude.
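For reference, the joint hierarchical regression I mean looks roughly like this, as opposed to K separate single-predictor regressions (just a sketch; the Student-t priors and scales are placeholders):

data {
  int<lower=1> J;          // number of observations
  int<lower=1> K;          // number of moderators
  matrix[J, K] Y;          // moderators
  vector[J] theta_j;       // posterior mean of theta_j, used as target
}
parameters {
  vector[K] beta;
  real<lower=0> tau;       // shared scale that pools the coefficients
  real<lower=0> sigma;
}
model {
  beta ~ student_t(4, 0, tau);   // hierarchical shrinkage toward zero
  tau ~ student_t(4, 0, 0.125);
  sigma ~ student_t(4, 0, 0.1);
  theta_j ~ normal(Y * beta, sigma);
}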
A way around this could be a latent factor model, which I already tried, but the issue is that the factor loadings are not invariant to the ordering of the variables (see my issue here). Besides, I would want to retain only those latent factors that are most strongly associated with the target (which is why I found the ISPC approach interesting). Alternatively, one could consider all latent factors (analogously, all principal components) and then estimate their association with the target, probably with a sparsity prior.
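To make that last idea concrete, here is a rough sketch (not my actual model) of putting a hierarchical shrinkage prior on the coefficients of all principal components, computed via the SVD in transformed data. It assumes Y is column-centered, J >= K, and a Stan version with svd_U (2.26+); the Student-t priors are again just placeholders:

data {
  int<lower=1> J;
  int<lower=1> K;
  matrix[J, K] Y;           // assumed column-centered (and possibly scaled)
  vector[J] theta_j;
}
transformed data {
  // Principal component scores via the thin SVD: Y = U * diag(d) * V'
  // (assumes J >= K so that svd_U(Y) is J x K)
  matrix[J, K] U = svd_U(Y);
  vector[K] d = singular_values(Y);
  matrix[J, K] scores = diag_post_multiply(U, d);   // = U * diag(d)
}
parameters {
  vector[K] gamma;          // association of each component with the target
  real<lower=0> tau;
  real<lower=0> sigma;
}
model {
  gamma ~ student_t(4, 0, tau);   // shrinkage; a sparser prior (e.g. horseshoe) could go here
  tau ~ student_t(4, 0, 0.125);
  sigma ~ student_t(4, 0, 0.1);
  theta_j ~ normal(scores * gamma, sigma);
}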
Edit: I messed up the code; the summary below seems to be rubbish, so I have hidden it.
Summary
What I now did was to combine the suggestion from @mike-lawrence for the QR decomposition (which, similar to PCA, gives me an orthogonal matrix) with your suggestion of the hierarchical model. That is, I compute the orthogonal matrix Q, where Y = QR, and put a hierarchical prior on \tilde{\beta}.
Here is example code where, for simplicity, I used the posterior mean of \theta_j as the target:
data {
  int<lower=1> J;                  // number of observations
  int<lower=1> K;                  // number of moderators
  matrix[J, K] Y;                  // moderators
  vector[J] theta_j;               // posterior mean of theta_j, used as target
}
transformed data {
  // Compute, thin, and then scale the QR decomposition of Y
  // (note: the Stan user's guide scales by sqrt(J - 1) rather than J)
  matrix[J, K] Q = qr_Q(Y)[, 1:K] * J;
  matrix[K, K] R = qr_R(Y)[1:K, ] / J;
  matrix[K, K] R_inv = inverse(R);
}
parameters {
  real<lower=0> sigma;
  vector[K] beta_tilde;            // coefficients on the orthogonal Q scale
  real<lower=0> tau;               // shared scale pooling the coefficients
}
transformed parameters {
  vector[K] beta = R_inv * beta_tilde;   // coefficients on the original scale of Y
}
model {
  // priors
  beta_tilde ~ student_t(4, 0, tau);
  tau ~ student_t(4, 0, 0.125);
  sigma ~ student_t(4, 0, 0.1);
  // likelihood
  theta_j ~ normal(Q * beta_tilde, sigma);
}
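For checking, one could append a generated quantities block along these lines (a sketch): the fitted values computed from the original Y and beta should match Q * beta_tilde up to numerical error, and theta_rep gives quick posterior predictive draws to compare against theta_j.

generated quantities {
  // Fitted values on the original predictor scale;
  // Y * beta = Q * R * inverse(R) * beta_tilde = Q * beta_tilde, so this should
  // match the linear predictor up to numerical error
  vector[J] theta_hat = Y * beta;
  // quick posterior predictive draws for comparison with theta_j
  vector[J] theta_rep = to_vector(normal_rng(theta_hat, sigma));
}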
The question is: does this make sense? Would it be better to use the hierarchical prior on the principal components?