I’m trying to fit a linear regression with the regularized horseshoe as the prior on my coefficients. n is a few hundred, p is around a thousand in my full dataset.
Here is a github repo documenting this attempt, including code, figures, and data. I’m using rstanarm and following Aki’s slides.
My trouble is that I don’t see the shrinkage I expected. Oddly, when I specify a Gaussian prior on the coefficients, they seem to shrink more toward 0, with a few staying quite large. I think I’ve either made an error in my code, or maybe I’m expecting too much from the horseshoe prior.
I’m at the boundary of my knowledge and lack good intuition about what is happening, hence my post here. There is extreme correlation among the predictors in my model, and perhaps that makes it difficult to pick out the few “truly” non-zero coefficients. Any suggestions for how to continue would be much appreciated. I’d love to try this method out with our datasets.
I can also post code here if that’s more helpful.
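In the meantime, the fit looks roughly like this (a simplified sketch with placeholder names, not the exact script in the repo):

```r
library(rstanarm)

## Simplified sketch of the fit; d holds the outcome y and the predictors.
## p0 is my rough guess at the number of relevant predictors.
n    <- nrow(d)
p    <- ncol(d) - 1
p0   <- 10
tau0 <- p0 / (p - p0) / sqrt(n)   # global scale suggested for the regularized horseshoe

fit <- stan_glm(
  y ~ ., data = d, family = gaussian(),
  prior = hs(df = 1, global_df = 1, global_scale = tau0,
             slab_df = 4, slab_scale = 2.5),
  chains = 4, iter = 2000, seed = 1
)
plot(fit)   # the coefficient plot I was comparing between priors
```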
I just had a quick look for now (too late for more). Did you notice the scales in your figures? The coefficients are smaller with the horseshoe (the scale runs from -100 to 100) than with the Gaussian (from -10000 to 10000). Also, with this strong collinearity the marginal posteriors are difficult to interpret.
I see the very large scale difference now (somehow I wasn’t paying attention to that and was just looking at the shape of plot(fit)). On second look, I realize the coefficients that might have been non-zero in the Gaussian case could just be due to noise.
Looking at the correlation matrix, there are really only a few distinct groups of predictors. I (naively) thought that the coefficients for all but the strongest predictor(s) in each group would get shrunk towards zero (basically using this for variable selection). Given the trouble with multicollinearity, do you have suggestions for a better approach?
OK, thanks for the information and suggestion. Currently partial least squares regression is our group’s bread and butter, but we want to compare it to other methods, and supervised PCA is definitely one of them.
Thanks, this is very helpful. I will have to spend some time with your paper. Sections 3.4 and 3.7 seem especially relevant for what I want to do, but they will require more reading on my part.
I’m curious to compare the permutation-test approach you use to select features with the cross-validation approach implemented in the superpc R package. The best alpha value for the permutation test seems dataset-dependent. How did you decide to use alpha = 0.01, and was the performance sensitive to this value?
For supervised PCA I used alpha = 0.001 in all experiments. The motivation was that the number of false positives would then be quite small (a few features, on expectation, for datasets with a few thousand variables), and that would presumably not affect the results much, since in most of the datasets several hundred features survived the screening. Using, for example, alpha = 0.05 instead might have some effect, since the number of false positives would be larger, but I did not test this systematically. Still, I believe that utilizing more than only the first supervised component somewhat mitigates the effect of a non-optimal choice of alpha. In the original paper by Bair et al. (2006) only the first supervised component was used, and I would expect that to be much more sensitive to how the threshold is chosen.
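To make the screening concrete, here is a rough sketch of the idea (illustrative only, not the exact code used in the paper; x is the predictor matrix and y the outcome):

```r
## Illustrative sketch of the screening step in supervised PCA (not the paper's implementation).
spca_screen <- function(x, y, alpha = 0.001, n_perm = 100) {
  score <- abs(as.vector(cor(x, y)))                 # univariate relevance score per feature
  ## permutation null: recompute the scores with the outcome permuted
  null  <- replicate(n_perm, abs(as.vector(cor(x, sample(y)))))
  ## per-feature p-value against the pooled null distribution
  pval  <- sapply(score, function(s) mean(null >= s))
  keep  <- which(pval < alpha)                       # expected false positives is about alpha * p
  pcs   <- prcomp(x[, keep, drop = FALSE], scale. = TRUE)
  list(keep = keep, z = pcs$x)                       # supervised components for the regression
}
```

With a few thousand variables and alpha = 0.001, only a few truly irrelevant features are expected to pass this screen.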
For iterative supervised PCA I used alpha = 0.01 for the stopping criterion of the supervised iteration. Here, too, the motivation was that it would be rare to make a false discovery for a given dataset (since the number of supervised iterations is usually small, most of the time only one or two). Again, I did not test this threshold systematically, but for most of the datasets the p-values were typically either very small (<0.001) or much larger than 0.01.
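The iterative version repeats this kind of screening after removing the variation explained by the components already found, and the same permutation test gives the stopping rule. Very roughly (the details differ from the actual implementation in the paper):

```r
## Very rough sketch of the iterative stopping rule (illustrative only).
ispca_sketch <- function(x, y, alpha = 0.01, max_iter = 10, n_perm = 100) {
  comps <- list()
  for (k in seq_len(max_iter)) {
    score <- abs(as.vector(cor(x, y)))
    null  <- replicate(n_perm, abs(as.vector(cor(x, sample(y)))))
    ## one p-value per iteration: is the strongest feature stronger than expected under the null?
    p_max <- mean(apply(null, 2, max) >= max(score))
    if (p_max >= alpha) break                         # stop: nothing looks related to y anymore
    keep  <- which(sapply(score, function(s) mean(null >= s)) < alpha)
    pc1   <- prcomp(x[, keep, drop = FALSE], scale. = TRUE)$x[, 1]
    comps[[k]] <- pc1
    ## deflate: remove the variation explained by this component before the next iteration
    x <- x - outer(pc1, as.vector(crossprod(pc1, x)) / sum(pc1^2))
  }
  do.call(cbind, comps)
}
```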