Linear regression with horseshoe prior, not seeing the shrink

tedward · January 29, 2018, 8:41pm

I’m trying to fit a linear regression with the regularized horseshoe as the prior on my coefficients. n is a few hundred, p is around a thousand in my full dataset.

Here is a github repo documenting this attempt, including code, figures, and data. I’m using rstanarm and following Aki’s slides.

My trouble is that I don’t see the shrinkage that I expected. Oddly when I specify the prior on the coefficients as gaussian, the coefficients seem to shrink more to 0 with a few staying quite large. I think I’ve either made an error in my code, or maybe I’m expecting too much from the horseshoe prior.

I’m at the boundary of my knowledge and lack good intuition about what is happening, hence my post here. There is extreme correlation among predictors in my model and perhaps it’s difficult to pick the few “truly” non-zero coefficients. Any suggestions for how to continue would be much appreciated. I’d love to try this method out with our datasets.

I can also post code here too if that’s more helpful.

Thanks for your help.

avehtari · January 29, 2018, 10:09pm

I had now just a quick look (too late for more). Did you notice the scale in your figures? Coefficients are smaller with horseshoe (scale is from -100 to 100) than with Gaussian (scale is from -10000 to 10000). Also with this strong collinearity marginal posteriors are difficult to interpret.

tedward · January 30, 2018, 2:07pm

I see the very large scale difference now (somehow I wasn’t paying attention to that and just looking at the shape of the plot(fit)), and on second look I realize the coefficients that might have been non zero in the gaussian case could just be due to noise.

Looking at the correlation matrix, there are really only a few distinct groups of predictors. I (naively) thought that the coefficients for all but the strongest predictor(s) in each group would get shrunk towards zero (basically using this for variable selection). Given the trouble of multicollinearity, do you have suggestions of a better approach?

avehtari · January 30, 2018, 4:30pm

Sometimes horseshoe picks the most relevant ones from groups of correlating variables, but usually this is case with less variables.

Here there is likely to be a lot of information shared between variables. You could try dimension reduction first as discussed here [1710.06229] Iterative Supervised Principal Components

avehtari · January 30, 2018, 5:50pm

Sorry, I just realised that code is not yet online, but we’ll put it online soon (not later than February).

tedward · January 30, 2018, 6:14pm

OK, thanks for the information and suggestion. Currently partial least squares regression is our group’s bread and butter, but we want to compare it to other methods, and supervised PCA is definitely one of them.

avehtari · January 30, 2018, 6:48pm

If you know how to do SPCA, then you can combine that with (regularized) horseshoe regression as in our paper (and then continue with projpred)

avehtari · February 5, 2018, 8:10pm

Now the code for iterative supervised PCA described in [1710.06229] Iterative Supervised Principal Components (+supervised PCA and plain PCA) is here GitHub - jpiironen/dimreduce

tedward · February 7, 2018, 4:42pm

Thanks, this is very helpful. I will have to spend some time with your paper. Sections 3.4 and 3.7 seem especially relevant for what I want to do but will require I do more reading.

I’m curious to compare the permutation test approach you use to select features and the cross validation approach as implemented in the superpc R package. The best alpha value for the permutation test seems like it is dataset dependent. How did you decide to use alpha = 0.01 and was the performance sensitive to this value?

jpiironen · February 8, 2018, 7:16am

For supervised PCA I used alpha = 0.001 in all experiments. The motivation was that then the number of false positives would be quite small (a few features for datasets with a few thousand variables on expectation) and that would not presumably affect the results much since in most of the datasets there were several hundred features that survived the screening. Using for example alpha = 0.05 instead might have some effect since the number of false positives would be larger, but I did not do systematic testing about this. Still I believe that utilizing more than only the first supervised component somewhat mitigates the effect of non-optimal selection of alpha. In the original paper of Bair et al. (2006) they used only the first supervised component, and I would expect that to be much more sensitive to how the threshold is chosen.

For iterative supervised PCA I used alpha = 0.01 for the stopping criterion of the supervised iteration. Also here the motivation was that it would be rare to make a false discovery for a given dataset (since the number of supervised iterations is usually small, most of the time only one or two). Again I did not do systematic tests for this threshold, but for most of the datasets I observed that the p-values were typically either very small (<0.001) or much larger than 0.01.

Topic		Replies	Views
Too much shrinkage with horseshoe prior Modeling prior-choice , projpred , horseshoe-prior	5	3128	November 15, 2020
How to extent (regularized) horseshoe prior to multiple outcome setting? Modeling regularization	28	3780	May 8, 2020
Getting bad results with horseshoe prior Modeling fitting-issues	15	2439	May 1, 2018
How to make the regularized horseshoe less strict? Modeling	2	697	October 9, 2020
Horseshoe regularization on matrix factorization not working Modeling techniques	4	221	July 19, 2024

Linear regression with horseshoe prior, not seeing the shrink

Related topics