# Question about the horseshoe prior and the QR decomposition

I wrote a bit of code to understand the QR decomposition better and I have some questions. My program first generates some fake data, then I recover the parameters without the QR decomposition, with the QR decomposition, and with the QR decomposition using the horseshoe prior. As expected, the QR decomposition speeds things up a lot. What I don’t understand is why things slow down so much if I use horseshoe priors. These are the times:

• Without the QR decomposition: 184 seconds
• With the QR decomposition: 77 seconds
• With the QR decomposition and horseshoe priors: 224 seconds
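For readers unfamiliar with the trick being timed here, this is a rough NumPy sketch of the QR reparameterization (with the `sqrt(n-1)` scaling the Stan manual recommends); the variable names and sizes are made up for illustration:

```python
import numpy as np

# Sketch of the QR reparameterization used to speed up regression
# sampling: X = Q* R*, with Q* = Q * sqrt(n-1) and R* = R / sqrt(n-1).
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)

Q, R = np.linalg.qr(X)             # thin QR: Q is n x p, R is p x p
Q_star = Q * np.sqrt(n - 1)
R_star = R / np.sqrt(n - 1)

# The model is fit on theta = R* @ beta; because Q* has orthogonal
# columns, the posterior geometry is much easier for the sampler.
theta = R_star @ beta
beta_back = np.linalg.solve(R_star, theta)   # recover beta afterwards

assert np.allclose(X @ beta, Q_star @ theta) # same linear predictor
```

The speedup comes from sampling `theta` against the orthogonal columns of `Q_star` instead of the (possibly correlated) columns of `X`.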

These are the questions that I have:

1. Am I doing something wrong or is this slowdown expected? If the slowdown is expected, could you point me to an explanation?

2. What is the advantage of using the horseshoe prior over just using N(0,1)?

In case it is helpful, I saved a markdown file with my code here

Thanks a lot!


The horseshoe prior can be hard to fit; see https://projecteuclid.org/euclid.ejs/1513306866. The regularised horseshoe has been proposed as a better alternative: Bayes Sparse Regression - reg. horseshoe for multinomial model (though it is also harder to fit than a normal prior).

If you assume sparsity over your parameters, a N(0,1) prior will not model it: it leads to poor estimates of the approximately-zero parameters and overly shrunk estimates for the far-from-zero parameters. See https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html
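To see the difference in the priors themselves, here is a plain Monte Carlo sketch (not Stan code) comparing N(0,1) draws with draws from the horseshoe prior `beta ~ N(0, tau * lambda)`, `lambda ~ half-Cauchy(0, 1)`, with the global scale `tau` fixed at 0.1 purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100_000
tau = 0.1  # illustrative fixed global scale; in practice tau gets a prior

normal = rng.normal(size=m)
lam = np.abs(rng.standard_cauchy(size=m))   # half-Cauchy local scales
horseshoe = rng.normal(size=m) * tau * lam

# The horseshoe puts far more mass very close to zero (sparsity) ...
print(np.mean(np.abs(horseshoe) < 0.01))    # >> P(|N(0,1)| < 0.01) ~ 0.008
# ... while its heavy Cauchy tails still allow very large effects.
print(np.max(np.abs(horseshoe)), np.max(np.abs(normal)))
```

This spike-near-zero plus heavy-tail shape is exactly what lets the horseshoe shrink the noise coefficients hard without over-shrinking the real signals.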

@avehtari feel free to correct me if you think it appropriate.


Great answer. I’ll just add that there are several examples of when the horseshoe helps and when it does not at https://github.com/avehtari/modelselection_tutorial

Once the non-zero parameters have been identified, is it then recommended to refit the model including only those, with non-horseshoe priors?

If so: in the study above the Laplace prior identifies the same non-zero elements as the horseshoe, although the quality of the estimates differs. That would suggest Laplace priors might be good enough just for identification.

The decision problem of identifying a non-zero parameter is very difficult. If you have uncertainty about which parameters are non-zero, then it is recommended to integrate over that uncertainty, that is, just use the full model with all parameters and the horseshoe to make predictions.

If for some reason you want to make predictions in the future with fewer covariates, or you want to analyse which covariates could be left out, then it is recommended to condition the inference on the full model and consider how to make the optimal inference if some of the covariates are not used for making the predictions. Simply refitting the model would ignore the uncertainty in the left-out parameters, producing sub-optimal inference after the selection. Much improved performance can be obtained by projecting the full model posterior to the restricted model subspace. See theory in A survey of Bayesian predictive methods for model assessment, selection and comparison, experiments in Comparison of Bayesian predictive methods for model selection, several illustrative examples of using the projpred package at my model selection tutorial page, and a StanCon tutorial video
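The projection idea for the Gaussian linear case can be sketched in a few lines of NumPy (in practice you would use the projpred R package; the posterior draws below are random stand-ins and the sizes are made up): each full-model posterior draw is projected onto the submodel by a least-squares fit on the full model's linear predictor, rather than refitting on the data.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.normal(size=(n, p))
draws = rng.normal(size=(500, p))   # stand-in full-model posterior draws

keep = [0, 2, 5]                    # covariates retained in the submodel
X_sub = X[:, keep]

# Per draw: beta_sub = argmin || X @ beta_full - X_sub @ beta_sub ||,
# i.e. the submodel coefficients that best reproduce the full model's
# fitted values, carrying the full posterior's uncertainty along.
fitted = X @ draws.T                               # n x 500 linear predictors
beta_sub, *_ = np.linalg.lstsq(X_sub, fitted, rcond=None)   # 3 x 500
print(beta_sub.shape)
```

Because every posterior draw is projected, the submodel inherits the full model's uncertainty instead of pretending the dropped covariates were known to be zero.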

Is there something missing from these two sentences?

https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html
from above, and the Laplace prior shows the spikes at the same positions. So I thought: why not use the simpler Laplace prior for identification? I understand that the horseshoe and Laplace priors are not guaranteed to give the same results. However, in my experience the same holds for the regularized horseshoe: depending on the values for slab, df, … I get different results.

In that case study the non-zero effects have a large magnitude and are easier to spot. You can repeat the experiments by making the magnitude of non-zero effects smaller and smaller, and compare how easy it is to identify them.

Especially in the case of p >> n (where p is the number of parameters and n is the number of observations) there is a non-identifiability problem: the posterior can be sensitive to the priors, and the focus should be on the predictive distributions, which are less sensitive to the priors. Do you also get a lot of variation in the predictive distributions with changes in the values for slab, df, etc.? Comparison of Bayesian predictive methods for model selection demonstrates the posterior distribution sensitivity (MAP, Median, MPP) and the much better performance obtained using predictive distributions and the correct decision-theoretical approach of conditioning the inference after the selection also on the full model (proj).

The Laplace does an okay job of selecting out the large slopes in that case study, but the posteriors are still biased below the true values for these large slopes. Moreover, the Laplace also pulls the posteriors for the small values to more extreme values, resulting in larger uncertainty. These behaviors are consistent with theoretical expectations, as discussed in the case study.
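These two behaviours can be illustrated in the simple normal-means model `y | b ~ N(b, 1)` (a toy sketch, not the case study's model; `tau = 1` and the Laplace rate are arbitrary choices): the Laplace-prior MAP is soft thresholding, which biases even large effects downward by a constant, whereas the horseshoe posterior mean (estimated here by importance sampling from the prior) pulls small effects strongly to zero while leaving large effects nearly unshrunk.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 1_000_000
lam_scale = np.abs(rng.standard_cauchy(size=m))   # half-Cauchy local scales
prior = rng.normal(size=m) * lam_scale            # horseshoe prior, tau = 1

def horseshoe_post_mean(y):
    # importance sampling: weight prior draws by the N(y | b, 1) likelihood
    w = np.exp(-0.5 * (y - prior) ** 2)
    return np.sum(w * prior) / np.sum(w)

def laplace_map(y, lam=1.0):
    # MAP under a Laplace prior = soft thresholding by lam
    return np.sign(y) * max(abs(y) - lam, 0.0)

for y in (0.5, 5.0):
    print(y, laplace_map(y), horseshoe_post_mean(y))
# small effect: both shrink it hard; large effect: the Laplace MAP
# still loses a full lam, while the horseshoe mean stays close to y
```

This matches the case study's observation: the Laplace buys its sparsity at the cost of a constant bias on the large slopes, while the horseshoe's shrinkage vanishes for strong signals.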