# Question about the horseshoe prior and the QR decomposition

I wrote a bit of code to understand the QR decomposition better and I have some questions. My program first generates some fake data, then I recover the parameters without the QR decomposition, with the QR decomposition, and with the QR decomposition using the horseshoe prior. As expected, the QR decomposition speeds things up a lot. What I don’t understand is why things slow down so much if I use horseshoe priors. These are the times:

• Without the QR decomposition: 184 seconds
• With the QR decomposition: 77 seconds
• With the QR decomposition and horseshoe priors: 224 seconds
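For readers unfamiliar with the trick being timed here, this is a rough NumPy sketch of the QR reparameterization (with the `sqrt(n-1)` scaling the Stan manual recommends); the variable names and sizes are made up for illustration:

```python
import numpy as np

# Sketch of the QR reparameterization used to speed up regression
# sampling: X = Q* R*, with Q* = Q * sqrt(n-1) and R* = R / sqrt(n-1).
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)

Q, R = np.linalg.qr(X)             # thin QR: Q is n x p, R is p x p
Q_star = Q * np.sqrt(n - 1)
R_star = R / np.sqrt(n - 1)

# The model is fit on theta = R* @ beta; because Q* has orthogonal
# columns, the posterior geometry is much easier for the sampler.
theta = R_star @ beta
beta_back = np.linalg.solve(R_star, theta)   # recover beta afterwards

assert np.allclose(X @ beta, Q_star @ theta) # same linear predictor
```

The speedup comes from sampling `theta` against the orthogonal columns of `Q_star` instead of the (possibly correlated) columns of `X`.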

These are the questions that I have:

1. Am I doing something wrong or is this slowdown expected? If the slowdown is expected, could you point me to an explanation?

2. What is the advantage of using the horseshoe prior over just using N(0,1)?

In case it is helpful, I saved a markdown file with my code here

Thanks a lot!


The horseshoe prior can be hard to fit; see https://projecteuclid.org/euclid.ejs/1513306866. The regularised horseshoe has been proposed as a better alternative: Bayes Sparse Regression - reg. horseshoe for multinomial model (though it is also harder to fit than a normal prior).

If you assume sparsity over your parameters, a N(0,1) prior will not model it: it leads to poor estimates of the approximately-zero parameters and overly shrunk estimates for the far-from-zero parameters. See https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html
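To see the difference in the priors themselves, here is a plain Monte Carlo sketch (not Stan code) comparing N(0,1) draws with draws from the horseshoe prior `beta ~ N(0, tau * lambda)`, `lambda ~ half-Cauchy(0, 1)`, with the global scale `tau` fixed at 0.1 purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100_000
tau = 0.1  # illustrative fixed global scale; in practice tau gets a prior

normal = rng.normal(size=m)
lam = np.abs(rng.standard_cauchy(size=m))   # half-Cauchy local scales
horseshoe = rng.normal(size=m) * tau * lam

# The horseshoe puts far more mass very close to zero (sparsity) ...
print(np.mean(np.abs(horseshoe) < 0.01))    # >> P(|N(0,1)| < 0.01) ~ 0.008
# ... while its heavy Cauchy tails still allow very large effects.
print(np.max(np.abs(horseshoe)), np.max(np.abs(normal)))
```

This spike-near-zero plus heavy-tail shape is exactly what lets the horseshoe shrink the noise coefficients hard without over-shrinking the real signals.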

@avehtari feel free to correct me if you think it appropriate.


Great answer. I’ll just add that there are several examples of when the horseshoe helps and when it does not at https://github.com/avehtari/modelselection_tutorial

Once the non-zero parameters have been identified, is it then recommended to refit the model including only those, with non-horseshoe priors?

If so: in the study above the Laplace prior identifies the same non-zero elements as the horseshoe, although the quality of the estimates differs. That would suggest Laplace priors might be good enough just for identification.

The decision problem of identifying a non-zero parameter is very difficult. If you have uncertainty about which parameters are non-zero, then it is recommended to integrate over that uncertainty, that is, just use the full model with all parameters and the horseshoe to make predictions.

If for some reason you want to make predictions in the future with fewer covariates, or you want to analyse which covariates could be left out, then it is recommended to condition the inference on the full model and consider how to make the optimal inference if some of the covariates are not used for making the predictions. Simply refitting the model would ignore the uncertainty in the left-out parameters, producing sub-optimal inference after the selection. Much improved performance can be obtained by projecting the full model posterior to the restricted model subspace. See theory in A survey of Bayesian predictive methods for model assessment, selection and comparison, experiments in Comparison of Bayesian predictive methods for model selection, several illustrative examples of using the projpred package at my model selection tutorial page, and a StanCon tutorial video
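The projection idea for the Gaussian linear case can be sketched in a few lines of NumPy (in practice you would use the projpred R package; the posterior draws below are random stand-ins and the sizes are made up): each full-model posterior draw is projected onto the submodel by a least-squares fit on the full model's linear predictor, rather than refitting on the data.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.normal(size=(n, p))
draws = rng.normal(size=(500, p))   # stand-in full-model posterior draws

keep = [0, 2, 5]                    # covariates retained in the submodel
X_sub = X[:, keep]

# Per draw: beta_sub = argmin || X @ beta_full - X_sub @ beta_sub ||,
# i.e. the submodel coefficients that best reproduce the full model's
# fitted values, carrying the full posterior's uncertainty along.
fitted = X @ draws.T                               # n x 500 linear predictors
beta_sub, *_ = np.linalg.lstsq(X_sub, fitted, rcond=None)   # 3 x 500
print(beta_sub.shape)
```

Because every posterior draw is projected, the submodel inherits the full model's uncertainty instead of pretending the dropped covariates were known to be zero.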

Is there something missing from these two sentences?

https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html
from above, and the Laplace prior shows the spikes at the same positions. So I thought: why not use the simpler Laplace prior for identification? I understand that the horseshoe and Laplace priors are not guaranteed to give the same results. However, in my experience the same holds for the regularized horseshoe: depending on the values for slab, df, … I get different results.

In that case study the non-zero effects have a large magnitude and are easier to spot. You can repeat the experiments by making the magnitude of non-zero effects smaller and smaller, and compare how easy it is to identify them.

Especially in the case of p >> n (where p is the number of parameters and n is the number of observations) there is a non-identifiability problem: the posterior can be sensitive to the priors, and the focus should be on the predictive distributions, which are less sensitive to the priors. Do you also get a lot of variation in the predictive distributions with changes in the values for slab, df, etc.? Comparison of Bayesian predictive methods for model selection demonstrates the posterior distribution sensitivity (MAP, Median, MPP) and the much better performance obtained using predictive distributions and the correct decision-theoretical approach of conditioning the inference after the selection also on the full model (proj).

The Laplace does an okay job of selecting out the large slopes in that case study, but the posteriors are still biased below the true values for these large slopes. Moreover, the Laplace also pulls the posteriors for the small values to more extreme values, resulting in larger uncertainty. These behaviors are consistent with theoretical expectations, as discussed in the case study.
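These two behaviours can be illustrated in the simple normal-means model `y | b ~ N(b, 1)` (a toy sketch, not the case study's model; `tau = 1` and the Laplace rate are arbitrary choices): the Laplace-prior MAP is soft thresholding, which biases even large effects downward by a constant, whereas the horseshoe posterior mean (estimated here by importance sampling from the prior) pulls small effects strongly to zero while leaving large effects nearly unshrunk.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 1_000_000
lam_scale = np.abs(rng.standard_cauchy(size=m))   # half-Cauchy local scales
prior = rng.normal(size=m) * lam_scale            # horseshoe prior, tau = 1

def horseshoe_post_mean(y):
    # importance sampling: weight prior draws by the N(y | b, 1) likelihood
    w = np.exp(-0.5 * (y - prior) ** 2)
    return np.sum(w * prior) / np.sum(w)

def laplace_map(y, lam=1.0):
    # MAP under a Laplace prior = soft thresholding by lam
    return np.sign(y) * max(abs(y) - lam, 0.0)

for y in (0.5, 5.0):
    print(y, laplace_map(y), horseshoe_post_mean(y))
# small effect: both shrink it hard; large effect: the Laplace MAP
# still loses a full lam, while the horseshoe mean stays close to y
```

This matches the case study's observation: the Laplace buys its sparsity at the cost of a constant bias on the large slopes, while the horseshoe's shrinkage vanishes for strong signals.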