Understanding Regularized HS prior

I am struggling to understand the various arguments of the hs() prior in rstanarm. I read through Piironen, Vehtari 2017 but I'm still struggling with some things. The calculation of tau0, the global scale, makes sense. But I'm having trouble with the other hyperparameters, though I find adjusting them does affect my predictive accuracy. I find the term "slab" particularly confusing. What would be nice is a sense of how the various hyperparameters affect the model. For example, one argument is simply 'df'. I am unsure what exactly this refers to, as there is also a global_df and a slab_df IIRC. If I do a prior_summary call on my model in rstanarm, this df appears as the prior on the coefficients. I'm working on an n << p problem and finding many of my coefficients have very heavily skewed distributions, though I don't seem to be having any sampling issues. It makes me suspicious that some of my priors are reining the estimates in too much, but perhaps I'm misdiagnosing the issue.


hs refers to hierarchical shrinkage, and to get the regularized horseshoe set df=1 and global_df=1. If you want a sparsifying prior, it's best to leave them like that. Then you need to choose global_scale, slab_df, and slab_scale.
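For example, a minimal rstanarm sketch along those lines (the formula, data, and hyperparameter values are placeholders, not recommendations):

```r
library(rstanarm)

# df = 1 and global_df = 1 give half-Cauchy priors for the local and
# global scales, i.e. the regularized horseshoe; the remaining three
# arguments are the ones you need to choose.
fit <- stan_glm(
  y ~ .,                           # placeholder formula
  data = mydata,                   # placeholder data frame
  family = gaussian(),
  prior = hs(df = 1, global_df = 1,
             global_scale = 0.01,  # tau0, see below
             slab_df = 4, slab_scale = 2.5)
)
prior_summary(fit)
```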

That is the global_scale and it seems you have that figured out.
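For reference, the value suggested in Piironen, Vehtari (2017) is \tau_0 = \frac{p_0}{D-p_0}\frac{\sigma}{\sqrt{n}}, where p_0 is a prior guess for the number of relevant coefficients. A quick sketch with made-up numbers:

```r
p0    <- 5     # prior guess for the number of relevant coefficients
D     <- 200   # total number of coefficients
n     <- 50    # number of observations
sigma <- 1     # (approximate) noise sd; use a plug-in value for non-Gaussian likelihoods
tau0  <- p0 / (D - p0) * sigma / sqrt(n)
# then pass it on: prior = hs(global_scale = tau0, ...)
```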

slab describes the prior for the large coefficients. The slab is a t-distribution with a scale and df, which you can choose based on your prior information about the magnitude of the large coefficients. df is the local df, or local nu. The slab is important, for example, in logistic regression with separable classes, since without regularization the horseshoe would put far too much mass on infeasibly large weights.
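Concretely, since the slab amounts to \beta \sim N(0, c^2) with an inverse-gamma prior on c^2, the large coefficients marginally get a student-t(slab_df, 0, slab_scale) slab, so you can simulate it and check the implied magnitudes against your prior beliefs (the values below are only illustrative):

```r
slab_df    <- 4
slab_scale <- 2.5
# c^2 ~ Inv-Gamma(slab_df / 2, slab_df * slab_scale^2 / 2)
c2   <- 1 / rgamma(1e5, shape = slab_df / 2, rate = slab_df * slab_scale^2 / 2)
beta <- rnorm(1e5, mean = 0, sd = sqrt(c2))   # marginally t_{slab_df}(0, slab_scale)
quantile(abs(beta), c(0.5, 0.9, 0.99))        # implied sizes of "large" coefficients
```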

And which likelihood?

Posterior distributions? It's fine if the posterior is skewed.


I had a similar problem understanding the hs() prior. When global_df = 1, does that mean that \tau \sim C^+(0,\tau_0^2)? And how is df related to the model in Piironen, Vehtari (2017)?

It should be that hs() uses half student_t priors. The student_t distribution with df=1 is equivalent to a Cauchy distribution; that is the reason for setting df=1 and global_df=1.
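This is easy to verify numerically:

```r
x <- seq(-5, 5, by = 0.1)
all.equal(dt(x, df = 1), dcauchy(x))  # TRUE: student-t with df = 1 is the Cauchy
```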

Yes, I think so as well; however, I can't match the five arguments of hs() with the parameters of the regularised horseshoe. I assume hs() should match the following model:

c^2 \sim \text{inv-gamma}(\frac{\nu}{2},\frac{\nu s^2}{2} )

Where \nu is slab_df and s is slab_scale
\lambda_i \sim C^+(0,1)

\hat\lambda_i^2 = \frac{c^2\lambda_i^2}{c^2 + \tau^2\lambda_i^2}

\tau \sim \text{student-t}_{\nu_{\text{global}}}^+(0, \tau_0^2)

Where \nu_{global} is global_df and \tau_0 is global_scale.

\beta_i \sim N(0,\tau^2\hat\lambda_i^2)
So that is four parameters in total, not five as in hs().

As far as I know:

  • global_scale and global_df are for \tau (student-t distributed)
  • slab_df and slab_scale are for c^{2} (not sure how the inv-gamma is parametrised)
  • df refers to the degrees of freedom for \lambda_{i}, which is still student-t distributed. For the (regularised) horseshoe you want df=1 in order to have a Cauchy distribution
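Putting that mapping into code, a minimal R sketch that just draws coefficients from the model written above, using the hs() argument names (it mirrors the equations, not necessarily rstanarm's internal parametrisation; the default values here are only examples):

```r
draw_rhs_prior <- function(n_draws, p, df = 1, global_df = 1,
                           global_scale = 0.01, slab_df = 4, slab_scale = 2.5) {
  tau    <- abs(rt(n_draws, df = global_df)) * global_scale    # tau ~ t+_{global_df}(0, global_scale)
  lambda <- matrix(abs(rt(n_draws * p, df = df)), n_draws, p)  # lambda_i ~ t+_{df}(0, 1)
  c2     <- 1 / rgamma(n_draws, shape = slab_df / 2,
                       rate = slab_df * slab_scale^2 / 2)      # c^2 ~ inv-gamma(nu/2, nu s^2/2)
  lt2    <- c2 * lambda^2 / (c2 + tau^2 * lambda^2)            # regularized lambda_hat_i^2
  matrix(rnorm(n_draws * p, 0, sqrt(tau^2 * lt2)), n_draws, p) # beta_i ~ N(0, tau^2 lambda_hat_i^2)
}

beta <- draw_rhs_prior(1e4, p = 10)  # prior draws to sanity-check hyperparameter choices
```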

I suggest you look at the Stan implementation in Section C.1 (Appendix C) of Piironen, Vehtari (2017). It might not correspond exactly to the actual implementation in rstanarm, though.


hs() refers to Hierarchical Shrinkage (not HorseShoe). Hierarchical Shrinkage has
\lambda_i \sim t_\nu^+(0,1)
Thus the five parameters of hs() define a regularized hierarchical shrinkage prior, but if df=1 it defines the regularized horseshoe. See [1508.02502] Projection predictive variable selection using Stan+R and Sparsity information and regularization in the horseshoe and other shrinkage priors for more information.
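So, in hs() terms (other arguments left at their defaults):

```r
hs(df = 1, global_df = 1)  # regularized horseshoe
hs(df = 3, global_df = 1)  # regularized hierarchical shrinkage with t_3^+ local scales
```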
