User-friendly implementation of regularised horseshoe

Hello @avehtari and all,

even though I have used regularised horseshoe before, I am getting a bit lost in my code I was wondering if there is a user friendly implementation, not much more complex than

beta ~ double_exponential(0, 1);

The goal is to control better

  • the percentage of “outliers”
  • the scale for outliers

Thanks

1 Like

Hi!

I like Paul’s function in brms! You can get it when specifying a formula with horseshoe prior and displaying the code with make_stancode.

Cheers,
Lucas

Thanks Lucas I don’t use much brms, would you mind when you have sometime to paste a minimal example?

No problem, I’ll do that when I open my computer
:)

1 Like

If I define the following data and model:

library(brms)
library(rethinking)

## Sample the predictors
X <- cbind(1, rmvnorm2(100, Mu = rep(0,100), sigma = rep(1, 100), Rho = rlkjcorr(1, 100, 1)))
## Sample the parameters
beta <- rnorm(101)

## Compute the linear predictor
mu <- X %*% beta
## Sample the observed values
y <- rnorm(100, mu, sd = 0.5)

## Store everything in a data.frame
Dat <- data.frame(y = y)
Dat$X <- X

## Define the formula
formul <- bf(y ~ X)
## Define the priors
priors <- prior(horseshoe(df = 1, par_ratio = 0.5), class= b)

## DIsplay stan code
make_stancode(formul, data = Dat, prior = priors)

I obtain the following stan code, with the prior being scaled by the residual standard-error.

// generated with brms 2.8.8
functions {

  /* Efficient computation of the horseshoe prior
   * Args:
   *   zb: standardized population-level coefficients
   *   global: global horseshoe parameters
   *   local: local horseshoe parameters
   *   scale_global: global scale of the horseshoe prior
   *   c2: positive real number for regularization
   * Returns:
   *   population-level coefficients following the horseshoe prior
   */
  vector horseshoe(vector zb, vector[] local, real[] global,
                   real scale_global, real c2) {
    int K = rows(zb);
    vector[K] lambda = local[1] .* sqrt(local[2]);
    vector[K] lambda2 = square(lambda);
    real tau = global[1] * sqrt(global[2]) * scale_global;
    vector[K] lambda_tilde = sqrt(c2 * lambda2 ./ (c2 + tau^2 * lambda2));
    return zb .* lambda_tilde * tau;
  }
}
data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  real<lower=0> hs_df;
  real<lower=0> hs_df_global;
  real<lower=0> hs_df_slab;
  real<lower=0> hs_scale_global;
  real<lower=0> hs_scale_slab;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  // local parameters for horseshoe prior
  vector[Kc] zb;
  vector<lower=0>[Kc] hs_local[2];
  real temp_Intercept;  // temporary intercept
  // horseshoe shrinkage parameters
  real<lower=0> hs_global[2];
  real<lower=0> hs_c2;
  real<lower=0> sigma;  // residual SD
}
transformed parameters {
  vector[Kc] b;  // population-level effects
  b = horseshoe(zb, hs_local, hs_global, hs_scale_global * sigma, hs_scale_slab^2 * hs_c2);
}
model {
  vector[N] mu = temp_Intercept + Xc * b;
  // priors including all constants
  target += normal_lpdf(zb | 0, 1);
  target += normal_lpdf(hs_local[1] | 0, 1)
    - 101 * log(0.5);
  target += inv_gamma_lpdf(hs_local[2] | 0.5 * hs_df, 0.5 * hs_df);
  target += student_t_lpdf(temp_Intercept | 3, 3, 10);
  target += normal_lpdf(hs_global[1] | 0, 1)
    - 1 * log(0.5);
  target += inv_gamma_lpdf(hs_global[2] | 0.5 * hs_df_global, 0.5 * hs_df_global);
  target += inv_gamma_lpdf(hs_c2 | 0.5 * hs_df_slab, 0.5 * hs_df_slab);
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10);
  // likelihood including all constants
  if (!prior_only) {
    target += normal_lpdf(Y | mu, sigma);
  }
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = temp_Intercept - dot_product(means_X, b);
}

If I remove the autoscaling

## Define the priors
priors <- prior(horseshoe(df = 1, par_ratio = 0.5, autoscale = F), class= b)

## DIsplay stan code
make_stancode(formul, data = Dat, prior = priors)

I obtain this code

// generated with brms 2.8.8
functions {

  /* Efficient computation of the horseshoe prior
   * Args:
   *   zb: standardized population-level coefficients
   *   global: global horseshoe parameters
   *   local: local horseshoe parameters
   *   scale_global: global scale of the horseshoe prior
   *   c2: positive real number for regularization
   * Returns:
   *   population-level coefficients following the horseshoe prior
   */
  vector horseshoe(vector zb, vector[] local, real[] global,
                   real scale_global, real c2) {
    int K = rows(zb);
    vector[K] lambda = local[1] .* sqrt(local[2]);
    vector[K] lambda2 = square(lambda);
    real tau = global[1] * sqrt(global[2]) * scale_global;
    vector[K] lambda_tilde = sqrt(c2 * lambda2 ./ (c2 + tau^2 * lambda2));
    return zb .* lambda_tilde * tau;
  }
}
data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  real<lower=0> hs_df;
  real<lower=0> hs_df_global;
  real<lower=0> hs_df_slab;
  real<lower=0> hs_scale_global;
  real<lower=0> hs_scale_slab;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  // local parameters for horseshoe prior
  vector[Kc] zb;
  vector<lower=0>[Kc] hs_local[2];
  real temp_Intercept;  // temporary intercept
  // horseshoe shrinkage parameters
  real<lower=0> hs_global[2];
  real<lower=0> hs_c2;
  real<lower=0> sigma;  // residual SD
}
transformed parameters {
  vector[Kc] b;  // population-level effects
  b = horseshoe(zb, hs_local, hs_global, hs_scale_global, hs_scale_slab^2 * hs_c2);
}
model {
  vector[N] mu = temp_Intercept + Xc * b;
  // priors including all constants
  target += normal_lpdf(zb | 0, 1);
  target += normal_lpdf(hs_local[1] | 0, 1)
    - 101 * log(0.5);
  target += inv_gamma_lpdf(hs_local[2] | 0.5 * hs_df, 0.5 * hs_df);
  target += student_t_lpdf(temp_Intercept | 3, 3, 10);
  target += normal_lpdf(hs_global[1] | 0, 1)
    - 1 * log(0.5);
  target += inv_gamma_lpdf(hs_global[2] | 0.5 * hs_df_global, 0.5 * hs_df_global);
  target += inv_gamma_lpdf(hs_c2 | 0.5 * hs_df_slab, 0.5 * hs_df_slab);
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10);
  // likelihood including all constants
  if (!prior_only) {
    target += normal_lpdf(Y | mu, sigma);
  }
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = temp_Intercept - dot_product(means_X, b);
}

Cheers,
Lucas

Thanks,

I wish there was a one line implementation of this ;)

The one-line implementation tends to yield divergent transitions.

1 Like

brms implementation follows what I recommend, too. It has the additional benefit of using non-centered parameterization, which reduces the probability of getting funnel shaped posterior which is more difficult to sample.

1 Like

Just to confirm,

the above implementation of reg_horseshoe does (or does not) have arguments as parameters?

It is analogous of (?)

~ double exponential(0,1)

or

~ double exponential(alpha, beta);
alpha ~ ..
beta ~ ..

I am looking for the case (1) where I impose the arguments and not trying to infer them.

For example, can I set the following parameters?, based on:

  • expected fraction of “putliers”
  • extected scale of outliers
  // local parameters for horseshoe prior
  vector<lower=0>[Kc] hs_local[2];

  // horseshoe shrinkage parameters
  real<lower=0> hs_global[2];
  real<lower=0> hs_c2;

I have compressed the code and moved to functions, and kept just the key three tuning arguments (In my option but I can be wrong), decreasing of more than half the number of code lines in the main body of the model.

  functions{
  
  	 vector horseshoe_get_tp(vector zb, vector[] local, real[] global, real scale_global, real c2) {
  	    int K = rows(zb);
  	    vector[K] lambda = local[1] .* sqrt(local[2]);
  	    vector[K] lambda2 = square(lambda);
  	    real tau = global[1] * sqrt(global[2]) * scale_global;
  	    vector[K] lambda_tilde = sqrt(c2 * lambda2 ./ (c2 + tau^2 * lambda2));
  	    return zb .* lambda_tilde * tau;
  	  }
  
    real horseshoe_get_lp(vector zb, vector[] local, real df, real[] global, real df_global, real c2, real df_slab){
  
      // real<lower=0> hs_df; // == 1  // If divergencies increase this
  	  // real<lower=0> hs_df_global; // == 1
  	  // real<lower=0> hs_df_slab; // == 4 // df of the outliers
  
    	vector[6] lp;
  
    	lp[1] = normal_lpdf(zb | 0, 1);
  	  lp[2] = normal_lpdf(local[1] | 0, 1) - 101 * log(0.5);
  	  lp[3] = inv_gamma_lpdf(local[2] | 0.5 * df, 0.5 * df);
  	  lp[4] = normal_lpdf(global[1] | 0, 1)  - 1 * log(0.5);
  	  lp[5] = inv_gamma_lpdf(global[2] | 0.5 * df_global, 0.5 * df_global);
  	  lp[6] = inv_gamma_lpdf(c2 | 0.5 * df_slab, 0.5 * df_slab);
  
  	  return(sum(lp));
    }
  }
  data{
  int N;
  /// Horseshoe tuning
    real<lower=0> hs_df; // == 1  // If divergencies increase this
    real<lower=0, upper=1> par_ratio; // real<lower=0> hs_scale_global; // from par ratio   !! KEY PARAMETER
    real<lower=0> hs_scale_slab; // == 2 // regularisation/scale of outliers                !! KEY PARAMETER
  }
  parameters{
    vector[1] zbeta;
    // Horseshoe
    vector<lower=0>[1] hs_local[2]; // local parameters for horseshoe prior
    real<lower=0> hs_global[2]; // horseshoe shrinkage parameters
    real<lower=0> hs_c2; // horseshoe shrinkage parameters
  }
  transformed parameters {
  	// Horseshoe
  	vector[1] beta= horseshoe_get_tp(zbeta, hs_local, hs_global, par_ratio / sqrt(N), hs_scale_slab^2 * hs_c2);
  
  }
  model{
         // Horseshoe
  	target += horseshoe_get_lp(zbeta, hs_local, hs_df, hs_global, 1, hs_c2, 4);
  }

In the above code the parameters are given as data

  real<lower=0> hs_df;
  real<lower=0> hs_df_global;
  real<lower=0> hs_df_slab;
  real<lower=0> hs_scale_global;
  real<lower=0> hs_scale_slab;

See Sparsity information and regularization in the horseshoe and other shrinkage priors how these different parameters affect the distribution.

Yes, see Sparsity information and regularization in the horseshoe and other shrinkage priors how to set the above parameters based on that information.

1 Like

Thanks for the article. I think brms had a nice description as well (please correct me if there is any misunderstanding)

  • par_ratio = Ratio of the expected number of non-zero coefficients to the expected number of zero coefficients. So if i expect 1% of non-zero coefficient this ratio will be 1/99

  • hs_scale_slab = the expected value of the non-zero coefficient. For example if I expect the coefficient all being 0 except few being 2; 2 will be the value of this parameter