Divergent transitions with the horseshoe prior

I am trying to fit a model with a horseshoe prior, using the parametrization from
Peltola et al. (2014) and Piironen and Vehtari (2017). The situation I am looking at is a randomized clinical trial comparing a treatment with a placebo. There is a pre-treatment (baseline) measurement of the continuous outcome variable, and the outcome is the change from baseline in this variable. There are 4 subgroups of patients that might differ in how well they respond to treatment.

The model I want to fit has the following features:

  1. there may be an overall effect of the treatment, about which I want to be only very mildly skeptical (a weakly informative prior centered on no effect)
  2. the big question is whether the treatment works a lot better in one subgroup; the treatment-by-subgroup interaction is where I want to use a horseshoe prior to reflect this
  3. a random effect on the intercept (change from baseline might differ a bit between subgroups, but they are probably somewhat similar; this also helps with the fact that we have very few patients in each subgroup)
  4. the baseline value has the same effect on the change from baseline in all patients (a rough sketch of this structure follows below)
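
For concreteness, here is that rough sketch (heavily simplified, with illustrative names; the actual attached hs_model3.stan differs and includes the horseshoe scales for the interaction):

data {
  int<lower=1> N;                         // patients
  int<lower=1> K;                         // subgroups (4 here)
  int<lower=1,upper=K> subgroup[N];
  vector[N] treatment;                    // 0 = placebo, 1 = treatment
  vector[N] baseline;                     // pre-treatment measurement
  vector[N] y;                            // change from baseline
}
parameters {
  real mu_alpha;                          // overall intercept
  vector[K] alpha_raw;                    // subgroup intercepts (non-centered)
  real<lower=0> sigma_alpha;
  real beta_trt;                          // overall treatment effect
  vector[K] delta;                        // treatment-by-subgroup interaction
  real beta_base;                         // common baseline slope
  real<lower=0> sigma;
}
model {
  vector[K] alpha = mu_alpha + sigma_alpha * alpha_raw;
  mu_alpha ~ normal(0, 1);
  alpha_raw ~ std_normal();
  sigma_alpha ~ normal(0, 1);
  beta_trt ~ normal(0, 1);                // mildly skeptical, centered on no effect
  delta ~ normal(0, 1);                   // placeholder; this is where the horseshoe goes
  beta_base ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y ~ normal(alpha[subgroup] + (beta_trt + delta[subgroup]) .* treatment
             + beta_base * baseline, sigma);
}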

I have tried to simulate some data that causes the same problems I face with the real data. I keep running into divergent transitions (so I suspect I have trouble sampling the whole posterior). Even when I go to things like adapt_delta = 0.99999999 and stepsize=0.00001 (I even tried more extreme values), I do not seem to be able to get rid of the divergent transitions.

I seem to have gone through the standard suggestions: picking a good parameterization (it certainly helped compared to the “literal” one), increasing the target acceptance probability, and forcing NUTS to initially use really small stepsizes.

I have written some R code (R and Stan code are attached) to generate some similar(ish) fake data, and I have attached some of the bivariate plots that may indicate what is going on with the divergent transitions.


I am wondering what else I could try.

example_with_simulated_data.R (1001 Bytes)
hs_model3.stan (1.8 KB)

The horseshoe family of priors is a bit fragile in that you have to carefully choose the right parameterization to avoid divergences, and there are a lot of parameterizations. Each Student-t can, for example, be reparameterized as a normal-gamma mixture (see the “Optimizing Stan Code for Efficiency” section of the manual for more information). The suggested Stan programs in the appendix of the 2017 paper on the two-scale horseshoe use one and two reparameterizations, respectively, but you can sneak in one more that might help. It might also be useful to reduce the “slab” scale to something more reasonable.
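
For example, a positive half-Student-t scale (like the horseshoe’s global or local scales) can be written as a half-normal times the square root of an inverse-gamma variable. A minimal prior-only sketch, with illustrative names and the half-t scale s and degrees of freedom nu passed in as data:

data {
  real<lower=1> nu;     // degrees of freedom of the half-t
  real<lower=0> s;      // scale of the half-t
}
parameters {
  real<lower=0> aux1;   // half-normal component
  real<lower=0> aux2;   // inverse-gamma component
}
transformed parameters {
  // equivalent to  scale_t ~ student_t(nu, 0, s)  restricted to [0, inf)
  real<lower=0> scale_t = s * aux1 * sqrt(aux2);
}
model {
  aux1 ~ std_normal();                   // half-normal because of the lower bound
  aux2 ~ inv_gamma(0.5 * nu, 0.5 * nu);
}

This is the same decomposition used for the global scale in the paper’s appendix (named aux1_global and aux2_global there).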

Long story short: the horseshoe is a relatively new and complex distribution and we’re still understanding how best to parameterize it in practice.

1 Like

There are two Piironen & Vehtari (2017) papers, and Mike is referring to the second one, with the regularized horseshoe. Here’s the link for the paper and code examples: [1707.01694] Sparsity information and regularization in the horseshoe and other shrinkage priors

2 Likes

@avehtari writes too many papers.

2 Likes

Thanks, the regularized horseshoe seems like a good choice for me. It does not seem to suffer from the same problems when the priors are reasonable, and I do have some prior belief about how large an effect could possibly be.
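
As far as I understand, the slab is what encodes that belief: in the regularized horseshoe the effective local scale is lambda_tilde_j = sqrt( c^2 * lambda_j^2 / (c^2 + tau^2 * lambda_j^2) ), so coefficients that escape the shrinkage behave roughly like draws from normal(0, c), and the slab scale can therefore be set from the largest effect I consider plausible.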

Dear Vehtari,
Under the C.1 code section for the logistic regression it is suggested that:

data { ...
  int<lower=0,upper=1> y[n]; # outputs
... }
transformed parameters { ...
  tau = aux1_global * sqrt(aux2_global) * scale_global;
... }
model { ...
  y ~ bernoulli_logit(f); }

But as far as I can see, we do not have aux1_global and aux2_global in the parameters section of the regression model, so it seems to me something is wrong here.
Do you suggest that we should move tau from the model section to the transformed parameters section and define these new parameters?

3 Likes

Wow, you are the first one to notice that it is wrong! That piece of code refers to the C.2 alternative parameterization which we first had as the default. When we switched to the simpler parameterization in C.1 we forgot to update that part!

Here’s the complete Stan code, also using the bernoulli_logit_glm function for faster computation:

data {
  int<lower=0> n;             // number of observations
  int<lower=0> d;             // number of predictors
  int<lower=0,upper=1> y[n];  // outputs
  matrix[n,d] x;              // inputs
  real<lower=0> scale_icept;  // prior std for the intercept
  real<lower=0> scale_global; // scale for the half-t prior for tau
  real<lower=1> nu_global;    // degrees of freedom for the half-t prior for tau
  real<lower=1> nu_local;     // degrees of freedom for the half-t priors for lambdas
                              // (nu_local = 1 corresponds to the horseshoe)
  real<lower=0> slab_scale;   // slab scale for the regularized horseshoe
  real<lower=0> slab_df;      // slab degrees of freedom for the regularized horseshoe
}

parameters {
  real beta0;
  vector[d] z;                // for non-centered parameterization
  real <lower=0> tau;         // global shrinkage parameter
  vector <lower=0>[d] lambda; // local shrinkage parameter
  real<lower=0> caux;
}

transformed parameters {
  vector[d] beta;                     // regression coefficients
  {
    vector[d] lambda_tilde;   // 'truncated' local shrinkage parameter
    real c = slab_scale * sqrt(caux); // slab scale
    lambda_tilde = sqrt( c^2 * square(lambda) ./ (c^2 + tau^2*square(lambda)));
    beta = z .* lambda_tilde*tau;
  }
}

model {
  // half-t priors for lambdas and tau, and inverse-gamma for c^2
  z ~ std_normal();
  lambda ~ student_t(nu_local, 0, 1);
  tau ~ student_t(nu_global, 0, scale_global*2);  // *2 approximates the pseudo-sigma of the logit link
  caux ~ inv_gamma(0.5*slab_df, 0.5*slab_df);
  beta0 ~ normal(0, scale_icept);
  
  y ~ bernoulli_logit_glm(x, beta0, beta);
}
generated quantities {
  vector[n] f = beta0 + x*beta;  // linear predictor
  vector[n] log_lik;             // pointwise log-likelihood (e.g. for loo)
  for (i in 1:n)
    log_lik[i] = bernoulli_logit_glm_lpmf({y[i]} | [x[i]], beta0, beta);
}
2 Likes