Getting bad results with horseshoe prior

I took the sample code at the end of this paper, copied below:

data {
  int<lower=0> n;               // number of observations
  int<lower=0> d;               // number of predictors
  vector[n] y;                  // outputs
  matrix[n,d] x;                // inputs
  real<lower=0> scale_icept;    // prior std for the intercept
  real<lower=0> scale_global;   // scale for the half-t prior for tau
                                // ( tau0 = scale_global*sigma )
  real<lower=1> nu_global;      // degrees of freedom for the half-t prior for tau
  real<lower=1> nu_local;       // degrees of freedom for the half-t priors for lambdas
                                // ( nu_local = 1 corresponds to the horseshoe )
parameters {
  real beta0;        // intercept
  real logsigma;     // log of noise std
                     // auxiliary variables that define the global and local parameters
  vector[d] z;
  real<lower=0> r1_global;
  real<lower=0> r2_global;
  vector<lower=0>[d] r1_local;
  vector<lower=0>[d] r2_local;
transformed parameters {
  real <lower=0> tau;         // global shrinkage parameter
  vector <lower=0>[d] lambda; // local shrinkage parameters
  vector[d] beta;             // regression coefficients
  vector[n] f;                // latent values
  real sigma;                // noise std
  sigma = exp(logsigma);
  lambda = r1_local .* sqrt(r2_local);
  tau = r1_global * sqrt(r2_global);
  beta = z .* lambda*tau;
  f = beta0 + x*beta;
model {
  // half -t priors for lambdas
  z ~ normal (0, 1);
  r1_local ~ normal (0.0 , 1.0);
  r2_local ~ inv_gamma (0.5* nu_local , 0.5* nu_local);
  // half -t prior for tau
  r1_global ~ normal (0.0 , scale_global * sigma);
  r2_global ~ inv_gamma (0.5* nu_global , 0.5* nu_global);
  // gaussian prior for the intercept
  beta0 ~ normal (0, scale_icept);
  // observation model
  y ~ normal (f, sigma);

I have simulated 30 observations from 20 non-zero coefficients and 180 zeros as follows:

# simulation function
normsim <- function(n, betas, disp){
  x <- apply(t(rep(n,length(betas)-1)), 2, rnorm)
  mu <- betas[1] + x%*%betas[-1]
  y <- rnorm(n, mu, sqrt(disp))

# do it
n <- 30
betas <- c(2,
disp <- 0.01

d <- normsim(n,betas,disp)

# estimation

  # calculating recommended scale_global
p0 <- 20
D <- 180
sg <- p0/((D-p0)*sqrt(n))

list(n = length(d[,1]), d = ncol(d[,-1]),
     y = d[,1], x = d[,-1],
     scale_icept = 5, scale_global = sg,
     nu_global = 1, nu_local = 1) -> d.stan

stan(file = "normal-hs.stan",
     data = d.stan, iter = 2000, chains = 4,
     control = list(adapt_delta = 0.991, max_treedepth = 11)) -> fit.hs.norm

I’m not sure what to do with adapt_delta and max_treedepth since any given set of values will alternate between not giving any issues and doing so. But regardless of anything I’ve tried so far, traceplots and coefficient posteriors aren’t looking good.

Is this just a case of not having enough information for proper estimation or are there any further adjustments that could be made?

I am experiencing my own set of issues with horseshoes, but could you maybe expand on the specific problems you are observing in the traces? Does the trace of each individual chain look alright? Or how does it misbehave?

Hm, you say that you are using horseshoes, but you don’t have a single (half-) Cauchy distribution in your model, so that seems a bit off? You can get the Cauchy as the square root of a ratio of Gamma variables, but you only seem to have one inverse gamma in your model.

There is also currently no prior on logsigma, but that might be because you want it to have an improper uniform prior. A proper prior might help with robustness.

30 observations for 20 non-zero coefficients is a really small number, and it’s likely that the posterior is really difficult. Using regularized horseshoe prior Sparsity information and regularization in the horseshoe and other shrinkage priors might help, but it’s likely to still be a difficult problem.



Hm, you say that you are using horseshoes

I’ve used the specification as laid out in Juho’s paper. According to it, splitting the Cauchy (or t) into normals and gamma-invs gives better results.

There is also currently no prior on logsigma

As I understand it, no prior on logsigma is equivalent to uniform probability over the reals, which I take to mean that P(\log \sigma < 0) = 0.5 = P(\sigma < 1), implying a weakly informative prior. In any case, I just tightened its prior quite a bit more by using \sigma \sim \Gamma(0.1,0.1) and the results haven’t improved much (beta0 should be around 2, beta1:10 around 10):


My real world problem has 30 observations and about 300 variables with what is likely to be a gamma distribution. I assume that if things aren’t working for this simpler scenario then I’ll just need to recommend my colleagues they get more data (although opposed to my simulation, I expect correlations between variables which I suppose will help).

I guess my real question is, is there a principled approach to realizing that no analysis-side fiddling will improve the situation? Stan keeps telling me to raise adapt_delta and max_treedepth, but I assume that’s already a bad symptom.

If I can be confident that the results here are due to low information and not model implementation issues, then I wouldn’t expect other alternatives (e.g. Elastic Net) to produce useful results either, which is an important argument to make to my colleagues who are trying other non-Bayesian approaches in parallel.

The traces look alright now? But yes, it seems very much stuck in the prior. You can try to change the global scale a bit to loosen the sparsity constraint - the current setting seems to be too aggressive.
For your “real question”, @anon75146577 was circulating this paper the other day, which I believe offers a procedure where you can verify (at some computational cost) that you are sampling from the appropriate posterior.


You can’t learn more effective number of parameters than the number of observations and it’s usually very difficult to learn more than half, and with high noise or binary target even less. So in you case I would assume you can learn max 15 parameters, but you had more than 15 independent parameters so even if the HS is able to turn off 180 variables, you still don’t have enough information. I think you should recommend obtaining more data. You can try what happens if you would simulate more data. To understand different limits you can also try with just one non-zero coefficient or with less irrelevant ones. You mentioned that real data might have correlations, but plain HS is not modeling that.


Incidentally, how many parameters do you think you could handle in a sparse-factor sparse-loading factor model with ordinal likelihood? Any additional prior assumptions you can employ to handle more parameters?

Depends on the number of observations and the effective number of parameters. There is no easy answer, but you could do prior predictive simulations like in Sparsity information and regularization in the horseshoe and other shrinkage priors to estimate the effective complexity of your model and you can’t learn a model that is more complex than the number of observations and less depending on the likelihood. I don’t have experience with ordinal likelihood, but you can try how different simulated data affects the difference between prior and posterior differ or look at the difference between lpd and elpd_loo (aka p_loo). I’m very interested to learn the results of that kind of experiments.

Back with some general questions on methods.

After noticing the highly correlated nature of my variables, I went to caret to quickly test a few other methods. Partial Least Squares gave the best results and the theory behind it also seems to fit the problem quite well.

However, over the weekend my colleague decided to do good old p-value based forward stepwise selection and ended up with a 7 variable model that outperforms anything I’ve tried so far with half the RMSE.

Given that cross-validation shows uniformly good predictive performance and fairly stable model coefficients, I’m left scratching my head as it seems a simple brute force approach has worked better than other more principled methods (their plain linear regression also worked better than any transformation or distribution change that attempted to take into account the data being non-negative). This also in spite of the selected variables showing high (>0.95) correlation among themselves.

So, given that due diligence seems to have been conducted in order to ensure the stepwise-selected model is robust, should we call it a day and move forward? Perhaps there is something else that should be done or some other method I’ve yet to try? (their adjusted R2 is already nearing 0.9 so it doesn’t even seem there is much room for improvement)

RMSE with respect to an independent test data or the true function?

There’s a few more details in the CV post by this user:

Very interested to see how this plays out since there’s so many red flags for everything I’ve learned about out of sample prediction: very high R2 on a n << p dataset fit with p-value driven model selection. Bonus points for stable estimates on cross-validation with outrageously correlated variables. I would be really interested this model performs in a simulated scenario where you know the true coefficients or vs a hold out data set (assuming cv was loo from the sounds of your CV post?)

CV is unbiased only for one model. As soon as you do model selection, it is not unbiased and selection process can severely overfit. For example, our paper has several variable examples showing both CV estimate and independent test data estimate.

This is impossible to be sustained. I doubt it was rigorously validated. Stepwise regression is most often a disaster.

On the big picture regarding effective sample size, you can quickly demonstrate using frequentist model fits that when the sample size is not huge the list of predictors selected by lasso or horseshoe-type approaches is extremely volatile. You might achieve decent predictive accuracy overall (but typically not as accurate as methods that don’t discard predictors), but the interpretation of the resulting smaller model is very suspect because of the attempt to “name names” of important predictors. In my RMS course I use the analogy of Maxwell’s demon when discussing this. The volatility problems that are so easy to demonstrate with quick frequentist solutions will carry over to Bayes unless you bring information that was not available to the frequentist model.


Is the idea here that once you’re fitting multiple models with CV as criteria for selection you’re essentially fitting your model to the CV (and overfitting it)?