Why does horseshoe prior for a linear model need a non-linear formula

Hello experts!
I am new to brms and I have the following research problem. I am trying to predict lack of motivation (sns_theta) with a childhood trauma questionnaire that has 10 subscales (PVA, NVEA, PPhysA, etc) that are somewhat correlated (rs = 0.1-0.3). I also have some nuisance variables such as age and gender. I’d like to know which of the subscales are most predictive of the behavior. I thought of giving them a horseshoe prior to shrink those that are less predictive. I am wondering if this looks right and I am mainly confused as to why I need to make the formula a non-linear one like so:


formula <- bf(sns_theta ~ a + b, nl=TRUE) +
  lf(a ~ PVA + NVEA + PPhysA + SexA + WSV + WIPV + PeerVA + PeerPhysA+  EN + PN, center = TRUE) + 
  lf(b ~ 0 + age + gender+education, cmc = FALSE)

priors <- c(prior(horseshoe(1), class = "b", nlpar = "a"),
            prior(normal(0, 1), class = "b", nlpar = "b"))

horsehoe_fit <- brm(formula = formula, data = surveys, family = gaussian(),
                    prior = priors, iter = 3000, warmup = 1000, chains = 4, cores = 4,
                    control = list(adapt_delta = 0.999, max_treedepth = 15))

  • Operating System: MAC OS Big Sur 11.6
  • brms Version: 2.15.5

Hello, I’m not sure about this specification. What is it about the model system that is nonlinear? I don’t think there is any need to use a nonlinear model specification for the horseshoe prior, and it probably makes more sense in the linear model context. I was just using it in a meta-analysis which is definitely linear.

Are you sure that you need this path structure to model this system? Why not just just subscales and nuisance predictors all as linear contributors to ‘sns_theta’ in one formula, and then apply the horseshoe prior to only the required parameters?

Thank you AWoodward for your reply. What you suggest is exactly what I wanted to do in the first place. There is nothing about the system that is non-linear and that’s the confusion I have. I don’t want to model any non-linear effects here. That formula is based on an earlier reply here: prior on subset of predictors

When I try test case:


formula <- bf(sns_theta ~ 1 + PVA + NVEA + PPhysA + age)
priors <- c(prior(horseshoe(1), class = "b", coef = PVA),
            prior(horseshoe(1), class = "b", coef = NVEA),
            prior(horseshoe(1), class = "b", coef = PPhysA),
            prior(normal(0, 1), class = "b", coef = age))

# fit 
horsehoe_fit <- brm(formula = formula, data = surveys, family = gaussian(),
                    prior = priors, iter = 3000, warmup = 1000, chains = 4, cores = 4,
                    control = list(adapt_delta = 0.999, max_treedepth = 15))

I get the following message: Error: Prior ‘horseshoe(1)’, ‘horseshoe(1)’, ‘horseshoe(1)’ is used in an invalid context. See ?set_prior for details on how to use special priors.

So I am wondering that even if I am using the brms non-linear syntax it in fact is just a linear model since I have not specified any non-linear effects.

Thanks again for taking the time to reply.

Could you try it as:

I think maybe without the quotations the function horseshoe() is being called, which is not expected in prior().

Thanks again! Unfortunately adding the strings did not solve the issue. I still get the same error message. Even if I switch prior() to prior_string. However, just applying horseshoe to all the predictors does run without complaining. For example:

formula <- bf(sns_theta ~ 1 + PVA + NVEA + PPhysA + age)
priors <- c(prior(horseshoe(1), class = "b"))

But this does give me divergent transitions and suggest to add adapt_delta above .999. Which I am hesitant to do. Also I am not so sure I want to shrink the nuisance variables like age, sex, education etc. I’d much prefer to only shrink those subscales that I know are correlated.

Any other suggestions are welcome! Also I am maybe thinking I might have gotten it right the first time but I just want to be sure about the syntax.

The issue here is that you are trying to apply the horseshoe prior to just one variable at a time. This doesn’t make sense for the horseshoe’s aim. The horseshoe prior is used to select features that are relatively predictive. To do this, you specify a horseshoe shape such that predictors of sufficient effect escape being shrunk while those that are of insufficient effect are shrunk toward zero. The horseshoe prior thus applies to all coefficients of predictors at once. This is why the model runs fine when set to just class = "b" and not to specific coefficients.

This is a fairly well-known issue with the horseshoe prior. You can read more about the prior and its implementation in brms through the information about it on CRAN: horseshoe: Regularized horseshoe priors in 'brms' in brms: Bayesian Regression Models using 'Stan'. As you’ll see, the package author recommends alternative specifications to the horseshoe(1) prior, and he is helpful enough to cite some relevant methods papers on why and how to specify the horseshoe prior.

This doesn’t make sense to me. What is the issue with the correlation? If you have reason to believe that your predictors are highly correlated and may affect your posterior distributions, then you should probably try adding a QR decomposition: brm(bf(..., decomp = "QR")). The horseshoe prior won’t fix correlation, and my guess (though I’m no expert here at all) is that collinearity would actually cause the horseshoe prior to not work particularly well anyway. The correlations you listed in the initial post are not terribly large, so I don’t think collinearity is really that much of an issue; however, you can still formally test that.

This ultimately depends on the point of your model. Is the aim to have a high degree of predictive accuracy for lack of motivation (sns_theta)? If that’s the case, then nuisance variables aren’t really nuisance as they may be useful in predicting that outcome. In this case, feature selection is really important, and you can consider either the horseshoe prior or you could look at the projpred package (Projection Predictive Feature Selection • projpred). Both of these methods aim specifically to take a large number of predictors and find just those that are most “relevant” to predicting the outcome, but both utilize very different methods and represent different approaches to this task.

Your wording of the problem and conceptualization of this issue suggests to me that you’re more interested in inference about the predictors. In this case, I don’t think that any feature selection or dimension reduction is really necessary. The “non-linear” formula you started with is only necessary if you want to include the nuisance variables without any shrinkage while then applying the horseshoe prior to just those subscales that you’re interested in. Like I said before, I don’t think that you need this if the goal is just to conduct statistical inference. For example, if you are hypothesizing that a certain variable has an effect, then it doesn’t really matter if the horseshoe prior shrinks that effect or not. In fact, you can manipulate the horseshoe prior’s arguments to shrink effects of varying sizes, making this an inconsistent indicator of “significance.” Including reasonable priors for your predictors, using a QR decomposition, and then conducting full-posterior inference is superior should be sufficient unless your data don’t support that.

3 Likes

This is incredibly helpful. Thank you! Indeed the correlations are not particularly high. Prediction is not my main goal. I am trying to conduct statistical inference and perhaps the whole idea of shrinkage was misguided. Basic linear model should suffice here. I am planning to replicate these findings in another set so maybe that is why my thinking went towards prediction rather than inference.

1 Like

Hi, Can you share how to solve the error that horseshoe prior didn’t work for non-linear formula? I fit a non-linear model :

fit_brm_wiggle <- brm(Y ~ s(X) +(1 + X | group 1) + (1 + X | group 2),
                      data = dat,
                      family = exgaussian(),
                      chains = 3,
                      iter = 1000,
                      cores = 4,
                      control = list(adapt_delta=0.99),
                      prior=c(prior(horseshoe(1),class="b")),
                      seed = 1345
                      )

But there is a error which is same as yours:

Prior 'horseshoe(1)' is used in an invalid context. See ?set_prior for details on how to use special priors.

Hi JianPeng,
I am not 100% sure what I am looking at here, but is it that you only have one predictor and these hierarchical terms there? I think that you would want to consider the horseshoe in case you have many many predictors (perhaps more than N of observations) that may be highly correlated. What is it that you want to do?

Thanks! Indeed, as you might think, my model has only one predictor. After your suggestion, I think I misused the horseshoe prior here. I just want to find a prior for my data, and the posterior distribution is inconsistent with the data distribution when using flat prior. Could you give me some advise? Thanks again!
Here is my data distribution:
数据分布
posteriori distribution

The prior probably won’t help with your posterior predictive checks (PPC). The prior is on the coefficient of your predictor term, meaning that you are placing prior probability on whether that coefficient is -0.15 or 2.00 or whatever. This exerts relatively little influence (depending on your sample size and the type of prior you use) on the final model’s predictions as those predictions are ultimately the product of the (non-)linear equation you are fitting and the likelihood you’ve told the model your data come from (in this case the ex-Gaussian). You could either try alternative likelihoods, though the ex-Gauss here seems to capture the overall shape decently well, or you need to include additional predictors.

The fact that your model makes posterior predictions that are not perfectly aligned with your observed data may not be entirely unexpected when you are using just a single predictor as there are likely other variables that contribute to the data generation process than your model is incorporating in its prediction. Basically, your model can’t make up information to match your data – you have to tell it what it needs to know about how your data were generated.

Without knowing more about your data and modeling aims, it is hard to provide additional guidance. Depending on your data, you may consider some zero-inflated or hurdle models to account for that very high proportion of 0s in your dependent variable. Those models, however, are most appropriate if your outcome variable is discrete (i.e., takes integer values only).

-----EDIT-----
I see that you’re working on this problem on another post as well. From that post, I realize that I misread the histogram and posterior plot as the peak is not at 0 but at -1. The zero-inflated stuff does not apply in that case without translating your outcome one to the right (i.e., Y+1). Since your other post is specifically about your problem, I recommend continuing to engage there rather than on this thread for a different problem

Thanks very much! I was inspired by your suggestion.

Best regards!