Including prior information about regression coefficients being non-zero

I’m working on a problem that is basically a logistic regression where most of the \beta coefficients are expected to be exactly zero and the rest non-zero, and I have some additional data that may act as prior information about which of the \beta are non-zero.


In an ideal world, the model I would like to fit is:

y \sim \textrm{bernoulli_logit}(\beta_{0} + X(\beta\,.*\,z))
z_{j} \sim \textrm{bernoulli}(\pi_{j})
\pi_{j} = \textrm{inv_logit}(\gamma_{0} + \gamma_{1}x_{1j} \dots)

where X is an n \times p matrix of covariates (n > p),
\beta is the p-dimensional vector of coefficients and
z is a p-dimensional vector of binary parameters.


If I understand it correctly, this is identical to putting a spike and slab prior on \beta.
Unfortunately p is about 70, so this would require me to marginalize over 2^{70} combinations of z.
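Writing the marginalization out explicitly (just to show where the 2^{p} comes from), the likelihood with z summed out would be

p(y \mid \beta, \pi) = \sum_{z \in \{0,1\}^{p}} \left[ \prod_{j=1}^{p} \pi_{j}^{z_{j}} (1 - \pi_{j})^{1 - z_{j}} \right] \prod_{i=1}^{n} \textrm{bernoulli_logit}(y_{i} \mid \beta_{0} + X_{i}(\beta\,.*\,z))

which sums over all 2^{p} = 2^{70} configurations of z.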

Right now, I see two possibilities to solve this issue:

  1. Make the marginalization more efficient. My gut tells me this should be possible because X is relatively sparse: its entries are all 0 or 1 and each row of X contains exactly six 1’s, which means that only a handful of the \beta contribute to the log-likelihood of any given observation. I just don’t know how to exploit that fact (if it is possible at all).

  2. Use a continuous variant of the spike-and-slab prior instead. Normally I would just use a regularized horseshoe prior, but I don’t know how to include the data on the inclusion probabilities there: the usual parameterization is in terms of the total number of non-zero coefficients (see the formula right after this list), whereas here I need to be able to tell the model that some coefficients are more likely to be non-zero than others.

  3. ?
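For reference, the usual choice I mean (if I read the regularized horseshoe paper correctly) sets a single global scale from a guess p_{0} of the total number of non-zero coefficients,

\tau_{0} = \frac{p_{0}}{p - p_{0}} \frac{\sigma}{\sqrt{n}}

which is one number shared by all coefficients, so it cannot express that some coefficients are more likely to be non-zero than others.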

I would be happy for any input. Either of the above options would be fine (or even some entirely different approach); I just need some direction on what to focus on.

Best regards,
Daniel


Let’s start with 3.

3. You could use rebar, see https://github.com/howardnewyork/rebar, and specify
alpha ~ dirichlet(priors_alpha);
with priors_alpha = prior_weights ./ sum(prior_weights) * scale;
where prior_weights > 0 is your vector of prior inclusion weights and scale > 0.
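To make the idea concrete, a minimal Stan sketch of just that Dirichlet part (this is not rebar’s own code; prior_weights and scale are placeholder names for your prior information, and rebar’s actual interface may differ):

data {
  int<lower=1> p;
  vector<lower=0>[p] prior_weights;  // external information: larger = more likely non-zero
  real<lower=0> scale;               // total concentration of the Dirichlet prior
}
transformed data {
  // normalize the prior weights so the concentrations sum to scale
  vector<lower=0>[p] priors_alpha = prior_weights / sum(prior_weights) * scale;
}
parameters {
  simplex[p] alpha;                  // normalized weights over the p coefficients
}
model {
  alpha ~ dirichlet(priors_alpha);
}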

2. Regularized horseshoe, https://projecteuclid.org/euclid.ejs/1513306866: in the C.2 alternative parameterization (appendix), you could adjust tau.

tau = aux1_global * sqrt(aux2_global) * scale_global * sigma;
scale_global = par_ratio * sigma / sqrt(N);

Change par_ratio to par_ratio_j, and tau would become tau_j,
with par_ratio_j your prior belief, e.g. sum(par_ratio_j) = 1.
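A small self-contained Stan sketch of just the pieces that would change (names follow that appendix; making par_ratio a vector and letting sigma enter once, through scale_global_j, is my reading of the suggestion, so treat it as a sketch rather than exact code):

data {
  int<lower=1> N;                      // number of observations
  int<lower=1> p;                      // number of coefficients
  vector<lower=0>[p] par_ratio_j;      // per-coefficient prior belief, e.g. summing to 1
  real<lower=1> nu_global;             // df of the half-t prior on tau
  real<lower=0> sigma;                 // (pseudo-)sigma, e.g. 2 for the logit link
}
transformed data {
  // per-coefficient version of scale_global
  vector<lower=0>[p] scale_global_j = par_ratio_j * sigma / sqrt(N);
}
parameters {
  real<lower=0> aux1_global;
  real<lower=0> aux2_global;
}
transformed parameters {
  // a vector of global scales instead of a single tau
  vector<lower=0>[p] tau_j = aux1_global * sqrt(aux2_global) * scale_global_j;
}
model {
  // together these imply tau_j ~ half-t(nu_global, 0, scale_global_j)
  aux1_global ~ normal(0, 1);
  aux2_global ~ inv_gamma(0.5 * nu_global, 0.5 * nu_global);
}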
Maybe @avehtari can help here to verify whether this sounds good.


What is x_{1j}? Are they part of X or separate?

Based on the equations, you don’t actually have a prior on the inclusion probabilities directly, but rather covariates x which you think might help to figure out which probabilities are similar.

Put the prior on the \lambda (or on \log(\lambda)), which are the local scale parameters. Using a regression model there, you would be modelling which local scale parameters are similar. A related idea has been used to put a prior on the lambdas, connecting the lambdas for main effects and interaction effects.


Thank you both!
Sorry I’m bad at naming things. The x_{1j} are separate. They are additional data that are informative about the inclusion probabilities, which means that if I observe large values of x_{1j}, I expect it to be more likely that z_{j} = 1.

So something like \log(\lambda_{j}) = \gamma_{0} + \gamma_{1}x_{1j} \dots ?


That may be too strict; instead you would need to modify the 1 in the prior for lambda,

lambda ~ student_t(nu_local, 0, 1);
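A rough sketch of how that could look in a full logistic regression model, based on the regularized horseshoe code in the paper’s appendix. The regression on the local scales (x1, gamma0, gamma1, the exp link) and the weakly informative priors on the gammas are assumptions of mine, so adjust them to your problem:

data {
  int<lower=1> n;
  int<lower=1> p;
  int<lower=0, upper=1> y[n];
  matrix[n, p] X;
  vector[p] x1;                    // extra data informing the inclusion of each coefficient
  real<lower=0> scale_icept;       // prior scale for the intercept
  real<lower=0> scale_global;      // scale for the half-t prior on tau
  real<lower=1> nu_global;         // df of the half-t prior on tau
  real<lower=1> nu_local;          // df of the half-t priors on the lambdas
  real<lower=0> slab_scale;        // slab scale of the regularized horseshoe
  real<lower=0> slab_df;           // slab df of the regularized horseshoe
}
parameters {
  real beta0;
  vector[p] z;
  real<lower=0> aux1_global;
  real<lower=0> aux2_global;
  vector<lower=0>[p] aux1_local;
  vector<lower=0>[p] aux2_local;
  real<lower=0> caux;
  real gamma0;                     // regression for the local scales (assumed)
  real gamma1;
}
transformed parameters {
  real<lower=0> tau = aux1_global * sqrt(aux2_global) * scale_global;
  vector<lower=0>[p] s = exp(gamma0 + gamma1 * x1);   // per-coefficient scale replacing the 1
  vector<lower=0>[p] lambda = aux1_local .* sqrt(aux2_local) .* s;
  real<lower=0> c = slab_scale * sqrt(caux);
  vector<lower=0>[p] lambda_tilde
    = sqrt(c^2 * square(lambda) ./ (c^2 + tau^2 * square(lambda)));
  vector[p] beta = z .* lambda_tilde * tau;
}
model {
  // half-t priors for the lambdas (with per-coefficient scale s) and for tau
  z ~ normal(0, 1);
  aux1_local ~ normal(0, 1);
  aux2_local ~ inv_gamma(0.5 * nu_local, 0.5 * nu_local);
  aux1_global ~ normal(0, 1);
  aux2_global ~ inv_gamma(0.5 * nu_global, 0.5 * nu_global);
  caux ~ inv_gamma(0.5 * slab_df, 0.5 * slab_df);
  gamma0 ~ normal(0, 1);           // weakly informative, adjust as needed
  gamma1 ~ normal(0, 1);
  beta0 ~ normal(0, scale_icept);
  y ~ bernoulli_logit(beta0 + X * beta);
}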

Thank you, this really helped! I’ll try it out and see how it goes.


Just a small follow-up question:

In the original horseshoe paper the shrinkage coefficient is given as \kappa_{j} = \frac{1}{1 + \tau^2\lambda_{j}^2}, but in https://arxiv.org/pdf/1610.05559.pdf (@avehtari) the shrinkage factor for the horseshoe is supposed to be \kappa_{j} = \frac{1}{1 + n\sigma^{-2}\tau^2s_{j}^2\lambda_{j}^2}.

I know that the Carvalho paper uses \tau=\sigma^2=1 and unit variance for the predictors (probably for easier notation), but why is the dependence on n missing?

You would have to ask the authors of that paper, and I don’t think they read the Stan discourse. They once told us that \sigma is missing by accident in their first paper, but I can’t easily find that correspondence.


Thank you, that already helps! I was just curious.