I’m working on a problem that is basically a logistic regression in which many of the \beta coefficients are exactly 0 and the rest are non-zero, and I have some additional data that may act as prior information about which of the \beta are non-zero.

In an ideal world, the model I would like to fit is:

y_i \sim \mathrm{Bernoulli}\left(\mathrm{logit}^{-1}\left(\sum_j x_{ij} z_j \beta_j\right)\right), \quad z_j \sim \mathrm{Bernoulli}(\pi_j),

where X is an n \times p matrix of covariates (n > p), \beta is the p-dimensional vector of coefficients and z is a p-dimensional vector of binary inclusion parameters.

If I understand it correctly, this is identical to putting a spike and slab prior on \beta.
Unfortunately p is about 70, so this would require me to marginalize over 2^{70} combinations of z.
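For small p the exact marginalization is easy to write down, which also shows why p \approx 70 is hopeless. A toy sketch (made-up data, uniform prior on z and a fixed \beta, just to illustrate the cost):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 8                        # p small enough to enumerate all 2^p states
X = rng.integers(0, 2, size=(n, p)).astype(float)
beta = rng.normal(size=p)
y = rng.integers(0, 2, size=n)

def log_lik(eta, y):
    # Bernoulli-logit log-likelihood, written stably
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Marginalize over all 2^p inclusion vectors z (uniform prior: weight 2^{-p} each)
terms = [log_lik(X @ (beta * np.array(zb, dtype=float)), y) - p * np.log(2.0)
         for zb in itertools.product([0, 1], repeat=p)]
log_marginal = np.logaddexp.reduce(terms)

# 2^8 = 256 terms is instant; 2^70 is roughly 1.2e21 terms and out of reach.
```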

Right now, I see two possibilities to solve this issue:

1. Make the marginalization more efficient. My gut tells me this should be possible because X is relatively sparse: its entries are all 0 or 1 and each row contains exactly six 1’s, so only a handful of \beta's contribute to the log-likelihood of any given observation. I just don’t know how to exploit that fact (if that is possible at all).

2. Use a continuous variant of the spike and slab prior instead. Normally I would just use a regularized horseshoe prior, but I don’t know how to include the data on the inclusion probabilities there: the usual parameterization is in terms of the expected total number of non-zero coefficients, whereas here I need to be able to tell the model that some coefficients are more likely to be non-zero than others.

3. ?

I would be happy for any input. Either of the above options would be fine (or even some entirely different approach), I just need some direction on what to focus on.
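Regarding the first option: whatever happens with z, the row sparsity at least makes each likelihood evaluation cheap, since storing each row as its six column indices reduces X\beta to six additions per observation. A toy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 1000, 70, 6

# Each row of X contains exactly k = 6 ones, so a row is fully
# described by the k column indices it touches.
idx = np.stack([rng.choice(p, size=k, replace=False) for _ in range(n)])

beta = rng.normal(size=p)
z = rng.integers(0, 2, size=p)      # binary inclusion vector
y = rng.integers(0, 2, size=n)      # binary outcomes

def log_lik(beta, z, idx, y):
    eta = (beta * z)[idx].sum(axis=1)                # k additions per row
    return np.sum(y * eta - np.logaddexp(0.0, eta))  # Bernoulli-logit

# Matches the dense computation X @ (beta * z):
X = np.zeros((n, p))
X[np.arange(n)[:, None], idx] = 1.0
assert np.allclose(X @ (beta * z), (beta * z)[idx].sum(axis=1))
```

This only cuts the per-evaluation cost, though; because z is shared across all observations, the 2^p sum still does not factor over rows.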

Let’s start with 3.

3. You could use rebar, see https://github.com/howardnewyork/rebar, and specify

alpha ~ dirichlet(priors_alpha);

with priors_alpha = alpha ./ sum(alpha) * scale; and alpha > 0, scale > 0.

2. For the regularized horseshoe (https://projecteuclid.org/euclid.ejs/1513306866), see the alternative parameterization in Appendix C.2; you could adjust tau:

tau = aux1_global * sqrt(aux2_global) * scale_global * sigma;
scale_global = par_ratio * sigma / sqrt(N);

Change par_ratio to par_ratio_j, so that tau becomes tau_j, with par_ratio_j encoding your prior belief, e.g. sum(par_ratio_j) = 1.

Maybe @avehtari can help here to verify whether this sounds good.
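If I read the suggestion right, a numerical sketch of the per-coefficient tau_j (the par_ratio_j values and the auxiliary draws are invented for illustration):

```python
import numpy as np

# Hypothetical prior inclusion weights for p = 4 coefficients; a larger
# par_ratio_j encodes a stronger belief that beta_j is non-zero.
par_ratio = np.array([0.05, 0.05, 0.30, 0.60])   # sums to 1
N, sigma = 500, 1.0

# scale_global becomes a vector, one entry per coefficient
scale_global = par_ratio * sigma / np.sqrt(N)

# Illustrative draws for the auxiliary global variables of the C.2
# parameterization (distributions simplified for the sketch)
rng = np.random.default_rng(1)
aux1_global = abs(rng.standard_normal())
aux2_global = 1.0 / rng.gamma(0.5, 2.0)

tau = aux1_global * np.sqrt(aux2_global) * scale_global * sigma  # tau_j
# Coefficients with larger par_ratio_j get a larger tau_j, i.e. less shrinkage.
```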

Based on the equations, you don’t actually have a prior on the inclusion probabilities directly; rather, you have data x which you think might help figure out which probabilities are similar.

Put the prior on \lambda (or \log(\lambda)), which are the local scale parameters. With a regression model you would then model which local scale parameters are similar. A related idea has been used to put a prior on the \lambda's, connecting the \lambda's for main effects and interaction effects.

Thank you both!
Sorry, I’m bad at naming things. The x_{1j} are separate: they are additional data that are informative about the inclusion probabilities, meaning that if I observe a large value of x_{1j}, I expect it to be more likely that z_{j} = 1.

So something like \log(\lambda_{j}) = \gamma_{0} + \gamma_{1}x_{1j} + \dots?
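To make that concrete, a toy numerical sketch of what I have in mind, plugging the regression-modelled \lambda_j into the shrinkage factor \kappa_{j} = \frac{1}{1 + \tau^2\lambda_{j}^2} (all values invented):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 70
x1 = rng.normal(size=p)              # auxiliary data, one value per beta_j

gamma0, gamma1 = -1.0, 2.0           # hypothetical regression coefficients
lam = np.exp(gamma0 + gamma1 * x1)   # local scales: large x1 -> large lambda_j

tau = 0.1                            # global scale
kappa = 1.0 / (1.0 + tau**2 * lam**2)   # kappa near 1 => shrunk to zero

# Coefficients with large x1 get small kappa, i.e. they escape the shrinkage.
```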

In the original horseshoe paper the shrinkage coefficient is given as \kappa_{j} = \frac{1}{1 + \tau^2\lambda_{j}^2}, but in https://arxiv.org/pdf/1610.05559.pdf (@avehtari) the shrinkage factor for the horseshoe is supposed to be \kappa_{j} = \frac{1}{1 + n\sigma^{-2}\tau^2s_{j}^2\lambda_{j}^2}.

I know that the Carvalho paper uses \tau=\sigma^2=1 and unit variance for the predictors (probably for easier notation), but why is the dependence on n missing?
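Numerically, the two expressions differ exactly by the factor n\sigma^{-2}s_{j}^2, which matters a lot even with \sigma = s_{j} = 1. A quick check with made-up values:

```python
import numpy as np

tau = 0.5
lam = np.array([0.1, 1.0, 10.0])    # made-up local scales
n, sigma, s = 100, 1.0, 1.0         # sample size, noise scale, predictor scale

kappa_carvalho = 1.0 / (1.0 + tau**2 * lam**2)
kappa_piironen = 1.0 / (1.0 + n / sigma**2 * tau**2 * s**2 * lam**2)

# With sigma = s = 1 the two agree only when n = 1; for n = 100 the
# second gives much smaller kappa (i.e. much less shrinkage).
assert np.all(kappa_piironen < kappa_carvalho)
```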

You would have to ask the authors of that paper, and I don’t think they read the Stan Discourse. They once told us that \sigma was missing by accident in their first paper, but I can’t easily find that correspondence.