Yup. I’ve implemented a similar thing for a few predictors by marginalising over all combinations of predictors just to see what would come up. I’d be willing to revive/improve that code if there was interest in comparing horseshoe to spike-slab for moderate dimensions (it’s not a fair comparison but it’s all we can do, really).
I will try this! Right now I’m setting this value to M0/((M-M0)*sqrt(N)), which is multiplied by sigma in the code. This results in a global scale of 0.01, which I think is already pretty small. But I will try something even smaller, like 0.005.
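For concreteness, here is a minimal sketch of that computation (only the relevant blocks are shown; the names scale_global and M0 are just the ones I use here, and the rest of the horseshoe model is omitted):

data {
  int<lower=1> N;        // number of observations
  int<lower=1> M;        // number of candidate predictors
  real<lower=0> M0;      // prior guess of the number of non-zero coefficients
}
transformed data {
  // global scale; it gets multiplied by sigma later, in the prior for the global shrinkage parameter
  real<lower=0> scale_global = M0 / ((M - M0) * sqrt(N));
}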
The issue with the weird results only occurs after setting the non-zero coefficients to 0.1. Even with 0.2 there were no problems and all non-zero coefficients were identified well. So I’m not sure whether the horseshoe is really the wrong model. I think (or rather: I hope) it’s only misspecified in my example with respect to the small effect sizes we are expecting :)
I’m implementing the spike-and-slab as follows. (In the example above I was using a 1xM matrix for delta instead of a vector because the real model I run is much more complex, originally with more than one predictor matrix. If it helps, I can also share the original code for the complete model, but it’s not very well commented.)
I think one “uniqueness” is using the t distribution for the slab instead of a normal. To be honest, using something like normal(0, 0.5) didn’t give results as good as the example above.
EDIT: I tried to build a simple model with the same true values as above, and indeed I could not replicate my own results from my previous post. So maybe the more complex model I’ve run is a special case where the spike-and-slab behaves well? I don’t know. I will put the code for the complex model in a new post.
EDIT 2: I will post some further experiments on the spike-and-slab tomorrow. I’m pretty confused now. Sometimes it runs perfectly. Sometimes it doesn’t run at all.
code <- "data {
...
int <lower=1> M; // number of moderator variables (we use the variable selection for the second stage effects/interaction effects only)
...
}
parameters {
//coefficients
...
vector[M] delta; //effects of moderator on stock variable effect
...
real <lower=0, upper=1> theta;
}
model {
...
theta ~ beta(4, 4);
for(m in 1:M) {
target += log_mix(theta, normal_lpdf(delta[m] | 0, 0.01), student_t_lpdf(delta[m] | 4, 0, 0.25));
}}
...
}"
I will try this too! Do you mean completely deleting all the zero predictors from the data simulation?
That is not the usual spike-and-slab model. Here you have just one theta, so all covariates get the same weight in the spike or the slab. For spike-and-slab you should have an indicator vector of length M. The difficulty is that each element of that vector is a discrete binary parameter (which Stan doesn’t support), and you need to integrate over all possible combinations (which you could in theory do by summing, but due to the combinatorial explosion it’s infeasible for more than roughly M = 10). For just one indicator variable you can integrate over the discrete space to get the probability of the slab, which is theta in your model, but that does not then have the desired effect that different variables get different amounts of shrinkage.
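To make the combinatorial issue concrete, here is a minimal, untested sketch of the brute-force summation for small M. It uses a point-mass spike (an excluded coefficient is simply set to zero, in the Kuo–Mallick style) rather than the narrow normal spike above, and the half-normal prior on sigma and names like K and lp are just illustrative choices for the sketch:

data {
  int<lower=1> N;
  int<lower=1> M;                    // keep M small: the loop below has 2^M terms
  matrix[N, M] X;
  vector[N] y;
}
transformed data {
  int K = 1;
  for (m in 1:M) K *= 2;             // K = 2^M inclusion patterns
}
parameters {
  vector[M] delta;                   // slab coefficients (zeroed out when excluded)
  real<lower=0> sigma;
  real<lower=0, upper=1> theta;      // prior inclusion probability
}
model {
  vector[K] lp;
  delta ~ student_t(4, 0, 0.25);     // slab, as in the code above
  sigma ~ normal(0, 1);              // assumed half-normal prior, just for the sketch
  theta ~ beta(4, 4);
  for (k in 1:K) {                   // loop over every inclusion pattern
    vector[M] d = delta;
    int r = k - 1;                   // the bits of r encode the pattern
    real lpk = 0;
    for (m in 1:M) {
      if (r % 2 == 0) {
        d[m] = 0;                    // covariate m excluded in this pattern
        lpk += log1m(theta);
      } else {
        lpk += log(theta);           // covariate m included
      }
      r = r %/% 2;                   // move to the next bit (integer division)
    }
    lp[k] = lpk + normal_lpdf(y | X * d, sigma);  // no intercept, for brevity
  }
  target += log_sum_exp(lp);         // marginalise over all 2^M patterns
}

The outer loop has 2^M terms and each term evaluates the full likelihood, so the cost explodes very quickly with M.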
Yes. This will give you the edge case: if you had an oracle telling you which variables are non-zero, how well could you estimate the non-zero coefficients?
Sorry for my late reply. I found out that there was a bug in my code; that’s why my experiments didn’t work out and I couldn’t replicate my example from above.
Thank you very much for the explanation! To be honest, I thought implementing the spike-and-slab prior would be nothing more than applying the logic behind fitting a finite mixture model (with known distribution parameters) to the coefficient vector as a prior instead of to the outcome vector as a likelihood. Actually, I just adapted what it says in the Stan manual (5.4 Vectorizing Mixtures):
“A proper mixture at the observation level is defined as follows, where we assume that lambda, y[n], mu[1], mu[2], and sigma[1], sigma[2] are all scalars and lambda is between 0 and 1.
for (n in 1:N) {
  target += log_sum_exp(log(lambda)
                          + normal_lpdf(y[n] | mu[1], sigma[1]),
                        log1m(lambda)
                          + normal_lpdf(y[n] | mu[2], sigma[2]));
}

or equivalently

for (n in 1:N)
  target += log_mix(lambda,
                    normal_lpdf(y[n] | mu[1], sigma[1]),
                    normal_lpdf(y[n] | mu[2], sigma[2]));
This definition assumes that each observation y_n may have arisen from either of the mixture components.”
I thought transferring this from the likelihood to the prior would be right, since we are assuming that each coefficient \delta_m may have arisen from either the spike or the slab. The theta (or lambda in the Stan manual) then reflects the mixture proportion, i.e. the proportion of coefficients (or observations) within each mixture component. Or am I totally wrong?
Here are the experiments I’ve run for my implementations of the spike-and-slab and horseshoe priors. It’s just a simple regression of y on a lot of predictors. Further, I’ve added a model using the normal prior, both including and excluding the zero coefficients. I put everything into a markdown file since it’s way too much code for a post here.
There is just one lambda for all y. If you make lambda favor the spike, then all y are assumed to come from the spike. If you make lambda favor the slab, then all y are assumed to come from the slab. If lambda = 0.5, then each y has a probability of 0.5 of coming from the spike or the slab.

It is technically valid, but not what you want. There is just one lambda for all delta. If you make lambda favor the spike, then all delta are assumed to come from the spike. If you make lambda favor the slab, then all delta are assumed to come from the slab. If lambda = 0.5, then each delta has a probability of 0.5 of coming from the spike or the slab, but you are not learning which deltas are more likely to come from the spike than from the slab.
Maybe @maxbiostat you could post the code just to illustrate what would be needed for spike-and-slab and why it’s difficult to scale it.
Right, I am going to post a silly little program I wrote a little while ago. Running this actually involves quite a bit of R code, which I can post here or chuck onto GitHub if there is interest.
Never mind the tau parameters in the Stan program; those are just to keep things roughly compatible with George & McCulloch (1993).
The Stan program involves pre-computing all binary sequences of size P and then using log_sum_exp to marginalise over them. If @avehtari would be interested in testing this against the horseshoe for moderate dimensions, maybe we could concoct something using map_rect. I know there’s no real competition, but I naively think it’d be nice to have a little ceteris-paribus comparison in Stan. Maybe such a comparison already exists, implemented in another language such as Nimble?
Cool. I’ve put the code on GitHub. Please heed the warnings I’ve scattered through the repo; it’s been a while since I wrote this, and even back then it wasn’t a serious exercise. So the model might be wrong, or at least incompatible with how BSSVS is done in practice.
As I said, if there’s (scientific) interest, I’m willing to devote a bit more attention to this as it’s quite close to some of my own research interests.
I believe that your spike-and-slab code isn’t implemented correctly as is. If you go to the top right of page 882 of http://www.snn.ru.nl/~bertk/machinelearning/2290777.pdf (George and McCulloch 1993), they describe the induced prior on the regression coefficients as a mixture of multivariate normals. What you’ve implemented is a mixture of regression models (actually, I’m not exactly sure what the right description is here). I think in order to implement it correctly you need to place the mixture at the level of the regression coefficients. So if there are p covariates and M possible sparsity patterns, you need M*p parameters representing the regression coefficients for each sparsity pattern, then marginalize over the sparsity patterns to induce a prior on the p regression coefficients.
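A rough, untested sketch of that structure (purely illustrative; the inclusion patterns and their prior weights w are assumed to be precomputed, e.g. in R, and passed in as data, and the standard-normal slab is just a placeholder):

data {
  int<lower=1> N;
  int<lower=1> P;                  // number of covariates
  int<lower=1> K;                  // number of sparsity patterns (up to 2^P)
  matrix[N, P] X;
  vector[N] y;
  array[K] vector[P] patterns;     // 0/1 masks defining which coefficients are free
  simplex[K] w;                    // prior weights over the patterns
}
parameters {
  array[K] vector[P] beta;         // a separate coefficient vector for every pattern
  real<lower=0> sigma;
}
model {
  vector[K] lp;
  sigma ~ normal(0, 1);            // assumed half-normal prior, just for the sketch
  for (k in 1:K) {
    beta[k] ~ normal(0, 1);        // slab prior on each pattern's coefficients
    lp[k] = log(w[k])
            + normal_lpdf(y | X * (beta[k] .* patterns[k]), sigma);
  }
  target += log_sum_exp(lp);       // marginalise over the sparsity patterns
}

With all 2^P patterns included, K (and the K*P block of coefficients) grows exponentially, which is the same scaling problem discussed above.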