# Regularized Horseshoe prior posterior credible intervals

I am trying to apply regularized horseshoe prior to a simulation setting with a binary outcome.
I am using the R stan code for logistic regression from Piironen &Vehtari (2017) paper.
The posterior credible interval always includes zero whatever the simulation setting is. The first five "betas " are high so they should be realized. Later on, my plan is to apply to gene-expression data with at least 1000 genes (covariates), The paper mentions that the method works for p>n cases. I couldn’t understand how the “betas” are chosen to be significant. Kindly help.
My rcode follows:
################ simulated data#########################
beta<-c(rep(2,5),rep(0,95))

n<-20
p<-100

x1<-matrix(rnorm(n*p),nrow=n,ncol=p,byrow=F)
dim(x1)

cor=0.3; #%correlation

for(i in 2:5)
{
x1[,i]<-cor%%x1[,1]+(1-cor)%%x1[,i]
}

l=(x1%%beta+0.3rnorm(n,0, 1));

prob=exp(l)/(1 + exp(l));
prob[which(prob>=0.5)] =1;
prob[which(prob<0.5)]=0;
y1=prob;
y1
data<-cbind(y1,x1)
dim(data)
colnames(data)<-c(“y”,paste0(‘x’,1:((dim(data)[2]-1))))
data[1:10,1:10]
dim(data)
#####################################################
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
Sys.setenv(LOCAL_CPPFLAGS = ‘-march=native’)
path<-path
sigma<-1; p0<-4; p<-p; n<-n;
scale_global<- (p0/(p-p0))*sigma/sqrt(n)

fit<-stan(file=paste(path,"/sta.stan",sep=""),
data=list(n=n,d=p,y=data[,1],x=data[,-1],
scale_icept=10,scale_global=scale_global,nu_global=1,nu_local=1,
warmup=1000,iter=5000,chains=4,
##Here, nu_global=1, and nu_local=1, corresponds to horseshoe prior for both global and local ##parameters.
######################################################################
############posterior mean and credible interval###########################
post_beta <-extract(fit)\$beta
m_beta<-apply(post_beta,2,mean)
ci<-t(apply(post_beta,2,quantile,probs = c(0.05,0.5,0.95)))
ci
#############################################################

Hi AB25,

It seems you are simulating correlating covariates. In that case the posterior credible intervals can includes zero for non-zero betas. See, e.g.

1 Like

This is a data where there is no correlation.
sim<-function(p=p,n=n,s=s,m1=m1,s1=s1,m2=m2,s2=s2){
x<-matrix(rnorm(n*p),nrow=n,ncol=p) # data matrix
gama<-rbinom(p,1,s) # gama telling which genes are significant
d1<-rnorm(p,m1,s1) # two distributions for betas
d2<-rnorm(p,m2,s2)

beta<-d1*gama+(1-gama)*d2 # 20% significant genes

ind<-which(gama==1)

pr<-exp(x%%beta)/(1+exp(x%%beta))
y<-rbinom(n,1,pr)

data<-cbind(y,x)
nn<-ncol(x)
colnames(data) <- c(“Y”,paste0(‘X’,1:(dim(x)[2])))
return(list(ind=ind,data=data,beta=beta))
}
##########Simulation truth#################################################################################
p<-100
n=400
s<-0.2
m1<-100
s1<-1
m2<-0
s2<-0.3

set.seed(05042019)
data_m<-sim(p=p,n=n,s=s,m1=m1,s1=s1,m2=m2,s2=s2)
Howeve, it is the same case.

With n=20 << p=100, it is likely that the posterior is multimodal with narrow spike centered at zero. Here’s an example figure how your marginals probably look like

mean and median are away from zero, but there is also 20% of draws larger than 0. This just due to data being weakly informative in n<<p case.

Note that you need to use lot of bins in your histogram to see how narrow the spike is (and kernel density estimates are likely to smooth a lot)

Can you guide me on a simulation setting for which this method will work?

What do you mean with “this method” and “will work”? The links I provided are pointing to simulations in which regularized horseshoe and posterior inference works.