Variable selection with the Bayesian Lasso on multivariate normal data

Hi,

I am a university student currently learning Bayesian statistics, and I am a first-time user here. I would like to ask for some support, please.

I intend to build a model for variable selection using the Bayesian Lasso approach on the publicly available wine quality data. The data is attached, and it is also available at this link: https://archive.ics.uci.edu/ml/datasets/wine+quality.

I am using Metropolis-Hastings sampling, implemented in base R. I am aware there are other methods, but this is what I would like to use for now.

My current aim is to get the chains to converge, which I monitor with the Gelman-Rubin statistic.
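
For reference, I check convergence with coda's gelman.diag() roughly as below (the two simulated chains here are only a placeholder for my real draws):

library(coda)
set.seed(1)
chain1 <- mcmc(rnorm(1000))
chain2 <- mcmc(rnorm(1000, mean = 0.05))
gelman.diag(mcmc.list(chain1, chain2))$psrf   # point estimate and upper CI of the shrink factor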

I used the following steps to try to reach convergence for all parameters:

1. I created the initial proposal variance-covariance matrix (cov.prop) with random values in all diagonal positions, and a scaling factor s = 1.

2. I ran the sampler three times to get 1000 samples, and checked the acceptance ratio in each cycle; I maintained the acceptance ratio by adjusting the scaling factor s.

3. After the fourth run, I created a new proposal variance-covariance matrix from the last 1000 samples.

4. I repeated steps 2 and 3 many times (a rough toy sketch of this tuning cycle follows; my actual code is further below).
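
To make the cycle above concrete, here is a rough, self-contained toy version of the scheme (the three-parameter normal target, the 0.2-0.4 acceptance window, and the halving/doubling of s are only illustrative choices, not my actual settings):

library(mvtnorm)

# Toy stand-in for the log-posterior: a correlated three-dimensional normal.
log_post <- function(theta) {
  S <- matrix(c(1, 0.8, 0,  0.8, 1, 0,  0, 0, 1), nrow = 3)
  dmvnorm(theta, mean = c(0, 0, 0), sigma = S, log = TRUE)
}

npars <- 3
s <- 1                                       # step 1: initial scaling factor
cov.prop <- diag(rnorm(npars, 0.1, 0.01))    # step 1: random diagonal covariance
draws <- matrix(0, nrow = 1, ncol = npars)

for (run in 1:8) {                           # steps 2-4: repeated 1000-iteration runs
  accepted <- 0
  for (t in 1:1000) {
    current <- draws[nrow(draws), ]
    prop <- as.vector(rmvnorm(1, current, s * cov.prop))
    if (log(runif(1)) < log_post(prop) - log_post(current)) {
      current <- prop
      accepted <- accepted + 1
    }
    draws <- rbind(draws, current)
  }
  acc <- accepted / 1000
  if (acc < 0.2) s <- s / 2                  # step 2: too few acceptances, smaller steps
  if (acc > 0.4) s <- s * 2                  # step 2: too many acceptances, larger steps
  if (run >= 3) {                            # step 3: refresh covariance from the last 1000 draws
    cov.prop <- cov(draws[(nrow(draws) - 999):nrow(draws), ])
  }
  cat("run", run, "acceptance rate", round(acc, 2), "s =", s, "\n")
}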

I expected the chains to converge. It happened only for the first 12 columns of ‘draws’, and only until around 120,000 samples, after which even those stopped converging. I would like to find out (a) why only some of the parameters converge, and (b) why even those become non-converging again.

The code I used:

  • for the very first cov matrix:

A <- rnorm(25, 0.1, 0.01)   # updated since the first post; 25 = npars random diagonal entries
cov.prop <- diag(A)

  • for the updated cov matrix:

int <- (step1 - 1000):step1                                   # indices of the most recent draws
M <- (draws[int, , 1] + draws[int, , 2] + draws[int, , 3] + draws[int, , 4]) / 4   # element-wise average of the four chains
cov.prop <- cov(M)

  • for sampling:

library(mvtnorm)   # for rmvnorm()
library(coda)      # for mcmc(), mcmc.list(), gelman.diag()

d <- winequality_red
d <- as.data.frame(d)
y <- d$quality
X <- d[, -12]                                         # drop the response column (quality)
X <- matrix(unlist(X), ncol = dim(X)[2], byrow = FALSE)
X <- cbind(rep(1, nrow(X)), scale(X))                 # intercept column plus standardised predictors

n <- length(y)
k <- ncol(X)                                          # intercept + 11 predictors
npars <- k + (k - 1) + 2                              # betas, taus, lambda2, sigma2

logprior <- function(pars) {
  b <- as.matrix(pars[1:k])
  tau <- pars[(k + 1):(2 * k - 1)]   # one tau per non-intercept coefficient
  lambda2 <- pars[2 * k]
  sigma2 <- pars[npars]

  if (lambda2 <= 0) return(log(0))
  if (sigma2 <= 0) return(log(0))
  if (any(tau < 0)) return(log(0))

  dens <- sum(dnorm(abs(b[2:k]), 0, sqrt(sigma2 * tau), log = TRUE)) +
    dgamma(1 / sigma2, 4, 60, log = TRUE) +
    dgamma(lambda2, 0.025, 0.1, log = TRUE) +
    sum(dexp(tau, lambda2 / 2, log = TRUE))
  return(dens)
}

loglikelihood <- function(y, X, par) {
  b <- as.matrix(par[1:k])
  sigma2 <- par[npars]
  if (sigma2 <= 0) return(log(0))
  dens <- sum(dnorm(y, t(X %*% b), sqrt(sigma2), log = TRUE))
  return(dens)
}

draws <- array(0, dim = c(200000, npars, 4))          # iterations x parameters x 4 chains
draws[1, npars, 1:4] <- 1 / rgamma(1, 4, 60)          # starting value for sigma2
draws[1, npars - 1, 1:4] <- 1                         # starting value for lambda2
draws[1, 1, 1:4] <- mean(y)                           # starting value for the intercept
draws[1, 2:k, 1:4] <- rep(0, k - 1)                   # starting values for the coefficients
draws[1, (k + 1):(2 * k - 1), 1:4] <- rep(1, k - 1)   # starting values for the taus

converged <- rep(0, npars)
step1 <- 2
proposed <- rep(-1, npars)

s <- 1                              # proposal scaling factor (adjusted between runs)
accepted.prop <- rep(1, 4)
accepted.prop.print <- rep(1, 4)    # needed so the first print of acceptances does not fail

while (any(converged < 1)) {

  for (i in 1:4) {                       # update each of the four chains
    proposed <- rmvnorm(1, draws[step1 - 1, , i], s * cov.prop)
    r <- loglikelihood(y, X, proposed) +
      logprior(proposed) -
      loglikelihood(y, X, draws[step1 - 1, , i]) -
      logprior(draws[step1 - 1, , i])
    u <- runif(1)
    if (log(u) < r) {                    # Metropolis-Hastings acceptance step
      draws[step1, , i] <- proposed
      accepted.prop[i] <- accepted.prop[i] + 1
    } else {
      draws[step1, , i] <- draws[step1 - 1, , i]
    }
  }
  step1 <- step1 + 1

  if (step1 %in% seq(1000, 200000, by = 1000)) {       # every 1000 iterations
    for (z in 1:npars) {
      chainlist <- mcmc.list(Chain1 = mcmc(draws[1:(step1 - 1), z, 1]),
                             Chain2 = mcmc(draws[1:(step1 - 1), z, 2]),
                             Chain3 = mcmc(draws[1:(step1 - 1), z, 3]),
                             Chain4 = mcmc(draws[1:(step1 - 1), z, 4]))
      if (gelman.diag(chainlist)$psrf[1, 2] < 1.4) {    # upper CI of the shrink factor
        converged[z] <- converged[z] + 1
      }
    }
    print(converged)
    print(accepted.prop - accepted.prop.print)          # acceptances over the last 1000 iterations
    accepted.prop.print <- accepted.prop
  }
}

Hi,

I haven’t implemented a lasso regression myself, so I can’t comment on the specifics of your model and/or data, but here are some general comments that may help:

If you are using Metropolis-Hastings (MH), it means you are not using Stan, right? One of the ideas of Stan is to let users code the model without worrying about the details of the sampler. Writing your own sampler is great for learning, and MH is probably a good option to start with compared to Hamiltonian Monte Carlo (HMC, which is what Stan implements), but if coding the sampler is your main goal I’d suggest also starting with a simple model where you know it works.
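
For instance, something like the toy sampler below lets you check that the MH machinery recovers a known answer before you move to the lasso model (the simulated data, the flat prior, and the 0.5 step size are all just placeholder choices):

set.seed(1)
y_toy <- rnorm(100, mean = 3, sd = 2)            # simulated data with a known mean

log_post <- function(mu) sum(dnorm(y_toy, mu, 2, log = TRUE))   # flat prior on mu

n_iter <- 5000
mu_draws <- numeric(n_iter)                      # chain starts at 0
accepted <- 0
for (t in 2:n_iter) {
  prop <- rnorm(1, mu_draws[t - 1], 0.5)         # random-walk proposal
  if (log(runif(1)) < log_post(prop) - log_post(mu_draws[t - 1])) {
    mu_draws[t] <- prop
    accepted <- accepted + 1
  } else {
    mu_draws[t] <- mu_draws[t - 1]
  }
}
accepted / (n_iter - 1)                          # acceptance rate
mean(mu_draws[1001:n_iter])                      # should be close to mean(y_toy)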

It sounds like you are implementing some sort of adaptive Metropolis, where the proposals take into account correlations between parameters. This is an improvement on “vanilla” MH, but if your goal is simply to get the whole thing working, it is a somewhat outdated algorithm. Again, it may be good practice, but it may not be enough for your problem (I’d start with zero covariance between parameters to begin with, but since this is just for practice, making sure you implement plain MH correctly may be the best first step).
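
Concretely, by “zero covariance” I just mean a diagonal proposal covariance in the notation of your post, something roughly like this (the 0.1 standard deviations are only placeholders to tune from):

npars <- 25                         # number of parameters, as in your post
prop_sd <- rep(0.1, npars)          # one proposal standard deviation per parameter (placeholder values)
cov.prop <- diag(prop_sd^2)         # diagonal covariance: no correlation between parameters
s <- 1                              # overall scaling factor, tuned from the observed acceptance rate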

HMC (which has at least a few variants) is considered the state-of-the-art sampler, so another option is to implement your model in Stan, which has R interfaces. You could check convergence that way before trying it yourself with MH. Like I said, Stan does not have MH implemented, but if you don’t mind using a different language you could also try Turing (Julia) or PyMC3 (Python) as a benchmark for your implementation.

It seems that your main goal is to gain an understanding of Bayesian statistics and its algorithms, so I’d suggest not trying complicated things all at once, and instead seeing where the difficulties appear as you scale up your models or try to improve your sampler. Unless you have a (very) strong reason to code your own sampler, in practice you will probably use an implementation like the above for real applications.

Thank you very much! Yes, I am in the middle of a learning curve, and for now I would like to understand how MH sampling works (or, when it does not work, why)… Eventually my aim is to end up using Stan. Thanks for your thoughts.
