How to sample with the 1-norm?

Dear All,

I am currently working on ridge regression, which can be interpreted using Bayesian statistics (DOI: 10.1016/j.electacta.2015.03.123). In particular, I know that the maximum-a-posteriori (MAP) estimate can be computed with its uncertainty. However, is it possible to evaluate the MAP with Stan when the 1-norm is used instead of the 2-norm, i.e., for lasso regression? Moreover, how do you sample with the 1-norm?


If you use a Laplace (double exponential) prior you can obtain a Bayesian equivalent framework to Lasso regression. The one dimensional Laplace distribution is specified by a mean, \mu, and a scale, \lambda such that the probability density function is given by:

Laplace(x|\mu, \lambda) = 1/(2\lambda) exp(-|x-\mu|/\lambda)

In stan you can use the sampling statement for the laplace prior:
x~double_exponential(mu, lambda)

Thank you so much for your reply @Garren_Hermanus! Could you please also provide more details about the code, especially the selection of the parameters of the Laplace distribution?

It seems like you just need more background regarding Bayesian modelling. I would recommend you read through “Pattern Recognition and Machine Learning” by Bishop. To induce sparsity we want to regularize the parameters towards zero, thus we chose \mu=0. For \lambda you can follow 1 of 2 approaches.

  1. Chose fixed values for \lambda and train your model based on these fixed values. To evaluate which of these is best suited you would need to evaluate the evidence term and use Bayes’ factors to determine which setting of \lambda is best suited for the model.
  2. Place an additional prior on \lambda. The choice of this depends on the users. You may chose to use a uninformative prior in terms of the uniform distribution P(\lambda)=1/\lambda_{max} where \lambda \in [0, \lambda_{max}] or one the exponential distribution such as the Gamma or inverse gamma distributions. As a starting point I would use the uniform prior on \lambda and train the model based on this. That is we will have the full bayesian model:
P(x,\lambda | \mathcal{D}) \propto P(\mathcal{D}|x)P(x|\lambda)P(\lambda)

Where \mathcal{D} is the data. The above model will give you a distribution across \lambda and x.

Suppose we have the model y=f(x)+\epsilon where f(x)=X\beta. We would have the model:

P(\beta|\lambda) = Laplace(\beta|0,\lambda)

Sample stan code for model is

model {
   lambda ~ uniform(0,lambda_max); // not really needed if lambda defined as real<lower=0, upper=lambda_max>
   beta ~ double_exponential(0,lambda);
   y ~ normal(X*beta, sigma); 

Consult the stan documentation for further details regarding writing stan code