Converting ElasticNet ML model to Stan

I have a logistic regression ML model with originally about 40 correlated features, and with ElasticNet I have reduced the number of relevant features down to seven.

I am new to Bayesian modeling, and I thought this project could be a good tech trial for studying Stan. Being new to the field, I lack a gut feeling for what to expect, and I hope this community could give me some insight.

Some questions:

  1. I expected that the peak values of the parameter distributions I’m getting from Stan would be about the same as the coefficients from the ElasticNet logistic regression. However, that is not the case: for some variables the distribution’s peak value differs from the ML coefficient by about 2x. The distributions are not skewed; they are roughly normal. Is this “normal” behavior?

  2. Can I use the values I get from ElasticNet as priors? E.g., a normal distribution with mu equal to the coefficient from ElasticNet.

  3. What is the best practice for feature screening in Bayesian modeling? Should I put all 40 of my features into the model, or only the screened seven?

When you say ML, do you mean maximum likelihood?

ElasticNet can refer to any linear combination of L1 (lasso) and L2 (ridge) penalties. These have different Bayesian equivalents: typically the Laplace and normal priors, respectively. Can you clarify how you fit your ElasticNet model?
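To make the correspondence explicit (this is the standard MAP argument, not specific to your model): with a Laplace prior $p(\beta_j) \propto \exp(-\lambda |\beta_j|)$ the negative log prior is the L1 penalty, and with a normal prior $p(\beta_j) \propto \exp(-\beta_j^2 / (2\tau^2))$ it is the L2 penalty:

$$-\log p(\beta) = \lambda \sum_j |\beta_j| + \mathrm{const} \quad \text{(Laplace)}, \qquad -\log p(\beta) = \frac{1}{2\tau^2} \sum_j \beta_j^2 + \mathrm{const} \quad \text{(normal)}.$$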

There are a few different viewpoints on Bayesian regularized regression. I believe the Tibshirani book on sparse statistical methods (Statistical Learning with Sparsity) contains a summary of several approaches. To answer your questions point by point:

  1. Most ElasticNet implementations standardize the model inputs. Are you following the same standardization procedure as the ElasticNet implementation you’re using? This affects coefficient magnitudes; also, assuming you set your priors in the usual way (the same scale for every coefficient), standardizing the covariates is a requirement for the regularization to work properly.

  2. While technically feasible, I’m not sure this is the optimal approach in terms of correctness. Since you mention the normal distribution, it seems you’re doing ridge regression, which I’m not as familiar with, so I’ll let someone else field that part. But since you’re also talking about feature screening, I wonder if you intend to focus on the L1 penalty? In that case, the approaches I’m most familiar with are (a) to set a hyperprior over the penalty parameter, commonly called lambda, or (b) to use a Monte Carlo EM approach with iterative updates of lambda. Either can be easily implemented in Stan; see the sketch after this list.

  3. I think you’re looking for a basic regression model with Laplace priors on the coefficients; if so, my recommendation for a fully Bayesian treatment would be to feed all 40 variables to the model and let the prior do the shrinkage.
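To make option (a) concrete, here is a minimal Stan sketch of such a model: a Bayesian logistic regression with Laplace (double-exponential) priors on the coefficients and a hyperprior on the shrinkage scale. The data names and the specific prior scales are my own illustrative assumptions, not a prescription:

```stan
// Minimal sketch (illustrative assumptions, not a prescription):
// Bayesian logistic regression with Laplace priors on standardized
// covariates and a hyperprior on the shrinkage scale lambda.
data {
  int<lower=0> N;                    // number of observations
  int<lower=0> K;                    // number of candidate features (e.g. all 40)
  matrix[N, K] X;                    // standardized covariates
  array[N] int<lower=0, upper=1> y;  // binary outcome
}
parameters {
  real alpha;                 // intercept
  vector[K] beta;             // coefficients
  real<lower=0> lambda;       // shrinkage scale, option (a) above
}
model {
  alpha ~ normal(0, 5);       // weakly informative; an assumption to tune
  lambda ~ cauchy(0, 1);      // half-Cauchy via the <lower=0> constraint
  beta ~ double_exponential(0, lambda);
  y ~ bernoulli_logit(alpha + X * beta);
}
```

Note that while the posterior mode of such a model corresponds to a lasso-type solution, the full posterior that Stan explores (and its mean) need not sit at that mode, which is also relevant to question 1.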

Hope this helps! Happy to clarify anything as well.


Thank you, Walter, for your useful comments.
I will try these!

///

ML = machine learning. This is a good indication of how new I am to the Bayesian domain.
I did not even consider that ML could mean maximum likelihood :-)

The ElasticNet alpha == 0.8, so my coefficient penalty term is alpha*L1 + (1-alpha)*L2, i.e. a mixture of 80% lasso and 20% ridge. It does not drop variables as aggressively as a pure lasso, and I need those few additional variables for better interpretability.
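For reference, a minimal statement of the objective glmnet minimizes (per its documentation; here with my $\alpha = 0.8$), noting that glmnet puts a factor of $1/2$ on the ridge term:

$$\min_{\beta_0, \beta}\; -\frac{1}{N} \sum_{i=1}^{N} \log p\!\left(y_i \mid \beta_0 + x_i^\top \beta\right) \;+\; \lambda \left[ \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \right].$$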

I have standardized the inputs myself, and the data is the same for both ElasticNet and Stan. Also, to my understanding, my ElasticNet library (glmnet) standardizes the variables internally but returns coefficients on the original scale.

I think you may find Michael Betancourt’s case study on sparse regressions useful: https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html

It doesn’t treat the Elastic Net directly but explains several shrinkage methods from a Bayesian perspective.


I would suggest adopting the regularized horseshoe prior, which is easily available in both brms and rstanarm. To sparsify your solution, you can then use projpred.
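For reference, a sketch of the regularized horseshoe (as in Piironen and Vehtari, 2017): per coefficient,

$$\beta_j \sim \mathrm{N}\!\left(0,\, \tau^2 \tilde{\lambda}_j^2\right), \qquad \tilde{\lambda}_j^2 = \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2}, \qquad \lambda_j \sim \mathrm{C}^{+}(0, 1),$$

where $\tau$ is a global shrinkage scale, the $\lambda_j$ are local scales, and the slab width $c$ ensures that large coefficients are regularized like a mild ridge instead of escaping shrinkage entirely.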


You can find videos, case studies, and papers about the regularised horseshoe and variable selection with projpred at https://avehtari.github.io/modelselection/ In this case, the diabetes case study is probably the closest to what you want.
