The difference between Bayesian and Frequentist misspecified models

Hi all,

I’m currently comparing predictive projection, the Lasso, and some other variable selection methods on a logistic regression. For this comparison I use simulated data (including collinear data), and for most of the simulated datasets the predictive performance of predictive projection and the Lasso is similar.

I also use a misspecified model, where the logistic model and the data generating process assume different functional forms. The logistic model has linear relations, while the data generating process uses a step function and an exponential relation. In this case predictive projection performs better than the Lasso.
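For concreteness, here is a minimal sketch of the kind of setup I mean (the coefficients, the step's cutpoint, the collinearity structure, and the penalty strength are just placeholder choices, not my actual simulation settings):

```python
import numpy as np
from scipy.special import expit  # inverse-logit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 500, 10

# Collinear predictors: x2 is a noisy copy of x1
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)

# Data generating process: a step function of x1 plus an exponential of x3,
# so any logistic model with only linear terms is misspecified
eta = 1.5 * (X[:, 0] > 0) + 0.5 * np.exp(X[:, 2]) - 1.0
y = rng.binomial(1, expit(eta))

# The Lasso side of the comparison: L1-penalised logistic regression on the
# linear (hence misspecified) terms; C=1.0 is an arbitrary placeholder
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
```

The Bayesian side of the comparison (reference model plus projection) is then fit on the same X and y.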

I once heard that Bayesian models are relatively better than frequentist models under misspecification; however, I could never find any evidence to support this claim. Does anyone know whether this is true? If so, why is this the case?

Isn’t it the case that neither the Lasso nor predictive projection is fully Bayesian?

Aki is the expert here… but you are correct. The predictive projection is also not fully Bayesian, but arguably a lot closer to it than the Lasso. One fully Bayesian approach would be Bayesian model averaging, for example.

Presumably, there is already a “full Bayesian” model from which the projective approach starts, say a sparse regression. But to me it is an unfair comparison: the Lasso itself is not an inference procedure, it is a model, which can also be Bayesianised even though it is not the optimal model to use. A Lasso with a Student-t likelihood can also be very robust.

It is hard for me to believe there can be a universal conclusion on the relative robustness of Bayesian vs. non-Bayesian approaches. Bayesian and non-Bayesian analyses tend to use different models. Are we really saying the horseshoe is more robust than the Lasso, or are we saying Bayesian is more robust than non-Bayesian?


Unfortunately I don’t have much time to contribute to the discussion, but I’ll comment on the one thing for which I have an answer ready.

The projection predictive (notice the order of the words) is beyond full-Bayesian. It starts from the fullest Bayesian model possible. Then we use decision theory to answer the question of how to make optimal Bayesian inference in the future if we observe only some of the covariates. Using decision theory doesn’t make it less Bayesian (especially if we follow Bernardo & Smith, who start the axioms from preferences, tying utilities to be part of the Bayesian theory). There are some computational approximations to make it faster, and we could then discuss how much these approximations are allowed to affect the result compared to theoretically exact computation. I think the biggest weaknesses of the approach are usually 1) that the first model wasn’t full enough and 2) that the cost of covariates is often not included explicitly (but that is a utility choice made by the user).
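For readers who haven’t seen the projection written out, a rough sketch of the idea (my notation, not anything used above) is that each submodel draw is chosen to minimise the KL divergence from the reference model’s predictive distribution:

$$
\theta_\perp^{(s)} = \arg\min_{\theta_\perp} \, \mathrm{KL}\!\left( p(\tilde{y} \mid \theta^{(s)}) \,\big\|\, p(\tilde{y} \mid \theta_\perp) \right),
$$

where $\theta^{(s)}$ is a posterior draw from the full reference model and $\theta_\perp^{(s)}$ the corresponding projected draw in the submodel with the restricted covariate set. Doing this draw by draw is also why the submodel inherits a full set of posterior draws, as mentioned below.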

To be honest, I always trip up on that word ordering! :)

Just to add to it: after projection onto a submodel, we still have all posterior samples available for that submodel, so any fully Bayesian post-estimation analyses can still be performed. Nothing much really gets lost (well, computational approximations apart).

This could be handled by adding penalties when the variable selection is performed, or at least that’s where I’d start.

Now I see that I had not yet mentioned that I use the horseshoe prior, but you already picked up on that.

The projection itself does not really change the predictive performance (by design). So I think it might have to do with something else.

If you take a Bayesian view of the Lasso regression, then you can see it as the MAP estimate of the logistic regression with a Laplace prior on the regression coefficients. The Laplace prior has lighter tails than the horseshoe prior, and this might make the latter more robust against outliers (O’Hagan 1979). Could this also result in robustness against (certain types of) misspecification?
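To spell that equivalence out (my notation, with $\sigma$ the inverse-logit and $\lambda$ the Laplace prior’s rate): with $p(\beta_j) \propto \exp(-\lambda \lvert \beta_j \rvert)$ on each coefficient, the log-posterior of the logistic regression is, up to a constant,

$$
\log p(\beta \mid y) = \sum_{i=1}^{n}\left[ y_i \log \sigma(x_i^\top \beta) + (1-y_i)\log\bigl(1-\sigma(x_i^\top \beta)\bigr)\right] \;-\; \lambda \sum_{j} \lvert \beta_j \rvert \;+\; \mathrm{const},
$$

so maximising it (the MAP estimate) is exactly the L1-penalised logistic regression that the Lasso solves.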

On the other hand, I could also imagine that, when using a posterior distribution to make a prediction, you get (slightly) different results than when using a point estimate (as with the Lasso).
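In other words (again my notation), the comparison is between the posterior predictive distribution, which averages the prediction over the posterior, and the plug-in prediction at a single point estimate $\hat{\beta}_{\text{lasso}}$:

$$
p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \beta)\, p(\beta \mid y)\, \mathrm{d}\beta
\qquad \text{vs.} \qquad
p(\tilde{y} \mid \hat{\beta}_{\text{lasso}}).
$$

Because the inverse-logit is nonlinear, averaging over the posterior typically gives less extreme predicted probabilities than plugging in a point estimate, which could matter under misspecification.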
