I find that data reduction (unsupervised learning) is a valuable part of my modeling strategy when there are too many predictors for the effective sample size available. I also like penalized regression but tools like principal components, sparse principal components, and variable clustering also have roles.

Does anyone know of a Bayesian modeling approach that provides a one-step approach to restricting a model to emphasize orthogonal collapsed covariate dimensions as principal components regression does? I would imagine that scaling issues are tough to deal with in this context.


Is this specifically for regression problems? Do you assume that all of your covariates are observed? If so, you could apply any standard dimensionality reduction scheme to the design matrix before sampling the regression weights. You could also model the design matrix with probabilistic PCA or a similar probabilistic factor model, but then you would certainly run into scaling and non-identifiability issues.
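A minimal numpy sketch of that two-step route, on toy data, using a ridge estimate as a stand-in for the Bayesian step (it is the posterior mean under a zero-mean Gaussian prior on the collapsed weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design matrix: 100 observations, 10 correlated covariates
n, p, k = 100, 10, 3
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Step 1: PCA via SVD of the centered design matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T            # scores on the first k principal components

# Step 2: regression on the collapsed covariates; the ridge estimate
# below is the posterior mean under a N(0, (1/lam) I) prior on gamma
lam = 1.0
gamma = np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y)

# Map back to the raw-covariate scale for interpretation
beta = Vt[:k].T @ gamma
```

Since `beta` lies in the span of the first `k` loading vectors, `Xc @ beta` and `Z @ gamma` give identical fitted values.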

This is a standard regression problem, and I realize that I can easily run a two-step solution. But I would like to have a unified Bayesian framework with one step. Ideally there would be a standard raw-variable model specification but a prior that favors emphasizing orthogonal projections of the raw covariates.

Hmm, the usual advice for Bayesian regression in Stan is to do a two-step approach and apply a prior to the orthogonal coordinates, but if you want a prior on the raw coordinates you can map it back. As per the Stan manual, we convert the covariates X to an orthogonal basis via a QR transform, i.e. X=QR. Instead of putting a prior on the regression weights \beta, we instead define a new set of regression weights \gamma which are applied to the orthogonal basis instead, i.e. \hat{y}=Q\gamma. \gamma is then related to the original regression weights by \gamma=R\beta. If we impose a Gaussian prior N(\mu,\Sigma) on the orthogonal parameters \gamma, then the corresponding prior on the raw weights would be N(R^{-1}\mu,R^{-1}\Sigma R^{-T}). By tuning \mu and \Sigma you can change your assumptions on the orthogonal weights. There is really no reason not to do the computation using \gamma and then map it back to \beta post hoc, though (I think).
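The QR reparameterization above can be sketched in numpy on toy data. For simplicity this uses a flat prior on \gamma, so the posterior mean is just least squares; the last lines show the prior-covariance mapping R^{-1}\Sigma R^{-T} with an illustrative identity \Sigma:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Thin QR decomposition: X = Q R, with Q having orthonormal columns
Q, R = np.linalg.qr(X)

# Under a flat prior the posterior mean of gamma is least squares on Q;
# because Q^T Q = I, the solve reduces to a projection
gamma = Q.T @ y

# Map back to the raw-covariate weights: gamma = R beta  =>  beta = R^{-1} gamma
beta = np.linalg.solve(R, gamma)

# Fitted values agree in either parameterization: Q gamma == X beta
assert np.allclose(Q @ gamma, X @ beta)

# Mapping a N(mu, Sigma) prior on gamma to the raw scale:
# beta ~ N(R^{-1} mu, R^{-1} Sigma R^{-T})
Sigma = np.eye(p)                 # illustrative prior covariance on gamma
Rinv = np.linalg.inv(R)
Sigma_beta = Rinv @ Sigma @ Rinv.T
```

With a full-rank X, `beta` here coincides with the ordinary least-squares estimate, which is the point of the reparameterization: sample in the well-conditioned \gamma space, report in \beta space.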


If you know the dimension of the collapsed space then you can convert PCA into a generative model -- see Section 12.2 of Bishop's "Pattern Recognition and Machine Learning". One immediate challenge, however, is that the matrix mapping the collapsed latent space to the full observed space lives on a Stiefel manifold, which can't be specified by independent real parameters. Consequently, we cannot currently fit that matrix as hyperparameters unless you can place additional constraints on it.
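For a fixed latent dimension q, the maximum-likelihood fit of that generative model (probabilistic PCA, as in Bishop §12.2) has a closed form via the eigendecomposition of the sample covariance. A numpy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, q = 500, 6, 2

# Simulate from a PPCA model: x = W z + noise, z ~ N(0, I)
W_true = rng.standard_normal((d, q))
Z = rng.standard_normal((n, q))
X = Z @ W_true.T + 0.1 * rng.standard_normal((n, d))

# Closed-form ML solution: eigendecompose the sample covariance
S = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(S)
evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending

# Noise variance = average of the discarded eigenvalues
sigma2 = evals[q:].mean()

# Loadings: top-q eigenvectors scaled by sqrt(eigenvalue - noise);
# this W is only identified up to a rotation of the latent space
W = evecs[:, :q] @ np.diag(np.sqrt(evals[:q] - sigma2))

# Implied model covariance W W^T + sigma^2 I matches S in the top q directions
C = W @ W.T + sigma2 * np.eye(d)
```

The rotational non-identifiability visible in `W` is exactly the Stiefel-manifold issue mentioned above: a sampler exploring W directly wanders over an orbit of equivalent solutions unless it is constrained.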

If you want to learn the dimension as well, then you're dealing with a trans-dimensional model comparison problem, which I would argue is very unlikely to be accurately fittable with existing technologies.


Thanks very much for both of your very perceptive and helpful comments. I have a lot to learn. It would be nice to have fairly general directed penalties. We are working on one such area: cutting out separate propensity score models for observational treatment comparisons, and instead handling the case where there are too many adjustment covariates by penalizing their effects according to their propensity coefficients, which are just adjusted distance metrics. Covariates that are more balanced across treatment groups would be penalized more.