"Regress out" covariates that are indepent to the rest before a full Bayesian inference

Hi there,

May I ask a statistical question here? In the linear regression model y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_N x_N + \epsilon, we have N covariates x_n. Suppose my domain knowledge tells me that a subset of the covariates, x_{4:N}, is independent of the rest, x_{1:3} (the corresponding \beta_{1:3} are the parameters of interest). For a point estimate, one can (a code sketch follows the list):

  1. first “regress out” x_{4:N} from both y and x_{1:3}, using OLS,

  2. then solve y_{\textrm{reg}} = \beta_1 x_{\textrm{reg},1} + \beta_2 x_{\textrm{reg},2} + \beta_3 x_{\textrm{reg},3} + \epsilon.
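A minimal numpy sketch of this two-step point-estimate procedure, using simulated data and made-up variable names (nothing here is from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the first 3 covariates are of interest, the remaining
# ones are nuisance covariates generated independently of them.
n_obs, n_int, n_nuis = 500, 3, 10
X_int = rng.normal(size=(n_obs, n_int))
X_nuis = rng.normal(size=(n_obs, n_nuis))
beta_int = np.array([1.0, -0.5, 2.0])
beta_nuis = rng.normal(size=n_nuis)
y = X_int @ beta_int + X_nuis @ beta_nuis + rng.normal(scale=0.3, size=n_obs)

# Step 1: "regress out" the nuisance covariates from y and from each
# covariate of interest by keeping the OLS residuals.
def residualize(target, Z):
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ coef

y_reg = residualize(y, X_nuis)
X_reg = np.column_stack([residualize(X_int[:, j], X_nuis) for j in range(n_int)])

# Step 2: OLS of the residualized outcome on the residualized covariates.
beta_hat, *_ = np.linalg.lstsq(X_reg, y_reg, rcond=None)
print(beta_hat)  # close to beta_int; identical to the X_int coefficients of a
                 # joint OLS fit on [X_int, X_nuis] (Frisch-Waugh-Lovell)
```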

I wonder if this process is recommended for Bayesian inference? For instance, we do step 1 with OLS, then full Bayes for step 2. Since the parameters of interest are only \beta_{1:3}, it would save some computation time not to estimate posteriors for \beta_{4:N}, especially when N is large. I guess the MAP estimate might be the same as if we solved everything at once, y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_N x_N + \epsilon, with full Bayes. However, I am not sure whether the estimated uncertainties of \beta_{1:3} are the same under the two procedures. I suspect the uncertainties of \beta_{1:3} from the two-step procedure are only conditional on the mode of \beta_{4:N}. Can we gain any computational convenience from the fact that x_{4:N} are independent of x_{1:3}?

Thank you very much,
Michelle


Of course!

Only if the problem is so large-scale that it can't be tackled with full Bayes. Ideally, in that situation we'd marginalize out the nuisance parameters rather than optimize them, à la INLA (an R package for doing this). There's also a more ad hoc technique called "empirical Bayes", which often optimizes the marginalized distribution rather than sampling from it.
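To make "marginalize out" concrete: writing \beta_{1:3} for the parameters of interest and \beta_{4:N} for the nuisance parameters, the target is the marginal posterior

p(\beta_{1:3} \mid x, y) = \int p(\beta_{1:3}, \beta_{4:N} \mid x, y) \, \textrm{d}\beta_{4:N}.

MCMC gives you this for free: the draws of \beta_{1:3}, taken on their own, are draws from this marginal, regardless of what you do with the draws of \beta_{4:N}.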

MAP isn’t Bayesian inference, per se; it’s a form of penalized maximum likelihood. Full Bayesian inference averages over posterior uncertainty. A typical parameter estimate is the posterior mean, which minimizes expected squared error (if the model is well specified).
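Spelled out, the posterior mean is the estimate that minimizes expected squared error under the posterior,

\hat{\beta} = \mathbb{E}[\beta \mid x, y] = \underset{b}{\textrm{arg min}} \; \mathbb{E}\!\left[\lVert \beta - b \rVert^2 \mid x, y\right],

whereas MAP maximizes the posterior density, \beta^{\textrm{MAP}} = \underset{\beta}{\textrm{arg max}} \; p(\beta \mid x, y), without averaging over it.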

As another example, if we want to predict a new item y' given new predictors x'_{1:n} after observing some data y, x, the posterior predictive density is

\begin{eqnarray*}
p(y' \mid y, x, x') & = & \mathbb{E}\!\left[p(y' \mid x', \beta) \mid x, y\right] \\[4pt]
& = & \int p(y' \mid x', \beta) \cdot p(\beta \mid x, y) \, \textrm{d}\beta \\[4pt]
& \approx & \frac{1}{M} \sum_{m=1}^M p(y' \mid x', \beta^{(m)}),
\end{eqnarray*}

and we evaluate with MCMC using posterior draws \beta^{(m)} \sim p(\beta \mid x, y). We need the draws for all the parameters together. If we replaced \beta_{4:N} with constants, we’d systematically underestimate uncertainty.
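A rough numpy illustration of that last point, assuming we already have posterior draws beta_draws (an M-by-N array) and sigma_draws from some sampler; all the names here are made up for the example:

```python
import numpy as np

def predictive_draws(x_new, beta_draws, sigma_draws, rng):
    """Posterior predictive draws of y' that average over all posterior
    draws of the coefficients and the noise scale."""
    mu = beta_draws @ x_new             # one predictive mean per posterior draw
    return rng.normal(mu, sigma_draws)  # add observation noise, draw by draw

def plug_in_predictive_draws(x_new, beta_draws, sigma_draws, rng, nuisance_idx):
    """Same, but with the nuisance coefficients frozen at their posterior
    means, which is roughly what replacing beta_{4:N} with constants does."""
    beta_fixed = beta_draws.copy()
    beta_fixed[:, nuisance_idx] = beta_draws[:, nuisance_idx].mean(axis=0)
    return rng.normal(beta_fixed @ x_new, sigma_draws)

# Comparing np.std(...) of the two sets of draws typically shows the plug-in
# version is narrower, i.e. it understates the uncertainty in y'.
```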


Thank you very much @Bob_Carpenter. I guess the short conclusion would be: whether or not the covariates are independent of each other, we estimate the uncertainty of the parameters more accurately by solving for them all together, because this lets the uncertainty “propagate” across parameters.

Could you please elaborate a bit more on “marginalizing out” the nuisance parameters? I understand it as replacing the OLS point estimate in step 1 with a full Bayesian estimate (MCMC or variational), then averaging over (marginalizing?) the posteriors of the nuisance parameters, which yields the residualized y_{\textrm{reg}} and x_{\textrm{reg},1:3} that serve as input to step 2. So this process would essentially use the posterior means of the nuisance parameters as constants instead of the OLS point estimates, am I right? Then, if the posteriors of the nuisance parameters are close enough to Gaussian, shouldn't this process give nearly the same y_{\textrm{reg}} and x_{\textrm{reg},1:3} as the point-estimate version?

This example really helps :) Thanks again.