Adjust prior given an omitted variable

Assume I have four variables y, x1, x2, x3. I know that x1…3 influence y. I also know that x1 and x2 are positively correlated. For some reason I cannot observe x1 anymore. Assuming x1 has a positive effect on y, this means my estimate for c (the coefficient on x2 in the model below) will be too high.

y ~ normal(a + c * x2 + d * x3, sigma)

I can lower the prior mean for c, and I can also increase the prior for sigma (since x1 is now part of the error), but neither corrects the bias in c very much.
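For reference, the size of the bias can be written down with the standard omitted-variable-bias result. This is a sketch under assumptions not stated in the post: the true model is linear, b is a symbol introduced here for the coefficient on x1, rho is corr(x1, x2), and x3 is taken to be uncorrelated with both x1 and x2 to keep the formula simple.

```latex
\begin{align*}
\text{true model:}\quad
  & y = a + b\,x_1 + c\,x_2 + d\,x_3 + \varepsilon \\
\text{fitted without } x_1\text{ (large samples):}\quad
  & \hat{c} \;\to\; c + b\,\frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_2)}
    = c + b\,\rho\,\frac{\sigma_{x_1}}{\sigma_{x_2}}
\end{align*}
```

With b > 0 and rho > 0 the extra term is positive, so c is pulled upward. The shift comes from the likelihood, so a prior on c only counteracts it if it is strong relative to the data, and a larger sigma changes the noise scale without moving c.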

Can I model some latent variable x1hat (function of x2, given assumed correlation) that can reduce c?

Makes sense to me, especially if you’re including both data where x1 is present and data where it’s absent. Then it’s just missing data imputation.

Thanks Mike, do you have a suggestion for how to model that? Let’s assume corr(x1, x2) = .5.

Keep in mind I don’t have x1.

I played around with something similar to this a few years back. Here is the Stan code I used (this is about 4 years old, so some syntax may need to be updated and it could probably be improved):

data {
  int<lower=0> n;
  vector[n] y;
  vector[n] x;                      // observed predictor
  real<lower=-1, upper=1> rho;      // assumed correlation between x and u
  real g;                           // assumed slope on the unobserved u
}
transformed data {
  matrix[2,2] sigx;
  vector[2] mux;
  sigx[1,1] = variance(x); sigx[1,2] = sd(x)*rho;
  sigx[2,1] = sigx[1,2]; sigx[2,2] = 1.0;
  mux[1] = mean(x);
  mux[2] = 0.0;
}
parameters {
  real b0;
  real b1;
  real<lower=0> sig;
  vector[n] u;                      // unobserved predictor, mean 0 and sd 1
}
transformed parameters {
  array[n] vector[2] xx;
  for(i in 1:n) {
    xx[i,1] = x[i];
    xx[i,2] = u[i];
  }
}
model {
  array[n] real eta;
  for(i in 1:n) {
    eta[i] = b0 + b1*xx[i][1] + g*xx[i][2];
  }
  xx ~ multi_normal(mux, sigx);
  y ~ normal( eta, sig);
}

x is the observed predictor (x2 in your example; in my own notes I call it x1). I don’t have an x3, but you could add it.

rho is the correlation between x1 and x2
g is the slope on the unobserved variable

I believe that to get anything meaningful you need to specify rho and g exactly. If you put a prior on either (or both), or worse, use the default flat prior, then the posterior will have a flat or nearly flat region in a subspace around the mode (an identifiability problem), which will lead to results that are not very useful.

The variable u is just the unobserved x variable (x1 for your case) which I force to have mean 0 and standard deviation 1 (you could change this, but if u is unobserved, then all transformations of u, including the standardization, are also unobserved).

The transformed data section just creates the variance matrix for the relationship between the x variables and the mean vector of the x variables.

The transformed parameters section creates xx, an array of vectors that acts like a matrix of the x variables (this could probably be improved): the first “column” is the observed x variable (x2) and the second column is the unobserved predictor (x1, which is u, a parameter in Stan terms).

The model then uses a multi_normal prior/likelihood on the xx matrix (to bring in the information about the correlation between x1 and x2) and a normal likelihood for y based on the regression model.
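Written out, the multi_normal statement corresponds to the joint distribution below. Because x is data, only the conditional distribution of u given x actually constrains the parameters, and it follows from the usual bivariate normal formulas. Here \bar{x} and s_x stand for mean(x) and sd(x) from the transformed data block, and the second argument of each normal is a variance (or covariance matrix), not a standard deviation.

```latex
\begin{align*}
\begin{pmatrix} x_i \\ u_i \end{pmatrix}
  &\sim \mathcal{N}\!\left(
    \begin{pmatrix} \bar{x} \\ 0 \end{pmatrix},
    \begin{pmatrix} s_x^2 & \rho\, s_x \\ \rho\, s_x & 1 \end{pmatrix}
  \right) \\
u_i \mid x_i
  &\sim \mathcal{N}\!\left( \rho\,\frac{x_i - \bar{x}}{s_x},\; 1 - \rho^2 \right)
\end{align*}
```

So each u_i is pulled toward rho times the standardized x_i, and the larger |rho| is, the less freedom u has to move independently of x.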

The code should really have priors on b0, b1, and sig, and the likelihood for y could probably be improved to use matrix multiplication/vectorization.
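A minimal sketch of what those two improvements might look like, written in current Stan syntax. The prior scales (normal(0, 5), exponential(1)) are placeholders picked for illustration, not values from the original code, and should be set on the scale of the actual data. The only structural changes are that xx becomes a local variable in the model block and the linear predictor is written as one vector expression.

```stan
data {
  int<lower=0> n;
  vector[n] y;
  vector[n] x;                      // observed predictor
  real<lower=-1, upper=1> rho;      // assumed corr(x, u), fixed
  real g;                           // assumed slope on u, fixed
}
transformed data {
  vector[2] mux = [mean(x), 0.0]';
  matrix[2, 2] sigx = [[variance(x), sd(x) * rho],
                       [sd(x) * rho, 1.0]];
}
parameters {
  real b0;
  real b1;
  real<lower=0> sig;
  vector[n] u;                      // unobserved predictor, mean 0 and sd 1
}
model {
  // pair each observed x with its latent u for the bivariate normal
  array[n] vector[2] xx;
  for (i in 1:n) {
    xx[i] = [x[i], u[i]]';
  }

  // placeholder priors (assumption): adjust to the scale of y and x
  b0 ~ normal(0, 5);
  b1 ~ normal(0, 5);
  sig ~ exponential(1);

  xx ~ multi_normal(mux, sigx);

  // vectorized likelihood, replaces the element-wise eta loop
  y ~ normal(b0 + b1 * x + g * u, sig);
}
```

With more observed predictors, the same likelihood line could use a design matrix and a coefficient vector instead of separate slopes.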

The simulations that I tried looked promising and I have been meaning to come back and explore this idea some more (my thought is to add something along these lines to the obsSens R package). Please let me know if this works for you, and/or what improvements you make.

Hope this helps get you started.

Thanks Greg, this looks great.

I have tested it and it works. I added my x3 to it as well. I am still thinking about whether I can put a tight prior on g.
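One way the tight prior on g could look, as an untested sketch rather than a recommendation. It uses the variable names from the original question (a, c, d, sigma) and moves g from data to parameters. The inputs g_mean and g_sd are hypothetical names introduced here, and the other prior scales are placeholders. Since x2 is data, the bivariate normal on (x2, u) is replaced by the equivalent conditional distribution of u given x2, which requires |rho| < 1. The sketch also assumes x1 is related to x3 only through x2.

```stan
data {
  int<lower=0> n;
  vector[n] y;
  vector[n] x2;                     // observed, correlated with the missing x1
  vector[n] x3;                     // observed
  real<lower=-1, upper=1> rho;      // assumed corr(x1, x2), still fixed
  real g_mean;                      // hypothetical: prior guess for the slope on x1
  real<lower=0> g_sd;               // hypothetical: small value gives a tight prior
}
transformed data {
  vector[n] z2 = (x2 - mean(x2)) / sd(x2);   // standardized x2
}
parameters {
  real a;
  real c;
  real d;
  real g;                           // slope on the latent x1, now estimated
  real<lower=0> sigma;
  vector[n] u;                      // latent x1, mean 0 and sd 1
}
model {
  // placeholder priors (assumption): adjust to the scale of the data
  a ~ normal(0, 5);
  c ~ normal(0, 5);
  d ~ normal(0, 5);
  sigma ~ exponential(1);

  g ~ normal(g_mean, g_sd);         // the tight prior in question

  // conditional of u given x2 under a bivariate normal with correlation rho
  u ~ normal(rho * z2, sqrt(1 - square(rho)));

  y ~ normal(a + g * u + c * x2 + d * x3, sigma);
}
```

Given the identifiability concern raised earlier in the thread, the posterior for g will likely stay close to its prior, so the result for c should be read as conditional on that prior rather than as something the data have confirmed. Refitting with a few different values of g_mean and g_sd would show how sensitive c is to the choice.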