Missing real predictors in multple regression

Dear community,

I wanted to get to grips with missing data handling in Stan. The most common scenario for me is missing predictors in a multiple linear regression. After reading the manual, and assuming that the predictors are themselves all real and normally distributed, I wrote this example using Rstan:

XR is the design matrix;
Missing values are replaced by any number and are entered as an indexed list e.g. the 5 missing values here are:
$XR_miss_ind
[,1] [,2]
[1,] 2 1
[2,] 3 1
[3,] 5 2
[4,] 7 2
[5,] 2 3

/*
*Multiple linear regression example with missing predictors
*/

data {
  int N; //the number of observations
  int PR; //the number of real columns in the predictor matrix 
  real y[N]; //the response
  matrix[N,PR] XR; //the predictor matrix for reals
  int N_miss_R; // the number of missing reals
  int XR_miss_ind[N_miss_R,2]; // missing real indices
}

parameters {
  real alpha; // add intercept
  vector[PR] beta; //the regression parameters
  real<lower=0> sigma; //the standard deviation
  vector[N_miss_R] imputed_reals; // the imputed reals
  vector[PR] mu_XR;  // mean of XR
  vector[PR] sd_XR;   //sd of XR
}
transformed parameters {
matrix[N,PR] X_imputed = XR;   // imputed X
for ( i in 1:N_miss_R ) {	     X_imputed[XR_miss_ind[i,1],XR_miss_ind[i,2]]=imputed_reals[i];
		} // impute missing X reals
}

model {  
  alpha ~ cauchy(0,10); //prior for the intercept 
  sigma ~ cauchy(0, 0.25); // prior for variance

  for(i in 1:PR)
   beta[i] ~ normal(0,2.5);//prior for the slopes
  
  for(i in 1:PR)
  X_imputed[:,i] ~ normal(mu_XR[i], sd_XR[i]); // priors on XR for missing data

  y ~ normal(alpha + X_imputed*beta,sigma);
}

generated quantities {
vector[N] log_lik;
for (n in 1:N)
log_lik[n] = normal_lpdf(y[n] | alpha + X_imputed[n]*beta, sigma);
}

Result:
image

The model runs fine but with a few funny Rhats in X_imputed - I assume this is because I’m not actually sampling those values, just filling them in from the real data. Is this OK? Is this a valid missing data approach?

Many thanks for your help

Things that are not random variables do not have meaningful Rhat, etc. I think what you wrote is valid under the MAR assumption, but the MAR assumption is usually dubious.

Thanks!

So if the mechanism is Missing Not at Random, would one option be to also model the covariance matrix of x_imputed? Or the covariance of [y x_imputed]?

With MNAR, you can model the probability of missingness, conditional on the imputed / observed values.