Missing real predictors in multiple regression

Ashwani_Jha · August 14, 2018, 2:42pm

Dear community,

I wanted to get to grips with missing data handling in Stan. The most common scenario for me is missing predictors in a multiple linear regression. After reading the manual, and assuming that the predictors are themselves all real and normally distributed, I wrote this example using Rstan:

XR is the design matrix.
Missing values are replaced by an arbitrary placeholder number, and their positions are passed in as a list of (row, column) indices. For example, the 5 missing values here are:
$XR_miss_ind
     [,1] [,2]
[1,]    2    1
[2,]    3    1
[3,]    5    2
[4,]    7    2
[5,]    2    3

/*
*Multiple linear regression example with missing predictors
*/

data {
  int N; //the number of observations
  int PR; //the number of real columns in the predictor matrix 
  real y[N]; //the response
  matrix[N,PR] XR; //the predictor matrix for reals
  int N_miss_R; // the number of missing reals
  int XR_miss_ind[N_miss_R,2]; // missing real indices
}

parameters {
  real alpha; // add intercept
  vector[PR] beta; //the regression parameters
  real<lower=0> sigma; //the standard deviation
  vector[N_miss_R] imputed_reals; // the imputed reals
  vector[PR] mu_XR;  // mean of XR
  vector<lower=0>[PR] sd_XR;   // sd of XR (constrained positive, as normal() requires)
}
transformed parameters {
  matrix[N,PR] X_imputed = XR;   // imputed X, starting from the data (placeholders included)
  for (i in 1:N_miss_R) {
    // impute missing X reals: overwrite each placeholder with its parameter
    X_imputed[XR_miss_ind[i,1], XR_miss_ind[i,2]] = imputed_reals[i];
  }
}

model {  
  alpha ~ cauchy(0,10); //prior for the intercept 
  sigma ~ cauchy(0, 0.25); // prior for the residual standard deviation

  for (i in 1:PR)
    beta[i] ~ normal(0, 2.5); // prior for the slopes

  for (i in 1:PR)
    X_imputed[:,i] ~ normal(mu_XR[i], sd_XR[i]); // model for each column of XR; acts as the prior for the missing entries

  y ~ normal(alpha + X_imputed*beta,sigma);
}

generated quantities {
  vector[N] log_lik; // pointwise log-likelihood of the response
  for (n in 1:N)
    log_lik[n] = normal_lpdf(y[n] | alpha + X_imputed[n]*beta, sigma);
}

Result:

The model runs fine but with a few funny Rhats in X_imputed - I assume this is because I’m not actually sampling those values, just filling them in from the real data. Is this OK? Is this a valid missing data approach?

Many thanks for your help

bgoodri · August 14, 2018, 3:32pm

Quantities that are not random variables, such as the observed entries of X_imputed, do not have meaningful Rhat, etc. I think what you wrote is valid under the missing at random (MAR) assumption, but that assumption is usually dubious.

Ashwani_Jha · August 14, 2018, 4:00pm

Thanks!

So if the mechanism is Missing Not at Random, would one option be to also model the covariance matrix of X_imputed? Or the covariance of [y, X_imputed]?

bgoodri · August 14, 2018, 4:39pm

With MNAR, you can model the probability of missingness, conditional on the imputed / observed values.
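
One possible reading of this suggestion, as a rough sketch only and not necessarily what was meant in the reply: extend the model from the first post with a logistic regression for a missingness indicator whose linear predictor depends on the (imputed or observed) predictor value. The names `is_miss`, `gamma0` and `gamma1`, the logistic form, and the priors on the `gamma` parameters are assumptions introduced here for illustration; they do not appear in the thread. The sketch keeps the array syntax of the original post; newer Stan versions write such declarations as `array[N, PR] int` instead.

```stan
data {
  int<lower=1> N;                          // number of observations
  int<lower=1> PR;                         // number of real predictor columns
  real y[N];                               // response
  matrix[N, PR] XR;                        // predictors, placeholders where missing
  int<lower=0> N_miss_R;                   // number of missing entries
  int<lower=1> XR_miss_ind[N_miss_R, 2];   // (row, column) of each missing entry
  int<lower=0, upper=1> is_miss[N, PR];    // hypothetical: 1 where XR[n, p] is missing
}
parameters {
  real alpha;
  vector[PR] beta;
  real<lower=0> sigma;
  vector[N_miss_R] imputed_reals;
  vector[PR] mu_XR;
  vector<lower=0>[PR] sd_XR;
  vector[PR] gamma0;                       // hypothetical: baseline log-odds of missingness
  vector[PR] gamma1;                       // hypothetical: dependence on the value itself
}
transformed parameters {
  matrix[N, PR] X_imputed = XR;
  for (i in 1:N_miss_R)
    X_imputed[XR_miss_ind[i, 1], XR_miss_ind[i, 2]] = imputed_reals[i];
}
model {
  alpha ~ cauchy(0, 10);
  sigma ~ cauchy(0, 0.25);
  beta ~ normal(0, 2.5);
  gamma0 ~ normal(0, 2.5);                 // assumed priors, not from the thread
  gamma1 ~ normal(0, 2.5);

  for (p in 1:PR) {
    X_imputed[:, p] ~ normal(mu_XR[p], sd_XR[p]);
    for (n in 1:N)
      // probability that entry (n, p) is missing, given its own value
      is_miss[n, p] ~ bernoulli_logit(gamma0[p] + gamma1[p] * X_imputed[n, p]);
  }

  y ~ normal(alpha + X_imputed * beta, sigma);
}
```

With `gamma1` fixed at zero the missingness term no longer involves the imputed values, so the imputation behaves as in the original model. Note that `gamma1` is informed mostly by the distributional assumptions on the predictors rather than directly by the data, so results from a model like this can be sensitive to those assumptions and to the priors.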
