Missing Data Imputation for Multivariate Outcome

Thanks @Ara_Winter for boosting this.

I managed to find a solution with help from this video: https://www.youtube.com/watch?v=plZ_m-21CYA&t=2122s&ab_channel=MikeLawrence

Here’s my Stan code now:

data {
  int<lower=1> N; // number of observations
  int<lower=1> K; // number of covariates
  int<lower=1> L; // number of outcomes
  int<lower=1> N_miss_x; // number of missings covariate(s)
  int<lower=1> N_miss_y; // number of missings in outcome(s)
  vector[K] X[N]; // model matrix (+ intercept)
  vector[L] y[N]; // outcome matrix
  int miss_indx[N_miss_x, 2]; // matrix of missing indices covariates
  int miss_indy[N_miss_y, 2]; // matrix of missing indices for outcomes
}

parameters {
  matrix[L, K] beta;
  vector[N_miss_x] imputed_x;
  vector[N_miss_y] imputed_y;
  cholesky_factor_corr[L] Omega;
  vector<lower=0>[L] sigma;
}

transformed parameters {
  matrix[L, L] L_Sigma;
  vector[K] X_with_imputed[N];
  vector[L] y_with_imputed[N];
  
  L_Sigma = diag_pre_multiply(sigma, Omega);
  // Mike Lawrence's solution
  X_with_imputed = X;
  for(i in 1:N_miss_x) {
    X_with_imputed[miss_indx[i, 1], miss_indx[i, 2]] = imputed_x[i];
  }
  
  y_with_imputed = y;
  for(i in 1:N_miss_y) {
    y_with_imputed[miss_indy[i, 1], miss_indy[i, 2]] = imputed_y[i];
  }
}

model {
  vector[L] mu[N];
  
  to_vector(beta) ~ normal(0, 5);
  imputed_x ~ normal(0, 1);
  imputed_y ~ normal(0, 1);
  sigma ~ exponential(1);
  Omega ~ lkj_corr_cholesky(4);
  
  for(i in 1:N) {
    mu[i] = beta * X_with_imputed[i];
  }
  y_with_imputed ~ multi_normal_cholesky(mu, L_Sigma);
}

And the data list computed in R:

# Find total missing and their location for X and y:
N_miss_x <- sum(is.na(model_matrix1))
N_miss_y <- sum(is.na(outcome_matrix1))
miss_indx <- which(is.na(model_matrix1), arr.ind = TRUE)
miss_indy <- which(is.na(outcome_matrix1), arr.ind = TRUE)

stan_dat <- list(
  N = nrow(model_matrix1),
  K = ncol(model_matrix1),
  L = ncol(outcome_matrix1),
  N_miss_x = N_miss_x,
  N_miss_y = N_miss_y,
  miss_indx = miss_indx,
  miss_indy = miss_indy,
  X = model_matrix1 %>%
    mutate(across(everything(), ~ifelse(is.na(.), 999, .))) %>% # stan will not accept NAs so using 999 as placeholder.
    as.matrix(),
  y = outcome_matrix1 %>%
    mutate(across(everything(), ~ifelse(is.na(.), 999, .))) %>%
    as.matrix()
)

Note that this only works for continuous X’s and y’s. For missing binary variables, there’s a solution here: https://gist.github.com/rmcelreath/9406643583a8c99304e459e644762f82.

For categorical missings, I found a solution here: Imputation of a 3 category covariate to model a binary outcome

Unfortunately, I found it quickly becomes unwieldy when you have multiple binary/categorical missing variables, and I have not come across a satisfactory solution (don’t even get me started on interaction terms!!). Maybe my dream of a general Stan model that deals with any number of missing variables is wishful thinking!

4 Likes