Missing Data Imputation for Multivariate Outcome

Will_Hipson · September 4, 2020, 3:57pm

Hi folks,

I need help performing missing data imputation in the case of multivariate outcome regression. I have come across some documentation (and McElreath’s Rethinking covers basic cases for missing data), but I can’t seem to figure out how to apply it to the multivariate case.

I am running a multivariate outcome regression model as described in Section 1.15 of the Stan User’s Guide. My data is survey responses from N participants. I have K covariates and am using these to predict L outcomes that I believe are correlated. I am using the non-centered parameterization. See code below.

data {
  int<lower=1> N; // number of observations
  int<lower=1> K; // number of covariates
  int<lower=1> L; // number of outcomes
  vector[K] X[N]; // model matrix
  vector[L] y[N]; // outcome matrix
}

parameters {
  matrix[L, K] beta;
  cholesky_factor_corr[L] Omega;
  vector<lower=0>[L] sigma;
}

transformed parameters {
  matrix[L, L] L_Sigma;
  L_Sigma = diag_pre_multiply(sigma, Omega);
}

model {
  vector[L] mu[N];
  
  to_vector(beta) ~ normal(0, 5);
  sigma ~ exponential(1);
  Omega ~ lkj_corr_cholesky(4);
  
  for(i in 1:N) {
    mu[i] = beta * X[i];
  }
  
  y ~ multi_normal_cholesky(mu, L_Sigma);
}

McElreath provides this sample code in the 2nd ed of Statistical Rethinking for imputing missing data. It’s in the form of a function block at the top of the code.

functions{

    vector merge_missing( int[] miss_indexes , vector x_obs , vector x_miss ) {
        int N = dims(x_obs)[1];
        int N_miss = dims(x_miss)[1];
        vector[N] merged;
        merged = x_obs;
        for ( i in 1:N_miss )
            merged[ miss_indexes[i] ] = x_miss[i];
        return merged;
    }
}

In the model block, he uses this function to merge the imputed values with the observed values. I’ve tried adapting this to the case where X is a matrix, but with no success.

My R code for the data is as follows:

stan_dat1.1 <- list(
  N = nrow(dat1),
  K = ncol(model_matrix1), 
  L = ncol(outcome_matrix1), 
  X = model_matrix1, #matrix object including an intercept term
  y = outcome_matrix1 #matrix object where each column is an outcome
)

I want to account for all possible cases of missingness. In other words, missingness could be in the Xs and/or the ys.

Please note that I would prefer to keep my code as general as possible - such that I supply a matrix X of covariates instead of specifying individual covariates. This is because I have multiple models of this form and would prefer to reuse the same Stan code.

Thanks!

Ara_Winter · September 17, 2020, 12:17am

Mostly here to bump this up to the top so folks can see it. Does the non-multivariate in McElreath work for you? I would start there get that up and running.

Will_Hipson · September 17, 2020, 1:19pm

Thanks @Ara_Winter for boosting this.

I managed to find a solution with help from this video: https://www.youtube.com/watch?v=plZ_m-21CYA&t=2122s&ab_channel=MikeLawrence

Here’s my Stan code now:

data {
  int<lower=1> N; // number of observations
  int<lower=1> K; // number of covariates
  int<lower=1> L; // number of outcomes
  int<lower=1> N_miss_x; // number of missings covariate(s)
  int<lower=1> N_miss_y; // number of missings in outcome(s)
  vector[K] X[N]; // model matrix (+ intercept)
  vector[L] y[N]; // outcome matrix
  int miss_indx[N_miss_x, 2]; // matrix of missing indices covariates
  int miss_indy[N_miss_y, 2]; // matrix of missing indices for outcomes
}

parameters {
  matrix[L, K] beta;
  vector[N_miss_x] imputed_x;
  vector[N_miss_y] imputed_y;
  cholesky_factor_corr[L] Omega;
  vector<lower=0>[L] sigma;
}

transformed parameters {
  matrix[L, L] L_Sigma;
  vector[K] X_with_imputed[N];
  vector[L] y_with_imputed[N];
  
  L_Sigma = diag_pre_multiply(sigma, Omega);
  // Mike Lawrence's solution
  X_with_imputed = X;
  for(i in 1:N_miss_x) {
    X_with_imputed[miss_indx[i, 1], miss_indx[i, 2]] = imputed_x[i];
  }
  
  y_with_imputed = y;
  for(i in 1:N_miss_y) {
    y_with_imputed[miss_indy[i, 1], miss_indy[i, 2]] = imputed_y[i];
  }
}

model {
  vector[L] mu[N];
  
  to_vector(beta) ~ normal(0, 5);
  imputed_x ~ normal(0, 1);
  imputed_y ~ normal(0, 1);
  sigma ~ exponential(1);
  Omega ~ lkj_corr_cholesky(4);
  
  for(i in 1:N) {
    mu[i] = beta * X_with_imputed[i];
  }
  y_with_imputed ~ multi_normal_cholesky(mu, L_Sigma);
}

And the data list computed in R:

# Find total missing and their location for X and y:
N_miss_x <- sum(is.na(model_matrix1))
N_miss_y <- sum(is.na(outcome_matrix1))
miss_indx <- which(is.na(model_matrix1), arr.ind = TRUE)
miss_indy <- which(is.na(outcome_matrix1), arr.ind = TRUE)

stan_dat <- list(
  N = nrow(model_matrix1),
  K = ncol(model_matrix1),
  L = ncol(outcome_matrix1),
  N_miss_x = N_miss_x,
  N_miss_y = N_miss_y,
  miss_indx = miss_indx,
  miss_indy = miss_indy,
  X = model_matrix1 %>%
    mutate(across(everything(), ~ifelse(is.na(.), 999, .))) %>% # stan will not accept NAs so using 999 as placeholder.
    as.matrix(),
  y = outcome_matrix1 %>%
    mutate(across(everything(), ~ifelse(is.na(.), 999, .))) %>%
    as.matrix()
)

Note that this only works for continuous X’s and y’s. For missing binary variables, there’s a solution here: https://gist.github.com/rmcelreath/9406643583a8c99304e459e644762f82.

For categorical missings, I found a solution here: Imputation of a 3 category covariate to model a binary outcome

Unfortunately, I found it quickly becomes unwieldy when you have multiple binary/categorical missing variables, and I have not come across a satisfactory solution (don’t even get me started on interaction terms!!). Maybe my dream of a general Stan model that deals with any number of missing variables is wishful thinking!

Ara_Winter · September 17, 2020, 2:51pm

Glad you you got it sorted. The missing data and interactions terms can be rough. Thanks for reporting out what you found. That will be super helpful for folks!

mike-lawrence · September 17, 2020, 5:26pm

FYI, I developed a slightly different approach to missing data imputation that’s detailed here.

Will_Hipson · September 17, 2020, 5:54pm

Thanks @mike-lawrence! I’ll have to take a look at it. Fyi, I’ve been finding your YouTube series helpful. Coming from a psych background myself it’s nice to see everything from a familiar perspective :)

Topic		Replies	Views
Guidelines for Practical Imputation with Stan? Modeling cmdstan , rstan , techniques , specification , missing-data	4	1449	September 6, 2023
Efficiency of multivariate models with partially missing data General techniques , specification , performance	3	468	April 24, 2023
Can't understand an example for handling missing value in rstan Modeling rstan , missing-data	1	824	June 26, 2022
Merging Missing and Observed Data in Regression Model Modeling	15	899	January 22, 2020
Data imputation/missing data in a correlation model Modeling	6	1321	December 28, 2018

Missing Data Imputation for Multivariate Outcome

Related topics