 # Missing Data Imputation for Multivariate Outcome

Hi folks,

I need help performing missing data imputation in the case of multivariate outcome regression. I have come across some documentation (and McElreath’s Rethinking covers basic cases for missing data), but I can’t seem to figure out how to apply it to the multivariate case.

I am running a multivariate outcome regression model as described in Section 1.15 of the Stan User’s Guide. My data is survey responses from N participants. I have K covariates and am using these to predict L outcomes that I believe are correlated. I am using the non-centered parameterization. See code below.

``````data {
int<lower=1> N; // number of observations
int<lower=1> K; // number of covariates
int<lower=1> L; // number of outcomes
vector[K] X[N]; // model matrix
vector[L] y[N]; // outcome matrix
}

parameters {
matrix[L, K] beta;
cholesky_factor_corr[L] Omega;
vector<lower=0>[L] sigma;
}

transformed parameters {
matrix[L, L] L_Sigma;
L_Sigma = diag_pre_multiply(sigma, Omega);
}

model {
vector[L] mu[N];

to_vector(beta) ~ normal(0, 5);
sigma ~ exponential(1);
Omega ~ lkj_corr_cholesky(4);

for(i in 1:N) {
mu[i] = beta * X[i];
}

y ~ multi_normal_cholesky(mu, L_Sigma);
}
``````

McElreath provides this sample code in the 2nd ed of Statistical Rethinking for imputing missing data. It’s in the form of a function block at the top of the code.

``````functions{

vector merge_missing( int[] miss_indexes , vector x_obs , vector x_miss ) {
int N = dims(x_obs);
int N_miss = dims(x_miss);
vector[N] merged;
merged = x_obs;
for ( i in 1:N_miss )
merged[ miss_indexes[i] ] = x_miss[i];
return merged;
}
}
``````

In the model block, he uses this function to merge the imputed values with the observed values. I’ve tried adapting this to the case where X is a matrix, but with no success.

My R code for the data is as follows:

``````stan_dat1.1 <- list(
N = nrow(dat1),
K = ncol(model_matrix1),
L = ncol(outcome_matrix1),
X = model_matrix1, #matrix object including an intercept term
y = outcome_matrix1 #matrix object where each column is an outcome
)
``````

I want to account for all possible cases of missingness. In other words, missingness could be in the Xs and/or the ys.

Please note that I would prefer to keep my code as general as possible - such that I supply a matrix X of covariates instead of specifying individual covariates. This is because I have multiple models of this form and would prefer to reuse the same Stan code.

Thanks!

Mostly here to bump this up to the top so folks can see it. Does the non-multivariate in McElreath work for you? I would start there get that up and running.

Thanks @Ara_Winter for boosting this.

I managed to find a solution with help from this video: https://www.youtube.com/watch?v=plZ_m-21CYA&t=2122s&ab_channel=MikeLawrence

Here’s my Stan code now:

``````data {
int<lower=1> N; // number of observations
int<lower=1> K; // number of covariates
int<lower=1> L; // number of outcomes
int<lower=1> N_miss_x; // number of missings covariate(s)
int<lower=1> N_miss_y; // number of missings in outcome(s)
vector[K] X[N]; // model matrix (+ intercept)
vector[L] y[N]; // outcome matrix
int miss_indx[N_miss_x, 2]; // matrix of missing indices covariates
int miss_indy[N_miss_y, 2]; // matrix of missing indices for outcomes
}

parameters {
matrix[L, K] beta;
vector[N_miss_x] imputed_x;
vector[N_miss_y] imputed_y;
cholesky_factor_corr[L] Omega;
vector<lower=0>[L] sigma;
}

transformed parameters {
matrix[L, L] L_Sigma;
vector[K] X_with_imputed[N];
vector[L] y_with_imputed[N];

L_Sigma = diag_pre_multiply(sigma, Omega);
// Mike Lawrence's solution
X_with_imputed = X;
for(i in 1:N_miss_x) {
X_with_imputed[miss_indx[i, 1], miss_indx[i, 2]] = imputed_x[i];
}

y_with_imputed = y;
for(i in 1:N_miss_y) {
y_with_imputed[miss_indy[i, 1], miss_indy[i, 2]] = imputed_y[i];
}
}

model {
vector[L] mu[N];

to_vector(beta) ~ normal(0, 5);
imputed_x ~ normal(0, 1);
imputed_y ~ normal(0, 1);
sigma ~ exponential(1);
Omega ~ lkj_corr_cholesky(4);

for(i in 1:N) {
mu[i] = beta * X_with_imputed[i];
}
y_with_imputed ~ multi_normal_cholesky(mu, L_Sigma);
}
``````

And the data list computed in R:

``````# Find total missing and their location for X and y:
N_miss_x <- sum(is.na(model_matrix1))
N_miss_y <- sum(is.na(outcome_matrix1))
miss_indx <- which(is.na(model_matrix1), arr.ind = TRUE)
miss_indy <- which(is.na(outcome_matrix1), arr.ind = TRUE)

stan_dat <- list(
N = nrow(model_matrix1),
K = ncol(model_matrix1),
L = ncol(outcome_matrix1),
N_miss_x = N_miss_x,
N_miss_y = N_miss_y,
miss_indx = miss_indx,
miss_indy = miss_indy,
X = model_matrix1 %>%
mutate(across(everything(), ~ifelse(is.na(.), 999, .))) %>% # stan will not accept NAs so using 999 as placeholder.
as.matrix(),
y = outcome_matrix1 %>%
mutate(across(everything(), ~ifelse(is.na(.), 999, .))) %>%
as.matrix()
)
``````

Note that this only works for continuous X’s and y’s. For missing binary variables, there’s a solution here: https://gist.github.com/rmcelreath/9406643583a8c99304e459e644762f82.

For categorical missings, I found a solution here: Imputation of a 3 category covariate to model a binary outcome

Unfortunately, I found it quickly becomes unwieldy when you have multiple binary/categorical missing variables, and I have not come across a satisfactory solution (don’t even get me started on interaction terms!!). Maybe my dream of a general Stan model that deals with any number of missing variables is wishful thinking!

1 Like

Glad you you got it sorted. The missing data and interactions terms can be rough. Thanks for reporting out what you found. That will be super helpful for folks!

FYI, I developed a slightly different approach to missing data imputation that’s detailed here.

2 Likes

Thanks @mike-lawrence! I’ll have to take a look at it. Fyi, I’ve been finding your YouTube series helpful. Coming from a psych background myself it’s nice to see everything from a familiar perspective :)

1 Like