Hi folks,
I need help performing missing data imputation in the case of multivariate outcome regression. I have come across some documentation (and McElreath’s Rethinking covers basic cases for missing data), but I can’t seem to figure out how to apply it to the multivariate case.
I am running a multivariate outcome regression model as described in Section 1.15 of the Stan User’s Guide. My data is survey responses from N participants. I have K covariates and am using these to predict L outcomes that I believe are correlated. I am using the non-centered parameterization. See code below.
data {
int<lower=1> N; // number of observations
int<lower=1> K; // number of covariates
int<lower=1> L; // number of outcomes
vector[K] X[N]; // model matrix
vector[L] y[N]; // outcome matrix
}
parameters {
matrix[L, K] beta;
cholesky_factor_corr[L] Omega;
vector<lower=0>[L] sigma;
}
transformed parameters {
matrix[L, L] L_Sigma;
L_Sigma = diag_pre_multiply(sigma, Omega);
}
model {
vector[L] mu[N];
to_vector(beta) ~ normal(0, 5);
sigma ~ exponential(1);
Omega ~ lkj_corr_cholesky(4);
for(i in 1:N) {
mu[i] = beta * X[i];
}
y ~ multi_normal_cholesky(mu, L_Sigma);
}
McElreath provides this sample code in the 2nd ed of Statistical Rethinking for imputing missing data. It’s in the form of a function block at the top of the code.
functions{
vector merge_missing( int[] miss_indexes , vector x_obs , vector x_miss ) {
int N = dims(x_obs)[1];
int N_miss = dims(x_miss)[1];
vector[N] merged;
merged = x_obs;
for ( i in 1:N_miss )
merged[ miss_indexes[i] ] = x_miss[i];
return merged;
}
}
In the model block, he uses this function to merge the imputed values with the observed values. I’ve tried adapting this to the case where X is a matrix, but with no success.
My R code for the data is as follows:
stan_dat1.1 <- list(
N = nrow(dat1),
K = ncol(model_matrix1),
L = ncol(outcome_matrix1),
X = model_matrix1, #matrix object including an intercept term
y = outcome_matrix1 #matrix object where each column is an outcome
)
I want to account for all possible cases of missingness. In other words, missingness could be in the Xs and/or the ys.
Please note that I would prefer to keep my code as general as possible - such that I supply a matrix X of covariates instead of specifying individual covariates. This is because I have multiple models of this form and would prefer to reuse the same Stan code.
Thanks!