Thanks @Ara_Winter for boosting this.
I managed to find a solution with help from this video: https://www.youtube.com/watch?v=plZ_m-21CYA&t=2122s&ab_channel=MikeLawrence
Here’s my Stan code now:
data {
int<lower=1> N; // number of observations
int<lower=1> K; // number of covariates
int<lower=1> L; // number of outcomes
int<lower=1> N_miss_x; // number of missings covariate(s)
int<lower=1> N_miss_y; // number of missings in outcome(s)
vector[K] X[N]; // model matrix (+ intercept)
vector[L] y[N]; // outcome matrix
int miss_indx[N_miss_x, 2]; // matrix of missing indices covariates
int miss_indy[N_miss_y, 2]; // matrix of missing indices for outcomes
}
parameters {
matrix[L, K] beta;
vector[N_miss_x] imputed_x;
vector[N_miss_y] imputed_y;
cholesky_factor_corr[L] Omega;
vector<lower=0>[L] sigma;
}
transformed parameters {
matrix[L, L] L_Sigma;
vector[K] X_with_imputed[N];
vector[L] y_with_imputed[N];
L_Sigma = diag_pre_multiply(sigma, Omega);
// Mike Lawrence's solution
X_with_imputed = X;
for(i in 1:N_miss_x) {
X_with_imputed[miss_indx[i, 1], miss_indx[i, 2]] = imputed_x[i];
}
y_with_imputed = y;
for(i in 1:N_miss_y) {
y_with_imputed[miss_indy[i, 1], miss_indy[i, 2]] = imputed_y[i];
}
}
model {
vector[L] mu[N];
to_vector(beta) ~ normal(0, 5);
imputed_x ~ normal(0, 1);
imputed_y ~ normal(0, 1);
sigma ~ exponential(1);
Omega ~ lkj_corr_cholesky(4);
for(i in 1:N) {
mu[i] = beta * X_with_imputed[i];
}
y_with_imputed ~ multi_normal_cholesky(mu, L_Sigma);
}
And the data list computed in R:
# Find total missing and their location for X and y:
N_miss_x <- sum(is.na(model_matrix1))
N_miss_y <- sum(is.na(outcome_matrix1))
miss_indx <- which(is.na(model_matrix1), arr.ind = TRUE)
miss_indy <- which(is.na(outcome_matrix1), arr.ind = TRUE)
stan_dat <- list(
N = nrow(model_matrix1),
K = ncol(model_matrix1),
L = ncol(outcome_matrix1),
N_miss_x = N_miss_x,
N_miss_y = N_miss_y,
miss_indx = miss_indx,
miss_indy = miss_indy,
X = model_matrix1 %>%
mutate(across(everything(), ~ifelse(is.na(.), 999, .))) %>% # stan will not accept NAs so using 999 as placeholder.
as.matrix(),
y = outcome_matrix1 %>%
mutate(across(everything(), ~ifelse(is.na(.), 999, .))) %>%
as.matrix()
)
Note that this only works for continuous X’s and y’s. For missing binary variables, there’s a solution here: https://gist.github.com/rmcelreath/9406643583a8c99304e459e644762f82.
For categorical missings, I found a solution here: Imputation of a 3 category covariate to model a binary outcome
Unfortunately, I found it quickly becomes unwieldy when you have multiple binary/categorical missing variables, and I have not come across a satisfactory solution (don’t even get me started on interaction terms!!). Maybe my dream of a general Stan model that deals with any number of missing variables is wishful thinking!