Dear community,
I wanted to get to grips with missing data handling in Stan. The most common scenario for me is missing predictors in a multiple linear regression. After reading the manual, and assuming that the predictors are themselves all real and normally distributed, I wrote this example using Rstan:
XR is the design matrix;
Missing values are replaced by any number and are entered as an indexed list e.g. the 5 missing values here are:
$XR_miss_ind
[,1] [,2]
[1,] 2 1
[2,] 3 1
[3,] 5 2
[4,] 7 2
[5,] 2 3
/*
*Multiple linear regression example with missing predictors
*/
data {
int N; //the number of observations
int PR; //the number of real columns in the predictor matrix
real y[N]; //the response
matrix[N,PR] XR; //the predictor matrix for reals
int N_miss_R; // the number of missing reals
int XR_miss_ind[N_miss_R,2]; // missing real indices
}
parameters {
real alpha; // add intercept
vector[PR] beta; //the regression parameters
real<lower=0> sigma; //the standard deviation
vector[N_miss_R] imputed_reals; // the imputed reals
vector[PR] mu_XR; // mean of XR
vector[PR] sd_XR; //sd of XR
}
transformed parameters {
matrix[N,PR] X_imputed = XR; // imputed X
for ( i in 1:N_miss_R ) { X_imputed[XR_miss_ind[i,1],XR_miss_ind[i,2]]=imputed_reals[i];
} // impute missing X reals
}
model {
alpha ~ cauchy(0,10); //prior for the intercept
sigma ~ cauchy(0, 0.25); // prior for variance
for(i in 1:PR)
beta[i] ~ normal(0,2.5);//prior for the slopes
for(i in 1:PR)
X_imputed[:,i] ~ normal(mu_XR[i], sd_XR[i]); // priors on XR for missing data
y ~ normal(alpha + X_imputed*beta,sigma);
}
generated quantities {
vector[N] log_lik;
for (n in 1:N)
log_lik[n] = normal_lpdf(y[n] | alpha + X_imputed[n]*beta, sigma);
}
Result:
The model runs fine but with a few funny Rhats in X_imputed - I assume this is because I’m not actually sampling those values, just filling them in from the real data. Is this OK? Is this a valid missing data approach?
Many thanks for your help