# Missing value imputation - which approach is better?

So, me and a colleague have to impute some data, x, given a categorical variable. We arrived at two different approaches:

a) as in the tutorial: split x into x_obs and x_mis, and treat x_mis as parameters. Something like this:

``````data {
int<lower=0> N_obs;
int<lower=0> N_mis;
int J; // number of package types
real<lower = 0.01> x_obs[N_obs];
int<lower = 1, upper = J> ptype_obs[N_obs];
int<lower = 1, upper = J> ptype_mis[N_mis];
}
parameters {
real<lower = 0.01> x_mis[N_mis];
matrix<lower = 0.01>[J] mu_x;
matrix<lower = 0.01, upper=5>[J] sigma_x;
}

transformed parameters {
vector<lower = 0.01>[N_obs] x_obs_mu;
vector<lower = 0.01>[N_obs] x_obs_sigma;
vector<lower = 0.01>[N_mis] x_mis_mu;
vector<lower = 0.01>[N_mis] x_mis_sigma;

for(i in 1:N_obs){
x_obs_mu[i]=mu_x[ptype_obs[i]];
x_obs_sigma[i]=sigma_x[ptype_obs[i]];
}
for(i in 1:N_mis){
x_mis_mu[i]=mu_x[ptype_mis[i]];
x_mis_sigma[i]=sigma_x[ptype_mis[i]];
}
}

model {
x_obs ~ normal(x_obs_mu,x_obs_sigma);
x_mis ~ normal(x_mis_mu,x_mis_sigma);
}
``````

b) saying "There is some real value, x_real, and both x_obs and x_mis tend around it. Here is sample code:

``````data {
int N;
int J;
real x[N];
real volume[N];
int<lower = 0, upper = 1> x_available[N];
int<lower = 1, upper = J> ptype[N];
}
parameters {
vector<lower = 0>[N] x_real;
vector<lower = 0.1>[J] mu_x_package;
vector<lower = 0.0>[J] sigma_x_package;
}

transformed parameters {
vector<lower = 0.01>[N] x_real_mu;
vector<lower = 0.01>[N] x_real_mu1;
vector<lower = 0.01>[N] x_real_sigma;
vector<lower = 0.01>[N] x_real_sigma1;

for(i in 1:N){

if(x_available[i]==1){
x_real_mu[i]=x[i];
x_real_sigma[i]=0.1;
}else{
x_real_mu[i]=mu_x_package[ptype[i]];
x_real_sigma[i]=sigma_x_package[ptype[i]];
}
x_real_mu1[i]=mu_x_package[ptype[i]];
x_real_sigma1[i]=sigma_x_package[ptype[i]];
}
}

model {
mu_x_package~normal(1.2,0.4);
sigma_x_package~normal(0.5,0.3);
x_real~normal(x_real_mu,x_real_sigma);
x_real_mu~normal(x_real_mu1,x_real_sigma1);
}
``````

Now, approach b) is very inelegant in that there is x_real_mu and x_real_mu1, and both are necessary for some reason, which I just can’t understand why. If I remove either one, the model converges quickly, but produces nonsensical results.

On the other hand, approach a) is somewhat redundant in that I have to duplicate most of my work.

Which approach is better and why?

What you label (1) is how to code the variables. What you’re calling (2) provides one model for the missing variables. Without a proper prior, as in (2), the missig values can produce improper posteriors, from which it is not possible (or even well defined) to sample.