So, a colleague and I have to impute some data, x, given a categorical variable. We arrived at two different approaches:
a) as in the tutorial: split x into x_obs and x_mis, and treat x_mis as parameters. Something like this:
data {
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int J; // number of package types
  real<lower = 0.01> x_obs[N_obs];
  int<lower = 1, upper = J> ptype_obs[N_obs];
  int<lower = 1, upper = J> ptype_mis[N_mis];
}
parameters {
  real<lower = 0.01> x_mis[N_mis];
  vector<lower = 0.01>[J] mu_x;               // vector, not matrix: a matrix type needs two dimensions
  vector<lower = 0.01, upper = 5>[J] sigma_x;
}
transformed parameters {
  vector<lower = 0.01>[N_obs] x_obs_mu;
  vector<lower = 0.01>[N_obs] x_obs_sigma;
  vector<lower = 0.01>[N_mis] x_mis_mu;
  vector<lower = 0.01>[N_mis] x_mis_sigma;
  for (i in 1:N_obs) {
    x_obs_mu[i] = mu_x[ptype_obs[i]];
    x_obs_sigma[i] = sigma_x[ptype_obs[i]];
  }
  for (i in 1:N_mis) {
    x_mis_mu[i] = mu_x[ptype_mis[i]];
    x_mis_sigma[i] = sigma_x[ptype_mis[i]];
  }
}
model {
  x_obs ~ normal(x_obs_mu, x_obs_sigma);
  x_mis ~ normal(x_mis_mu, x_mis_sigma);
}
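(For context, the splitting of x into x_obs and x_mis happens outside Stan. A minimal Python sketch of that data prep; the variable names and the use of None as the missingness marker are illustrative assumptions, not our actual pipeline:)

```python
# Hypothetical data prep for approach a): split x into observed and missing
# parts, keeping the package type aligned with each part.
x = [1.2, None, 0.8, None, 2.1]   # None marks a missing measurement
ptype = [1, 2, 1, 3, 2]           # package type for every row

x_obs = [xi for xi in x if xi is not None]
ptype_obs = [p for xi, p in zip(x, ptype) if xi is not None]
ptype_mis = [p for xi, p in zip(x, ptype) if xi is None]

stan_data = {
    "N_obs": len(x_obs),
    "N_mis": len(ptype_mis),
    "J": max(ptype),
    "x_obs": x_obs,
    "ptype_obs": ptype_obs,
    "ptype_mis": ptype_mis,
}
```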
b) saying "There is some real value, x_real, and both x_obs and x_mis vary around it." Here is sample code:
data {
  int N;
  int J;
  real x[N];
  real volume[N];
  int<lower = 0, upper = 1> x_available[N];
  int<lower = 1, upper = J> ptype[N];
}
parameters {
  vector<lower = 0>[N] x_real;
  vector<lower = 0.1>[J] mu_x_package;
  vector<lower = 0.0>[J] sigma_x_package;
}
transformed parameters {
  vector<lower = 0.01>[N] x_real_mu;
  vector<lower = 0.01>[N] x_real_mu1;
  vector<lower = 0.01>[N] x_real_sigma;
  vector<lower = 0.01>[N] x_real_sigma1;
  for (i in 1:N) {
    if (x_available[i] == 1) {
      x_real_mu[i] = x[i];
      x_real_sigma[i] = 0.1;
    } else {
      x_real_mu[i] = mu_x_package[ptype[i]];
      x_real_sigma[i] = sigma_x_package[ptype[i]];
    }
    x_real_mu1[i] = mu_x_package[ptype[i]];
    x_real_sigma1[i] = sigma_x_package[ptype[i]];
  }
}
model {
  mu_x_package ~ normal(1.2, 0.4);
  sigma_x_package ~ normal(0.5, 0.3);
  x_real ~ normal(x_real_mu, x_real_sigma);
  x_real_mu ~ normal(x_real_mu1, x_real_sigma1);
}
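(Since b) keeps every row in x, the missing entries need a placeholder value alongside the x_available flag; the placeholder is never read when x_available[i] == 0. A hedged Python sketch of that data prep; the placeholder 0.0 and all example values are assumptions:)

```python
# Hypothetical data prep for approach b): missing entries stay in x, flagged
# via x_available; the 0.0 placeholder is ignored by the model when the flag is 0.
x_raw = [1.2, None, 0.8]
ptype = [1, 2, 1]
volume = [10.0, 12.5, 9.8]   # illustrative values

stan_data = {
    "N": len(x_raw),
    "J": max(ptype),
    "x": [xi if xi is not None else 0.0 for xi in x_raw],
    "volume": volume,
    "x_available": [int(xi is not None) for xi in x_raw],
    "ptype": ptype,
}
```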
Now, approach b) is quite inelegant in that there are both x_real_mu and x_real_mu1, and both are necessary for some reason I can't work out. If I remove either one, the model converges quickly but produces nonsensical results.
On the other hand, approach a) is somewhat redundant in that I have to duplicate most of the model code for the observed and missing parts.
Which approach is better and why?