So, a colleague and I have to impute some data, x, given a categorical variable. We arrived at two different approaches:
a) as in the tutorial: split x into x_obs and x_mis, and treat x_mis as parameters. Something like this:
data {
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int J; // number of package types
  real<lower = 0.01> x_obs[N_obs];
  int<lower = 1, upper = J> ptype_obs[N_obs];
  int<lower = 1, upper = J> ptype_mis[N_mis];
}
parameters {
  real<lower = 0.01> x_mis[N_mis];
  vector<lower = 0.01>[J] mu_x;               // vector, not matrix: a matrix type needs two dimensions
  vector<lower = 0.01, upper = 5>[J] sigma_x;
}
transformed parameters {
  vector<lower = 0.01>[N_obs] x_obs_mu;
  vector<lower = 0.01>[N_obs] x_obs_sigma;
  vector<lower = 0.01>[N_mis] x_mis_mu;
  vector<lower = 0.01>[N_mis] x_mis_sigma;
  for (i in 1:N_obs) {
    x_obs_mu[i] = mu_x[ptype_obs[i]];
    x_obs_sigma[i] = sigma_x[ptype_obs[i]];
  }
  for (i in 1:N_mis) {
    x_mis_mu[i] = mu_x[ptype_mis[i]];
    x_mis_sigma[i] = sigma_x[ptype_mis[i]];
  }
}
model {
  x_obs ~ normal(x_obs_mu, x_obs_sigma);
  x_mis ~ normal(x_mis_mu, x_mis_sigma);
}
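(For context, the splitting of x into x_obs and x_mis happens outside Stan. A minimal Python sketch of that data prep; the variable names and the use of None as the missingness marker are illustrative assumptions, not our actual pipeline:)

```python
# Hypothetical data prep for approach a): split x into observed and missing
# parts, keeping the package type aligned with each part.
x = [1.2, None, 0.8, None, 2.1]   # None marks a missing measurement
ptype = [1, 2, 1, 3, 2]           # package type for every row

x_obs = [xi for xi in x if xi is not None]
ptype_obs = [p for xi, p in zip(x, ptype) if xi is not None]
ptype_mis = [p for xi, p in zip(x, ptype) if xi is None]

stan_data = {
    "N_obs": len(x_obs),
    "N_mis": len(ptype_mis),
    "J": max(ptype),
    "x_obs": x_obs,
    "ptype_obs": ptype_obs,
    "ptype_mis": ptype_mis,
}
```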
b) saying "There is some real value, x_real, and both x_obs and x_mis vary around it." Here is sample code:
data {
  int N;
  int J;
  real x[N];
  real volume[N];
  int<lower = 0, upper = 1> x_available[N];
  int<lower = 1, upper = J> ptype[N];
}
parameters {
  vector<lower = 0>[N] x_real;
  vector<lower = 0.1>[J] mu_x_package;
  vector<lower = 0.0>[J] sigma_x_package;
}
transformed parameters {
  vector<lower = 0.01>[N] x_real_mu;
  vector<lower = 0.01>[N] x_real_mu1;
  vector<lower = 0.01>[N] x_real_sigma;
  vector<lower = 0.01>[N] x_real_sigma1;
  for (i in 1:N) {
    if (x_available[i] == 1) {
      x_real_mu[i] = x[i];
      x_real_sigma[i] = 0.1;
    } else {
      x_real_mu[i] = mu_x_package[ptype[i]];
      x_real_sigma[i] = sigma_x_package[ptype[i]];
    }
    x_real_mu1[i] = mu_x_package[ptype[i]];
    x_real_sigma1[i] = sigma_x_package[ptype[i]];
  }
}
model {
  mu_x_package ~ normal(1.2, 0.4);
  sigma_x_package ~ normal(0.5, 0.3);
  x_real ~ normal(x_real_mu, x_real_sigma);
  x_real_mu ~ normal(x_real_mu1, x_real_sigma1);
}
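(Since b) keeps every row in x, the missing entries need a placeholder value alongside the x_available flag; the placeholder is never read when x_available[i] == 0. A hedged Python sketch of that data prep; the placeholder 0.0 and all example values are assumptions:)

```python
# Hypothetical data prep for approach b): missing entries stay in x, flagged
# via x_available; the 0.0 placeholder is ignored by the model when the flag is 0.
x_raw = [1.2, None, 0.8]
ptype = [1, 2, 1]
volume = [10.0, 12.5, 9.8]   # illustrative values

stan_data = {
    "N": len(x_raw),
    "J": max(ptype),
    "x": [xi if xi is not None else 0.0 for xi in x_raw],
    "volume": volume,
    "x_available": [int(xi is not None) for xi in x_raw],
    "ptype": ptype,
}
```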
Now, approach b) is quite inelegant in that there are both x_real_mu and x_real_mu1, and both are necessary for some reason I can't work out. If I remove either one, the model converges quickly but produces nonsensical results.
On the other hand, approach a) is somewhat redundant in that I have to duplicate most of the model code for the observed and missing parts.
Which approach is better and why?