Using two models to estimate the missing data in an independent variable

JinghongZeng · August 10, 2022, 9:27am

Hi,

The response variable is y. The independent variable x has both observations x_{obs} and missing data x_{mis} . I am wondering if I can use two models that involve the same missing data x_{mis}. The first model is the distribution of the independent variable itself. The second model is a linear model y \sim x. Below is the example code.

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  vector[N_obs] y_obs1;
  vector[N_obs] x_obs;
  vector[N_mis] y_obs2;
}
parameters {
  real mu;
  real<lower=0> sigma1;
  vector[N_mis] x_mis;

  vector[2] b;
  real<lower=0> sigma2;
}
model {
  mu ~ N(0, 1); 
  sigma1 ~ N(0, 1) T[0, ];
  x_mis ~ N(0, 1);
  x_obs ~ N(mu, sigma1);
  x_mis ~ N(mu, sigma1); // first model for x_mis

  b ~ N(0, 1);
  sigma2 ~ N(0, 1) T[0, ];

  for(n in 1: N_obs) {
     y_obs1[n] ~ N(b[1]+b[2]*x_obs[n], sigma2);
  }
 
  for(n in 1: N_mis) {
     y_obs2[n] ~ N(b[1]+b[2]*x_mis[n], sigma2); // second model for x_mis
  }
}

The model can be fitted. But I don’t understand why Stan can use two models to estimate the same missing data. And I’m also wondering if the model is valid. Any advice?

Bob_Carpenter · August 10, 2022, 8:24pm

As written, this isn’t a well-formed Stan program: Stan spells the normal distribution normal, not N.

I think in the above model you do not want x_mis ~ normal(0, 1). If the data distribution is normal(mu, sigma1), then you just want x_mis ~ normal(mu, sigma1).

By adding two priors, you get the product in Stan. So your version looks like this:

p(x_mis) \propto normal(x_mis | 0, 1) * normal(x_mis | mu, sigma1).

So it’s well-formed in Stan, but just not what you want for missing data imputation.
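To see what that product does, hold mu and sigma1 fixed and treat it as a function of a single element x of x_mis. The following is a sketch using the standard result for a product of two normal densities (it is not stated in the thread); the symbols \mu and \sigma_1 stand for mu and sigma1:

```latex
\begin{aligned}
\mathrm{normal}(x \mid 0, 1)\,\mathrm{normal}(x \mid \mu, \sigma_1)
  &\propto \exp\!\left(-\tfrac{1}{2}x^2 - \tfrac{1}{2\sigma_1^2}(x - \mu)^2\right) \\
  &\propto \mathrm{normal}\!\left(x \,\Big|\, \frac{\mu}{1 + \sigma_1^2},\ \frac{\sigma_1}{\sqrt{1 + \sigma_1^2}}\right)
\end{aligned}
```

So, under these assumptions, the extra normal(0, 1) statement pulls each imputed value toward 0 and makes its distribution narrower than normal(mu, sigma1), which is why it is not what you want for imputation.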

Also, you don’t need the truncation on the sigma1 and sigma2 sampling statements: the arguments of those distributions are constants, so the truncation only adds a constant to the log density.
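Putting these points together, a minimal sketch of the original program might look like the following. It assumes the same data layout as the question, spells the distribution normal, drops the extra normal(0, 1) statement on x_mis, and drops the T[0, ] truncation; the vectorized regression statements are a stylistic choice here, not something the thread requires.

```stan
data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  vector[N_obs] y_obs1;   // responses paired with observed x
  vector[N_obs] x_obs;
  vector[N_mis] y_obs2;   // responses paired with missing x
}
parameters {
  real mu;
  real<lower=0> sigma1;
  vector[N_mis] x_mis;    // missing x values, treated as parameters
  vector[2] b;
  real<lower=0> sigma2;
}
model {
  // priors; the <lower=0> constraints make these half-normal for the scales
  mu ~ normal(0, 1);
  sigma1 ~ normal(0, 1);
  b ~ normal(0, 1);
  sigma2 ~ normal(0, 1);

  // one distribution for x, shared by observed and missing values
  x_obs ~ normal(mu, sigma1);
  x_mis ~ normal(mu, sigma1);

  // regression of y on x, using x_mis where x was not observed
  y_obs1 ~ normal(b[1] + b[2] * x_obs, sigma2);
  y_obs2 ~ normal(b[1] + b[2] * x_mis, sigma2);
}
```

Here x_mis appears in two statements on purpose: once as the outcome of the model for x and once as a predictor of y_obs2, so both pieces of information inform its posterior.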

You can always simulate data where you know the answer, then remove some of the x values and see how well the model imputes them.
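One way to run that check is to simulate from the same model in Stan. The sketch below is an assumption about how one might do it: the true parameter values are passed in as data, and the name x_mis_true is hypothetical (it holds the values you would hide before fitting). A program without a parameters block like this is meant to be run with the fixed_param algorithm.

```stan
data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real mu;                 // known "true" values chosen by you
  real<lower=0> sigma1;
  vector[2] b;
  real<lower=0> sigma2;
}
generated quantities {
  vector[N_obs] x_obs;
  vector[N_mis] x_mis_true;  // hypothetical name: kept aside, not given to the fit
  vector[N_obs] y_obs1;
  vector[N_mis] y_obs2;
  for (n in 1:N_obs) {
    x_obs[n] = normal_rng(mu, sigma1);
    y_obs1[n] = normal_rng(b[1] + b[2] * x_obs[n], sigma2);
  }
  for (n in 1:N_mis) {
    x_mis_true[n] = normal_rng(mu, sigma1);
    y_obs2[n] = normal_rng(b[1] + b[2] * x_mis_true[n], sigma2);
  }
}
```

Take one draw, fit the imputation model to x_obs, y_obs1, and y_obs2 only, and compare the posterior for x_mis against x_mis_true.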

JinghongZeng · August 11, 2022, 1:55am

Thanks Bob for your helpful answers. I’ll check the missing data issue further.

JinghongZeng · August 11, 2022, 7:09am

I now have a clear idea.

It’s extremely helpful that x_mis ~ normal(mu, sigma1) can be regarded as a prior. I tried a model with one parameter that had only a prior; it fitted, and the draws looked like data generated from that prior.
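A minimal sketch of that kind of experiment, assuming a single hypothetical parameter theta with a standard normal prior and no data at all:

```stan
parameters {
  real theta;             // hypothetical parameter, no likelihood attached
}
model {
  theta ~ normal(0, 1);   // the prior is the only term in the target density
}
```

With no likelihood term, the posterior is the prior, so the draws of theta are draws from normal(0, 1). An x_mis element with no matching y would behave the same way, conditional on mu and sigma1.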

I agree that truncation is not necessary. I feel that a truncated distribution is only needed when the bounds can produce 0 log probability.
