Missing data imputation

Hello. I am trying to establish an intuition about missing values and imputation, and I need some help formulating the problem.
So, assume we have an observation Y, generated as Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon. A very simple case, with say X_1, X_2 \sim Uniform(2, 5) and \epsilon \sim N(0, 2). Say I have 100 observations. Now I decide to make 10%, 20%, 30%, and 50% of the Y values missing. There are a couple of cases here: the values can be missing at random, or the missingness can be informative, in that, say, I choose to make the missing values come from the top 10% of the data. My intuition is that if I use imputation, the estimates will be better than if I used only the observed values. That is, instead of using the subset Y_{obs}, if I use Y^{*}, which has the missing values of Y imputed, the estimates will be better.

How can I set this simulation up in Stan? I realize there might be many options here, but this is just part of a bigger project, and this would just be to illustrate the effect of imputation. So a simple simulation would do. Thanks in advance.
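To make the setup concrete, here is a sketch of how I imagine the data-generating step could look as a fixed-param Stan program (the names are just my choices, and I am reading the 2 in N(0, 2) as the sd):

```stan
// Sketch of the data-generating process described above.
// Run with algorithm=fixed_param: there are no parameters to sample;
// generated quantities just draws one simulated data set per iteration.
data {
  int<lower=1> N;       // e.g. 100
  vector[2] beta;       // true coefficients, fixed for the simulation
  real<lower=0> sigma;  // e.g. 2, reading N(0, 2) as sd = 2
}
generated quantities {
  matrix[N, 2] X;
  vector[N] y;
  for (n in 1:N) {
    X[n, 1] = uniform_rng(2, 5);
    X[n, 2] = uniform_rng(2, 5);
    y[n] = normal_rng(X[n] * beta, sigma);
  }
}
```

The missingness masks (random, or targeting the top of the Y range) would then be applied to the simulated y outside of Stan before refitting.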


A couple of questions here:

  1. What is fixed and what is a parameter? In the usual notation of linear regression, the betas are parameters and the X’s are data, but you’ve put a distribution on the X’s. Do you just mean that the fixed X’s are approximately uniformly distributed, or do you mean that the X’s are parameters?
  2. Your “informative missing” scenario is complicated. It’s relatively simple if you are missing the Y’s for which the linear predictor is largest, but it’s more complicated if you are missing the largest Y’s, because that will tend to truncate the observed distribution of \epsilon, which will lead to nastiness.

Unless you have some information about the missing Y’s (e.g. an informative prior), you will not get better estimates of the regression parameters by imputing. Follow the information: imputing doesn’t provide more information about the system; it just uses the existing information to make guesses about the Y’s. To put it another way, if imputing could improve inference about the regression parameters, then any arbitrary regression could be “improved” by augmenting it with a bunch of missing data points at arbitrary covariate values and imputing values of the regressand.
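To make that concrete, this is roughly what outcome “imputation” looks like in Stan, with the missing Y’s declared as parameters (variable names and priors here are just placeholders):

```stan
data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  matrix[N_obs, 2] X_obs;   // covariates of the observed cases
  matrix[N_mis, 2] X_mis;   // covariates of the missing cases (assumed known)
  vector[N_obs] y_obs;
}
parameters {
  vector[2] beta;
  real<lower=0> sigma;
  vector[N_mis] y_mis;      // the "imputed" outcomes
}
model {
  beta ~ normal(0, 5);      // placeholder priors
  sigma ~ normal(0, 2);
  y_obs ~ normal(X_obs * beta, sigma);
  y_mis ~ normal(X_mis * beta, sigma);  // integrates to 1 for any (beta, sigma)
}
```

The y_mis statement is a proper density in y_mis for every value of beta and sigma, so it integrates out of the joint posterior: the marginal posterior of beta and sigma is exactly what you would get from fitting y_obs alone.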


Thank you for the reply. A couple of clarifications:

  1. The X’s are to be treated as fixed. As in a simple linear regression, the \beta's are the parameters. The Uniform distribution on the X’s was just to generate the data; they are assumed known and fixed.
  2. For the “informative missing” part, yes, I agree that is a messy scenario. I was able to design the missing-at-random case. When I say informative missing, I mean, for example, that my observations are in the range 0-5 and whatever % is missing comes from the range 0-3. Now it can happen that 10% of 100 is 10 and my observations do not have 10 data points in that range, in which case the missingness spills into the next range, and so forth.

I guess I am looking to either prove or disprove this intuition regarding imputation.

There are two subtleties here:

  1. In the analysis scenario that you mean to simulate, does the analyst know a priori that the missing data are all in the range 0-3? This would allow the analyst to put an informative prior on the missing data, which could improve inference overall. On the other hand, if the missing data are a biased subset of the total data, but the analyst doesn’t know this a priori, then inference on the regression parameters gains nothing from the imputation.
  2. If missingness is directly causally related to the magnitude of Y (and not related to the value of Y merely via the values of the covariates), then the analyst will need to figure out whether the residuals of the observed data nevertheless approximate an unbiased sample from the error distribution or not. In general, the higher the R-squared, the safer it is to neglect this issue (since with a high R-squared the relative values of the Y’s are determined primarily by the covariates and not so much by the errors).

Yes, in the case that I am thinking about, the analyst does know that the missing data are all in a particular range. I guess that falls under the censored data category, and I should have used that framing from the get-go. And yes, that would allow one to place an informative prior on the missing data. My team thinks that placing an informative prior will improve the analysis. I can see why that could be the case in a general statistical sense, and I just wanted to create a quick example to establish that. But I think setting up that simulation is not as simple as I thought.
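For example, something like this is the kind of sketch I have in mind. It follows (if I understand it correctly) the censored-data idiom from the Stan User’s Guide, where the known range is encoded as declared bounds on the missing values; priors are placeholders, and for simplicity it ignores the corresponding selection effect on the observed Y’s raised above:

```stan
data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  matrix[N_obs, 2] X_obs;
  matrix[N_mis, 2] X_mis;
  vector[N_obs] y_obs;
}
parameters {
  vector[2] beta;
  real<lower=0> sigma;
  vector<lower=0, upper=3>[N_mis] y_mis;  // analyst knows the missing Y's lie in [0, 3]
}
model {
  beta ~ normal(0, 5);   // placeholder priors
  sigma ~ normal(0, 2);
  y_obs ~ normal(X_obs * beta, sigma);
  y_mis ~ normal(X_mis * beta, sigma);
  // with the bounds declared above, integrating y_mis out of this model
  // leaves a factor of Pr(0 < Y < 3 | X, beta, sigma) per missing case,
  // so here the missing cases DO carry information about beta and sigma
}
```

Note that this differs from the unconstrained imputation model only in the declared bounds on y_mis, and unlike there, the missing cases now inform beta. Whether that actually tightens the estimates in the simulation is exactly what I want to check.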


Sorry, but I’m not getting something.
If you are missing some values of the output variable Y, then aren’t you really just creating predictions of them using the existing ones?
From my understanding, imputation takes place when some values of X are missing; you can then impute them to improve your estimate of \beta.