Missing Data Documentation

I’m a little confused by something in the missing data documentation.

In Section 3.1, the missing data is being generated as parameters. Section 3.5 is similar in many ways to Section 3.1, but extends it to a multivariate case where some of each variable is missing. However, Section 3.5 does not seem to use the same approach as Section 3.1 in terms of generating missing data as parameters. If Section 3.5 were set up the same as Section 3.1, then the initial for loop presented would need to be adjusted to something like below

for (n in 1:N) {
    if (y_observed[n, 1] && y_observed[n, 2]) {
        y[n] ~ multi_normal(mu, Sigma);
    } else if (y_observed[n, 1]) {
        y[n, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
        y[n, 2] ~ normal(mu_2_given_1 + B_2_given_1 * y[n, 1], sigma_2_given_1));
    } else if (y_observed[n, 2]) {
        y[n, 1] ~ normal(mu_1_given_2 + B_1_given_2 * y[n, 2], sigma_1_given_2));
        y[n, 2] ~ normal(mu[2], sqrt(Sigma[2, 2]));
}
}

where mu_2_given_1, B_2_given_1, sigma_2_given1 and such as given by the conditional multivariate normal distribution.

This would generate the missing y conditional on the available y. It seems as if this approach is not taken. Footnote two in Section 3.5 mentions that in a multivariate regression problem with missing predictors, then the missing data would need to be generated as parameters, but I’m still not entirely sure why that is not needed here. Section 3.1 doesn’t have any predictors and still represents the left-hand side missing data as parameters.

Can anyone explain this to me?

In section 3.5 the missing values have been “integrated out”. Section 3.1 doesn’t need to represent the missing data either. If 3.5 were written the same way as section 3.1 it would look like this

data {
  int N;
  vector[2] y[N];
  int<lower=0, upper=1> y_observed[N, 2];
}
parameters {
  vector[2] mu;
  cov_matrix[2] Sigma;
  vector[N-sum(y_observed[,1])] missing_1;
  vector[N-sum(y_observed[,2])] missing_2;
}
transformed parameters {
  vector[2] y_complete[N];
  {
    int k1 = 1;
    int k2 = 1;
    for (i in 1:N) {
      if (y_observed[i,1]) {
        y_complete[i,1] = y[i,1];
      } else {
        y_complete[i,1] = missing_1[k1];
        k1 += 1;
      }
      if (y_observed[i,2]) {
        y_complete[i,2] = y[i,2];
      } else {
        y_complete[i,2] = missing_2[k2];
        k2 += 1;
      }
    }
  }
}
model {
  y_complete ~ multi_normal(mu, Sigma);
}
2 Likes

I appreciate the reply.

In what sense does Section 3.1 not need to represent the missing data? For instance, like if I estimate it with the ratio of observed to missing data in wildly different amounts (like 0.01, 0.1, 1, 10, 100), should the estimated parameters be roughly the same in every case? What about the sum of the log-likelihood?

Also, my sense was that what you are suggesting is equivalent to what I was suggesting, though I did not create a full example. The difference is that you allow Stan to estimate missing_1 and missing_2, while I more directly get their distributions from the conditional multivariate normal (extracted from the mu and Sigma parameters). I think that’s the same thing, but I can’t say for sure.

Yes it is the same thing.

The model in section 3.1 is so simple it can be solved analytically. The posterior is centered around the mean of the observed data and it’s variance is proportional to the variance of the observed data. The missing data can be imputed simply by drawing new predictions from the posterior.

The situation described in the footnote is when your models includes statements like y ~ normal(a+b*x, s) where x is (partially missing) data. Then you cannot just draw a posterior prediction, the model must estimate x directly.

3 Likes

Thanks!

I’ll try to clarify in the doc. I was writing all that as I was learning how to deal with missing data BUGS-style. Here’s the issue and if someone else wants to update the doc, I’d be happy to review and merge, but I assigned it to myself so I won’t forget to do it: