I’m a little confused by something in the missing data documentation.

In Section 3.1, the missing data are generated as parameters. Section 3.5 is similar in many ways to Section 3.1, but extends it to a multivariate case where some of each variable is missing. However, Section 3.5 does not seem to use the same approach as Section 3.1 in terms of generating missing data as parameters. If Section 3.5 were set up the same way as Section 3.1, then the initial for loop presented would need to be adjusted to something like the following,

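Roughly like this (a sketch only, not working code from the doc; I'm writing y_complete for the merged observed-plus-parameter values, and the conditional quantities come from the standard bivariate normal formulas):

```stan
model {
  for (i in 1:N) {
    // conditional distribution of y2 given y1, derived from mu and Sigma
    real B_2_given_1 = Sigma[2, 1] / Sigma[1, 1];
    real mu_2_given_1 = mu[2] + B_2_given_1 * (y_complete[i, 1] - mu[1]);
    real sigma_2_given_1 = sqrt(Sigma[2, 2] - B_2_given_1 * Sigma[2, 1]);
    y_complete[i, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
    y_complete[i, 2] ~ normal(mu_2_given_1, sigma_2_given_1);
  }
}
```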
where mu_2_given_1, B_2_given_1, sigma_2_given_1, and so on are given by the conditional multivariate normal distribution.

This would generate the missing y conditional on the available y. It seems as if this approach is not taken. Footnote two in Section 3.5 mentions that in a multivariate regression problem with missing predictors, the missing data would need to be generated as parameters, but I'm still not entirely sure why that is not needed here. Section 3.1 doesn't have any predictors and still represents the left-hand-side missing data as parameters.

In Section 3.5 the missing values have been "integrated out". Section 3.1 doesn't need to represent the missing data either. If Section 3.5 were written the same way as Section 3.1, it would look like this:

data {
  int<lower=0> N;
  vector[2] y[N];
  int<lower=0, upper=1> y_observed[N, 2];
}
parameters {
  vector[2] mu;
  cov_matrix[2] Sigma;
  vector[N - sum(y_observed[, 1])] missing_1;
  vector[N - sum(y_observed[, 2])] missing_2;
}
transformed parameters {
  // merge observed data and missing-data parameters into one array
  vector[2] y_complete[N];
  {
    int k1 = 1;
    int k2 = 1;
    for (i in 1:N) {
      if (y_observed[i, 1]) {
        y_complete[i, 1] = y[i, 1];
      } else {
        y_complete[i, 1] = missing_1[k1];
        k1 += 1;
      }
      if (y_observed[i, 2]) {
        y_complete[i, 2] = y[i, 2];
      } else {
        y_complete[i, 2] = missing_2[k2];
        k2 += 1;
      }
    }
  }
}
model {
  y_complete ~ multi_normal(mu, Sigma);
}

In what sense does Section 3.1 not need to represent the missing data? For instance, if I estimate it with wildly different ratios of observed to missing data (say 0.01, 0.1, 1, 10, 100), should the estimated parameters be roughly the same in every case? What about the sum of the log-likelihood?

Also, my sense was that what you are suggesting is equivalent to what I was suggesting, though I did not create a full example. The difference is that you allow Stan to estimate missing_1 and missing_2, while I more directly get their distributions from the conditional multivariate normal (extracted from the mu and Sigma parameters). I think that’s the same thing, but I can’t say for sure.
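Writing out the standard bivariate normal conditioning formula (for the case where y_1 is observed and y_2 is missing), the conditional distribution I had in mind is

$$
p(y_2 \mid y_1, \mu, \Sigma)
= \mathcal{N}\!\left(y_2 \,\middle|\, \mu_2 + \frac{\Sigma_{21}}{\Sigma_{11}}\,(y_1 - \mu_1),\;
\Sigma_{22} - \frac{\Sigma_{21}^2}{\Sigma_{11}}\right).
$$

Since this conditional is proportional to the joint density as a function of y_2, letting Stan sample missing_2 under the multi_normal statement should target the same distribution.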

The model in Section 3.1 is so simple it can be solved analytically. The posterior is centered around the mean of the observed data, and its variance is proportional to the variance of the observed data. The missing data can be imputed simply by drawing new predictions from the posterior.
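Concretely, in the univariate model of Section 3.1 each missing value's likelihood contribution integrates to one, so it drops out of the posterior entirely:

$$
\int \mathcal{N}(y^{\mathrm{mis}}_i \mid \mu, \sigma)\, dy^{\mathrm{mis}}_i = 1
\quad\Longrightarrow\quad
p(y \mid \mu, \sigma) = \prod_{i \in \mathrm{obs}} \mathcal{N}(y_i \mid \mu, \sigma).
$$

So the posterior for mu and sigma depends only on the observed values, no matter how much data is missing; the missing-data parameters just get sampled from the current normal(mu, sigma) as a by-product.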

The situation described in the footnote is when your model includes statements like y ~ normal(a + b * x, s), where x is (partially missing) data. Then you cannot just draw a posterior prediction; the model must estimate x directly.
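A minimal sketch of that situation (hypothetical variable names, not code from the doc): the missing x values have to be declared as parameters and given their own distribution, since otherwise nothing ties them down and the regression likelihood cannot be evaluated.

```stan
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;                          // missing entries hold arbitrary placeholders
  int<lower=0, upper=1> x_observed[N];
}
parameters {
  real a;
  real b;
  real<lower=0> s;
  real mu_x;
  real<lower=0> s_x;
  vector[N - sum(x_observed)] x_mis;    // missing predictors as parameters
}
model {
  int k = 1;
  for (i in 1:N) {
    real x_i;
    if (x_observed[i]) {
      x_i = x[i];
    } else {
      x_i = x_mis[k];
      k += 1;
    }
    x_i ~ normal(mu_x, s_x);            // x needs its own model to inform x_mis
    y[i] ~ normal(a + b * x_i, s);      // regression uses observed or estimated x
  }
}
```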

I’ll try to clarify in the doc. I was writing all that as I was learning how to deal with missing data BUGS-style. Here’s the issue; if someone else wants to update the doc, I’d be happy to review and merge, but I assigned it to myself so I won’t forget to do it:

I have a quick question regarding missing data handling in Stan. I have read the respective chapters in the user guide, but I am still unsure how to replicate BUGS’s handling of missing data in Stan. I am aware that Stan cannot handle NA values and that you have to model them explicitly. I am currently in the process of converting BUGS code for a Bayesian meta-analytic SEM approach outlined in Ke et al., 2019. In their paper they state that they used a data augmentation approach realized by Gibbs sampling. I guess that’s just the usual way BUGS handles missing data; is this correct? Could somebody please point me to an example, piece of code, or reference where the BUGS handling of missing data is replicated in Stan?

@ckoenig, I think you’ll be better served by posting your question as a new topic (fewer people read old threads).

Also, adding formulas from the paper, direct link to the paper and the BUGS code could help people help you more effectively (my hunch is that “data augmentation approach realized by Gibbs sampling” could mean many different things, but I have little knowledge of the topic, so I might be wrong).

Hi Martin, thanks for your response! I didn’t want to open a new topic, since my question is not related to a specific model, it was more on a conceptual level. The model itself runs fine in Stan, and thanks to a video by Mike Lawrence the missing data handling works properly. :)