Target with missing values


#1

Hi all,

In the manual 2.17.0, page 180, Section 11.1 (Missing Data), there is the following code:

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
}
parameters {
  real mu;
  real<lower=0> sigma;
  real y_mis[N_mis];
}
model {
  y_obs ~ normal(mu, sigma);
  y_mis ~ normal(mu, sigma);
}

What is the role of y_mis ~ normal(mu, sigma);? I am thinking that this statement is just a prior for y_mis, and that it does not contribute to the likelihood (that is, to the posterior density for mu and sigma).

So what happens conceptually if I remove that line from the Stan code?

Thanks for reading my question!

Trung Dung.


#2

If you remove that line and the declaration of y_mis, then you should get the same posterior for mu and sigma. So this model would be more efficiently implemented with y_mis in the generated quantities block.
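A minimal sketch of that suggestion, assuming the same data block as the manual's example and imputing the missing values with normal_rng after sampling:

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);
}
generated quantities {
  real y_mis[N_mis];
  for (n in 1:N_mis)
    y_mis[n] = normal_rng(mu, sigma);   // impute from the fitted model
}

This gives the same posterior for mu and sigma but is cheaper, because the sampler no longer has to explore N_mis extra dimensions.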

However, in some missing-data problems the missing values do affect the log density, so they can’t be removed. If the log density with the missing data marginalized out is the same up to a constant of proportionality, then removing them won’t affect the posterior.


#3

Thanks Bob,

So I understand that when the missingness is ignorable (MAR, for example), we do not need y_mis or its declaration.

When the missingness is non-ignorable, we need to model y_mis, otherwise we will get biased results?

Do I understand you correctly?

Kind regards,
Trung Dung.


#4

If my understanding above is correct, then I think the following code is correct.

In R I create a missing indicator misIn, which is 1 if y[i] is missing and 0 otherwise.

In Stan code I write

for (i in 1:N) {
  if (misIn[i] == 0)
    y[i] ~ N(mu, sigma);
}

Do you think, @Bob_Carpenter, that this code is correct, and equivalent to your suggestion of

If you remove that and the declaration of y_mis, then you should get the same posterior for mu and sigma.

Thank you for your time!
Trung Dung.


#5

I’d think you probably just want to write

y ~ normal(mu, sigma);

It’s not N (Stan’s distribution is called normal), and you need the whole vector of observed values to help estimate mu and sigma for the missing values. The vectorized statement is also faster than a loop with a branch.
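In R, assuming the data vector y and the indicator misIn from post #4 above (both hypothetical names), one way to pass only the observed values to Stan, so the vectorized statement works, is:

y_obs <- y[misIn == 0]                             # drop the missing entries
stan_data <- list(N_obs = length(y_obs), y_obs = y_obs)

Then the model block reduces to the single line y_obs ~ normal(mu, sigma); with no indicator or branching needed on the Stan side.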