Hi all,
In the Stan manual 2.17.0, page 180, Section 11.1 (Missing Data), there is this code:
data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
}
parameters {
  real mu;
  real<lower=0> sigma;
  real y_mis[N_mis];
}
model {
  y_obs ~ normal(mu, sigma);
  y_mis ~ normal(mu, sigma);
}
What is the role of y_mis ~ normal(mu, sigma); ? I am thinking that statement is just a prior for y_mis, but that it does not contribute to the likelihood (I mean, to the posterior density).
So what happens conceptually if I remove that line from the Stan code?
Thanks for reading my question!
Trung Dung.
If you remove that line and the declaration of y_mis, then you should get the same posterior for mu and sigma. So this model would be implemented more efficiently with y_mis in the generated quantities block.
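That is, something like this (an untested sketch, using normal_rng to draw the missing values as posterior predictive quantities):

```stan
data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);
}
generated quantities {
  // y_mis is no longer a parameter; each draw is a posterior
  // predictive sample, which gives the same marginal distribution
  // as sampling y_mis ~ normal(mu, sigma) in the model block.
  real y_mis[N_mis];
  for (n in 1:N_mis)
    y_mis[n] = normal_rng(mu, sigma);
}
```

Because y_mis drops out of the parameter space, the sampler has fewer dimensions to explore, which is where the efficiency gain comes from.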
However, some missing data problems do affect the log density, so they can’t be removed. If the log density is the same up to a proportionality constant once the missing data are marginalized out, then it won’t affect the posterior.
Thanks Bob,
So I understand that when the missingness is ignorable (e.g., MAR), we do not need y_mis or its declaration.
When the missingness is non-ignorable, we need to model y_mis, otherwise we will get biased results?
Do I understand you correctly?
Kind regards,
Trung Dung.
If my understanding above is correct, then I think the following code is correct.
In R I create a missing indicator misIn, which is 1 if y[i] is missing and 0 otherwise.
In Stan code I write
for (i in 1:N) {
  if (misIn[i] == 0)
    y[i] ~ N(mu, sigma);
}
What do you think, @Bob_Carpenter — is this code correct, and equivalent to your suggestion that “If you remove that and the declaration of y_mis, then you should get the same posterior for mu and sigma”?
Thank you for your time!
Trung Dung.
I’d think you probably just want to write

y ~ normal(mu, sigma);

It’s not N, and you need the observed versions to help estimate mu and sigma for the missing versions.
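A minimal sketch of what that could look like (my assumption of the setup, not code from the thread: the observed values are filtered in R before the data are passed to Stan, so no indicator variable is needed in the model):

```stan
data {
  int<lower=0> N_obs;
  real y[N_obs];  // observed values only, filtered before passing to Stan
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // vectorized sampling statement; the distribution is normal, not N,
  // and the loop over observed indices becomes unnecessary
  y ~ normal(mu, sigma);
}
```

The vectorized statement is both simpler and faster than the indicator loop, and gives the same posterior for mu and sigma.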