I am wondering why the following approach to marginalising out missing data is wrong. Imagine I have some data which is missing, but I know the upper bound on the possible value of each missing point. One way to deal with this is to create a parameter for each of the missing points,
data {
  int<lower=0> N_obs;   // number of observed data points
  int<lower=0> N_mis;   // number of missing data points
  real y_obs[N_obs];    // observed values
  real uppers[N_mis];   // known upper bound for each missing value
}
parameters {
  real mu;
  real<lower=0> sigma;
  real<lower=0, upper=1> y_mis_raw[N_mis];  // each missing value as a fraction of its upper bound
}
transformed parameters {
  real y_mis[N_mis];
  for (i in 1:N_mis)
    y_mis[i] = uppers[i] * y_mis_raw[i];    // so y_mis[i] lies in [0, uppers[i]]
}
model {
  y_obs ~ normal(mu, sigma);
  y_mis ~ normal(mu, sigma);
}
I did, however, think that I could marginalise out each of the missing data points using something like,
data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
  real uppers[N_mis];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);
  for (i in 1:N_mis)
    // increment by log Pr[y_mis[i] < uppers[i]], i.e. integrate the
    // normal density from -infinity up to uppers[i]
    target += normal_lcdf(uppers[i] | mu, sigma);
}
This, however, yields different estimates of the parameters (mu, sigma). Can someone explain what is wrong with this latter approach?
I think it will be the same if you truncate the normal density at uppers under the first approach. The second approach should be more efficient, since it avoids sampling a latent parameter for each missing value.
Ok, thanks. I’ve changed the first part of the code to be,
model {
  y_obs ~ normal(mu, sigma);
  for (i in 1:N_mis)
    y_mis[i] ~ normal(mu, sigma) T[,uppers[i]];  // truncated above at uppers[i]
}
But I still seem to get a difference in estimates between both approaches. Is the above what you had in mind, Ben?
Yeah, I think those correspond to the two approaches in the censored data part of the Stan manual.
Ok, I am still getting differences in the posteriors, even after running 2000 iterations (and Rhat is 1.000 in all cases).
For the augmented model (i.e. explicit parameters for each missing data point),
       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu     0.31    0.01 0.24 -0.09  0.14  0.28  0.45  0.87  1886 1.00
sigma  0.83    0.01 0.23  0.51  0.67  0.79  0.94  1.39  1816 1.00
For the marginalised model,
        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu     -0.18    0.01 0.25 -0.70 -0.33 -0.17 -0.02  0.27  1474    1
sigma   0.77    0.01 0.20  0.48  0.62  0.73  0.87  1.25  1557    1
lp__   -1.96    0.03 1.09 -4.90 -2.35 -1.63 -1.21 -0.92  1178    1
Any ideas why there are such differences?
FYI, I am using the following R code to generate the data,
N_obs <- 10
N_mis <- 10
y_obs <- rnorm(N_obs)         # observed data: standard normal draws
uppers <- rnorm(N_mis, 5, 2)  # known upper bounds for the missing points
In the augmented case, y_mis[i] = uppers[i] * y_mis_raw[i] with y_mis_raw[i] constrained to [0, 1], so each missing value has a lower bound of zero as well as the known upper bound. In the marginalized case, you therefore have to do a two-sided integral over [0, uppers[i]], not a one-sided integral up to uppers[i].
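For concreteness, here is a minimal sketch of what that two-sided marginalisation might look like, assuming uppers[i] > 0 so that the interval [0, uppers[i]] is non-empty; log_diff_exp subtracts the two CDF terms stably on the log scale:
model {
  y_obs ~ normal(mu, sigma);
  for (i in 1:N_mis)
    // log Pr[0 < y_mis[i] < uppers[i]] under normal(mu, sigma),
    // matching the [0, uppers[i]] support of the augmented model
    target += log_diff_exp(normal_lcdf(uppers[i] | mu, sigma),
                           normal_lcdf(0 | mu, sigma));
}
Alternatively, if one-sided censoring is what was actually intended, give each missing value in the augmented model only an upper bound rather than scaling a [0, 1] variable; the two programs should then agree.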