Marginalising out missing data

Ben_Lambert · June 22, 2018, 5:41pm

I am wondering why the following approach to marginalise out missing data is wrong. I imagine that I have some data which is missing, although I know the upper possible value of each of the data points. A way to deal with this is to create parameters for each of the missing points,

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
  real uppers[N_mis];
}
parameters {
  real mu;
  real<lower=0> sigma;
  real<lower=0,upper=1> y_mis_raw[N_mis];
}
transformed parameters{
  real y_mis[N_mis];
  for(i in 1:N_mis)
    y_mis[i] = uppers[i] * y_mis_raw[i];
}
model {
  y_obs ~ normal(mu, sigma);
  y_mis ~ normal(mu, sigma);
}

I did, however, think that I could marginalise out each of the missing data points using something like,

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
  real uppers[N_mis];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);
  for(i in 1:N_mis)
    target += normal_lcdf(uppers[i]|mu,sigma);
}

This, however, yields different estimates of the parameters (mu,sigma). Can someone explain to me what is wrong with this latter approach?

bgoodri · June 22, 2018, 6:00pm

I think it will be the same if you truncate the normal density at uppers under the first approach. The second approach should be more efficient.

Ben_Lambert · June 22, 2018, 6:06pm

Ok, thanks. I’ve changed the first part of the code to be,

model {
  y_obs ~ normal(mu, sigma);
  for(i in 1:N_mis)
    y_mis[i] ~ normal(mu, sigma) T[,uppers[i]];
}

But I still seem to get a difference in estimates between both approaches. Is the above what you had in mind, Ben?

bgoodri · June 22, 2018, 6:20pm

Yeah, I think those correspond to the two approaches in the censored data part of the Stan manual.

Ben_Lambert · June 22, 2018, 7:30pm

Ok, I am still getting differences in the posteriors, even after I run 2000 iterations (and Rhat is 1.000 in all cases).

For the augmented (i.e. explicit parameters for each missing data point),

               mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
mu              0.31    0.01 0.24  -0.09   0.14   0.28   0.45   0.87  1886 1.00
sigma           0.83    0.01 0.23   0.51   0.67   0.79   0.94   1.39  1816 1.00

For the marginalised,

       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu    -0.18    0.01 0.25 -0.70 -0.33 -0.17 -0.02  0.27  1474    1
sigma  0.77    0.01 0.20  0.48  0.62  0.73  0.87  1.25  1557    1
lp__  -1.96    0.03 1.09 -4.90 -2.35 -1.63 -1.21 -0.92  1178    1

Any ideas why there are such differences?

FYI, I am using the following R code to generate the data,

N_obs <- 10
N_mis <- 10
y_obs <- rnorm(N_obs)
uppers <- rnorm(N_mis,5,2)

bgoodri · June 22, 2018, 10:42pm

In the augmented case, you have a lower bound of zero, so in the marginalized case, you have to do a two-sided integral.

Topic		Replies	Views
Marginalize out missing continuous data Modeling techniques , specification	7	1154	February 25, 2023
Marginalizing in simplified discrete model Modeling	0	354	March 7, 2022
How are missing and censored multivariate data estimated in Stan? Algorithms	4	1441	March 21, 2018
Target with missing values Modeling	4	647	January 30, 2018
Marginalizing over latent discrete parameters Modeling	20	1281	November 11, 2021

Marginalising out missing data

Related topics