Simple selection model for MNAR data

Others have expressed interest in Bayesian models that account for data missing not at random (MNAR, where the probability of missing Y is dependent on the values of Y) both here on Discourse (sere here, here, here, and here) and in recent publications/blogs (see Du et al 2021 and here). The Heckman model solves this issue by modifying the likelihood for Y to account for the selection bias (the correlation between Y and the latent probability of observing Y) while Du and colleagues directly impute the latent probability of observing Y.

My (probably naïve) question is whether we can avoid the tricks presented above by modelling the selection equations as-is. That is, do Bayesian models all us to directly model a) the missing values of Y and b) the effect of Y on observing Y in such a way that recovers the correct value of the effect of X on Y? Given no one (to my knowledge) has taken this approach, I assume that these is something fundamentally wrong with it. But it’s not obvious to me why that would be the case, so I’m curious what others think.

We want to model f(Y|X) but have to contend with g(Z|Y,X) (where Z is the missing data indicator), meaning we have to model the joint distribution h(Y,Z|X). The selection model approach factorizes the joint distribution as h(Y,Z|X) = f(Y|X) g(Z|Y,X), which recovers the model of interest (f(Y|X)) while adjusting for the selection process (g(Z|Y,X)). Say we assume f() is linear, we have \hat{Y} = \alpha_y + X \beta_{xy}. The missing indicator, Z, is binary, so say we use a logistic model, \hat{Z} = logit^{-1}(\alpha_z + X \beta_{xz} + Y \beta_{yz}). If we use the complete data (dropping cases missing Y), then we have no information to estimate \beta_{yz} and, in fact, no variation on Z. A natural way of handling this in a Bayesian model is to treat missing Y and \beta_{yz} as parameters (as in the model below). Alternatively, we can constrain \beta_{yz} to a particular value and conduct a sensitivity test to see how much \beta_{xy} is dependent on the selection assumption.

    int N;

    vector[N] X;
    vector[N] Y;
    array[N] int<lower = 0, upper = 1> Z; // Observed?

    int O; // Number of missing observations
    array[O] int<lower = 1, upper = N> I; // Missing indicators
    // Model for Y
    real alpha_y;
    real beta_xy;
    real<lower = 0> sigma_y;

    vector[O] Y_miss;

    // Model for Z
    real alpha_z;
    real beta_xz;
    real beta_yz;
    // Fill-in estimates of missing values
    vector[N] Y_full = Y;
    Y_full[I] = Y_miss;

    // Priors
    alpha_y ~ normal(0, 10);
    beta_xy ~ normal(0, 10);
    sigma_y ~ normal(0, 10);

    alpha_z ~ normal(0, 2.5);
    beta_xz ~ normal(0, 2.5);
    beta_yz ~ normal(0, 2.5);

    // Likelihood
    Y_full ~ normal(alpha_y + X * beta_xy, sigma_y);
    Z ~ bernoulli_logit(alpha_z + X * beta_xz + Y_full * beta_yz);

The code and plot below shows the performance of estimating \beta_{xy} (\rho, in the code) from 10 unique draws of the parameters, 1000 replications from each parameter set. “Complete” refers to using the complete data. “Optimum” refers to the model where I pass the true value of \beta_{yz}. “Prior” freely estimates \beta_{yz}, but sets the mean of the prior to half of the true value. “Freely estimate” does the same thing but sets the prior mean to 0. And “Listwise” deletes all cases missing Y.

# Simulated data
mvrnorm <- MASS::mvrnarm

nobs <- 250
rho <- runif(1, -1, 1)
alpha_z <- runif(1, -2, 2)
beta_yz <- runif(1, -5, 5)

xy_mat <- mvrnorm(nobs, mu = c(0,0), Sigma = matrix(c(1, rho, rho, 1), nrow = 2))
xy_df <- data.frame(x = xy_mat[,1],
                    y = xy_mat[,2],
                    z = rbinom(nobs, 1, p = plogis(alpha_z + xy_mat[,2] * beta_yz)))

xy_df$y_obs <- ifelse(xy_df$z == 1, xy_df$y, 0) # Treat missing Y as 0
xy_df$z <- c('Missing', 'Observed')[xy_df$z+1]

ggplot(xy_df, aes(y, group = z, fill = z)) +
  geom_histogram(position = 'identity', alpha = 0.5, color = rgb(0,0,0,0)) +
  theme_bw() +
  scale_fill_brewer(palette = 'Set1')

model_list <- list(
  N = nrow(xy_df),
  Y = xy_df$y_obs,
  X = xy_df$x,
  Z = xy_df$z,

  O = nrow(xy_df) - sum(xy_df$z),
  I = (1:nrow(xy_df))[xy_df$z == 0],

The model seems to work in fairly simple tests across a variety of values for the model parameters. Obviously, if you constrain \beta_{yz} to be the correct value, it performs well. That seems to support the idea of using this model for sensitivity analyses (which are common in MNAR settings). But, even when freely-estimated, the model seems to perform well: wider confidence intervals, but good coverage and lower bias and RMSE than listwise deletion.

I want to believe that handling MNAR data with Bayesian models can be this simple, but I’m hesitant before further confirmation. If anyone has some insight, I’d appreciate hearing it. In the meantime, I’ll look for some published cases to compare it to.


Hi, I agree your model looks reasonable. I however don’t see it in such a stark contrast to previous work - notably, the Du et al. model seems almost identical. The only differences I see in Du et al. is that they

a) use a custom Gibbs sampler which they derive (which you don’t need to as Stan is likely to work quite well for this problem) and
b) also explicitly include the “propensities” for missingness (\mathbf{r^*} in the paper) in the model. From briefly reading the paper I am not sure what they gain by this, it appears those can be easily marginalized out as you do. Maybe those are needed to make the Gibbs sampler work, but that’s just a guess.

More generally, I think a potential caveat for this class of approaches is that you still need to assume that the covariates are not missing at all (or missing at random) and you introduce additional assumptions about the mechanism for missingness which might be hard to verify/justify. So it can likely be useful in specific cases, but is also likely not a definitive solution for many/most cases…

Does that make sense?

Good luck with further work on the model!

1 Like

Thank you for the feedback, @martinmodrak . This idea has been rolling around in my head for a while now and just thought it better to throw it out there, so I appreciate the feedback on a more abstract, untested idea. I’m reassured to see that your initial perspective is similar to my own, notably

  1. The model looks reasonable (whether it performs well is another story)
  2. It’s conceptually similar to the model described by Du et al, though (hopefully) more efficient by not having to sample the latent probabilities

Thanks for making this point. I need to look a bit closer at how existing models for MNAR data work, not just their assumptions on the face of it. It’s possible this explicit assumption in my model is stricter than other forms.

And to make something clearer about the motivation, the main appeal of such a model is that it’s really easy to tack on to other models. Say you already have a linear model up-and-running in Stan but become concerned about the missing data process. You could (in theory) just add the model for the missing data indicator and otherwise keep the same model code, assumptions, etc… You wouldn’t need to reformulate the model as in the Heckman model, nor would you need to sample the latent probabilities as in Du et al.'s.