Others have expressed interest in Bayesian models that account for data missing not at random (MNAR, where the probability that Y is missing depends on the value of Y) both here on Discourse (see here, here, here, and here) and in recent publications/blogs (see Du et al. 2021 and here). The Heckman model solves this issue by modifying the likelihood for Y to account for the selection bias (the correlation between Y and the latent probability of observing Y), while Du and colleagues directly impute the latent probability of observing Y.
My (probably naïve) question is whether we can avoid the tricks presented above by modelling the selection equations as-is. That is, do Bayesian models allow us to directly model a) the missing values of Y and b) the effect of Y on observing Y in such a way that recovers the correct value of the effect of X on Y? Given that no one (to my knowledge) has taken this approach, I assume there is something fundamentally wrong with it. But it’s not obvious to me why that would be the case, so I’m curious what others think.
We want to model f(Y|X) but have to contend with g(Z|Y,X) (where Z is the missing data indicator), meaning we have to model the joint distribution h(Y,Z|X). The selection model approach factorizes the joint distribution as h(Y,Z|X) = f(Y|X) g(Z|Y,X), which recovers the model of interest (f(Y|X)) while adjusting for the selection process (g(Z|Y,X)). If we assume f() is linear, we have \hat{Y} = \alpha_y + X \beta_{xy}. The missing indicator, Z, is binary, so say we use a logistic model: \hat{Z} = logit^{-1}(\alpha_z + X \beta_{xz} + Y \beta_{yz}). If we use only the complete cases (dropping cases missing Y), then we have no information to estimate \beta_{yz} and, in fact, no variation in Z. A natural way of handling this in a Bayesian model is to treat the missing values of Y and \beta_{yz} as parameters (as in the model below). Alternatively, we can constrain \beta_{yz} to a particular value and conduct a sensitivity analysis to see how much \beta_{xy} depends on the selection assumption.
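Just to make the likelihood explicit, the joint model per observation that the Stan program below implements is

h(Y_i, Z_i | X_i) = Normal(Y_i | \alpha_y + X_i \beta_{xy}, \sigma_y) \times Bernoulli(Z_i | logit^{-1}(\alpha_z + X_i \beta_{xz} + Y_i \beta_{yz}))

where, for cases with Z_i = 0, Y_i enters as a parameter (Y_miss) rather than as data, so the posterior for \beta_{xy} marginalizes over the unknown missing values.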
data{
  int N;                                  // Number of observations
  vector[N] X;
  vector[N] Y;                            // Y, with placeholder values where missing
  array[N] int<lower = 0, upper = 1> Z;   // Observed? (1 = observed, 0 = missing)
  int O;                                  // Number of missing observations
  array[O] int<lower = 1, upper = N> I;   // Indices of the missing observations
}
parameters{
  // Model for Y
  real alpha_y;
  real beta_xy;
  real<lower = 0> sigma_y;
  vector[O] Y_miss;                       // Missing values of Y, treated as parameters
  // Model for Z
  real alpha_z;
  real beta_xz;
  real beta_yz;
}
model{
  // Fill in estimates of the missing values
  vector[N] Y_full = Y;
  Y_full[I] = Y_miss;
  // Priors
  alpha_y ~ normal(0, 10);
  beta_xy ~ normal(0, 10);
  sigma_y ~ normal(0, 10);
  alpha_z ~ normal(0, 2.5);
  beta_xz ~ normal(0, 2.5);
  beta_yz ~ normal(0, 2.5);
  // Likelihood
  Y_full ~ normal(alpha_y + X * beta_xy, sigma_y);
  Z ~ bernoulli_logit(alpha_z + X * beta_xz + Y_full * beta_yz);
}
The code and plot below show the performance of estimating \beta_{xy} (\rho, in the code) from 10 unique draws of the parameters, with 1,000 replications from each parameter set. “Complete” refers to fitting the model to the full data, before any values are deleted. “Optimum” refers to the model where I pass the true value of \beta_{yz}. “Prior” freely estimates \beta_{yz} but sets the mean of its prior to half of the true value. “Freely estimate” does the same thing but sets the prior mean to 0. And “Listwise” deletes all cases missing Y.
# Simulated data
library(ggplot2)
mvrnorm <- MASS::mvrnorm
nobs <- 250
rho <- runif(1, -1, 1)        # Effect of X on Y (correlation)
alpha_z <- runif(1, -2, 2)    # Intercept of the selection model
beta_yz <- runif(1, -5, 5)    # Effect of Y on the probability of being observed
xy_mat <- mvrnorm(nobs, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), nrow = 2))
xy_df <- data.frame(x = xy_mat[, 1],
                    y = xy_mat[, 2],
                    z = rbinom(nobs, 1, prob = plogis(alpha_z + xy_mat[, 2] * beta_yz)))
xy_df$y_obs <- ifelse(xy_df$z == 1, xy_df$y, 0) # Treat missing Y as 0 (placeholder; imputed in the model)
xy_df$z_lab <- c('Missing', 'Observed')[xy_df$z + 1] # Labels for plotting; keep z numeric for Stan
ggplot(xy_df, aes(y, group = z_lab, fill = z_lab)) +
  geom_histogram(position = 'identity', alpha = 0.5, color = rgb(0, 0, 0, 0)) +
  theme_bw() +
  scale_fill_brewer(palette = 'Set1')
model_list <- list(
  N = nrow(xy_df),
  Y = xy_df$y_obs,
  X = xy_df$x,
  Z = xy_df$z,
  O = nrow(xy_df) - sum(xy_df$z),
  I = which(xy_df$z == 0)
)
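The full replication code isn’t shown here, but for anyone who wants to try it, a minimal sketch of fitting the model above to one simulated dataset with cmdstanr might look like this (the file name selection_model.stan is just a placeholder for wherever you save the Stan program):

library(cmdstanr)
# Compile the selection model (assumes the Stan program above is saved to this file)
mod <- cmdstan_model("selection_model.stan")
# Fit to one simulated dataset and inspect the coefficients
fit <- mod$sample(data = model_list, chains = 4, parallel_chains = 4, refresh = 0)
fit$summary(variables = c("beta_xy", "beta_yz"))
# Compare against the true values used to simulate the data
c(rho = rho, beta_yz = beta_yz)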
The model seems to work in fairly simple tests across a variety of values for the model parameters. Obviously, if you constrain \beta_{yz} to the correct value, it performs well. That seems to support the idea of using this model for sensitivity analyses (which are common in MNAR settings). But even when \beta_{yz} is freely estimated, the model seems to perform well: wider credible intervals, but good coverage and lower bias and RMSE than listwise deletion.
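For the sensitivity-analysis version, a minimal sketch (assuming a modified Stan file, say selection_model_fixed.stan, that moves beta_yz from the parameters block to the data block) could be:

# Hypothetical sensitivity analysis: refit with beta_yz fixed at several values
mod_fixed <- cmdstanr::cmdstan_model("selection_model_fixed.stan")
beta_yz_grid <- c(-2, -1, 0, 1, 2)
sens <- lapply(beta_yz_grid, function(b) {
  fit_b <- mod_fixed$sample(data = c(model_list, list(beta_yz = b)),
                            chains = 4, parallel_chains = 4, refresh = 0)
  fit_b$summary(variables = "beta_xy")
})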
I want to believe that handling MNAR data with Bayesian models can be this simple, but I’m hesitant to rely on it without further confirmation. If anyone has some insight, I’d appreciate hearing it. In the meantime, I’ll look for some published cases to compare it to.
